times out
Closed, Resolved (Public)

Description

The site has been timing out over both HTTP and HTTPS for hours (first observed around 14:00-15:00 UTC).

Event Timeline

Tacsipacsi triaged this task as Unbreak Now! priority. May 16 2017, 9:44 PM

@Nikerabbit said they're aware of the problem and that the server isn't connected to the Internet.

Around 1300Z our main server became unreachable, and a few minutes later our secondary server did too. At 1343Z the secondary server came back up, but the primary did not. We can access the console via our control panel. At first we saw hhvm failing to start, but after disabling it we noticed that networking does not come up. Running dhclient -v directly shows DHCPDISCOVERs going out but no replies coming in. The configuration of our two servers is pretty much the same, so we filed a support request with the hosting provider.
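For anyone repeating this check, here is a minimal Python sketch of the "requests go out, nothing comes back" test. It assumes the verbose DHCP client output has been captured as text and that each log line names the DHCP message type (DHCPDISCOVER, DHCPOFFER, and so on) as a separate word. The function name `summarize_dhcp_log` and the returned keys are made up for illustration, and this is not the script that was used during the outage.

```python
# Message types the client sends and the server answers with (standard DHCP names).
REQUEST_TYPES = {"DHCPDISCOVER", "DHCPREQUEST"}
REPLY_TYPES = {"DHCPOFFER", "DHCPACK", "DHCPNAK"}


def summarize_dhcp_log(text):
    """Count outgoing requests and incoming replies in captured DHCP client output.

    Assumption: the message type appears as its own whitespace-separated word
    on each relevant line. Words are matched exactly, so a summary line that
    mentions "DHCPOFFERS" is not counted as a reply.
    """
    sent = 0
    received = 0
    for line in text.splitlines():
        for word in line.split():
            if word in REQUEST_TYPES:
                sent += 1
                break
            if word in REPLY_TYPES:
                received += 1
                break
    return {
        "sent": sent,
        "received": received,
        # True only when requests were logged and no reply ever showed up.
        "unanswered": sent > 0 and received == 0,
    }
```

If `unanswered` comes back true while an identically configured host gets replies, the problem is more likely upstream of the machine than in its own configuration, which matches the reasoning for filing a support request.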

We could temporarily use our secondary server, but it doesn't have the production database, so we would need to restore it from backups and put it in read-only mode, hoping we can still get the main server back online.

Maybe a read-only server with a sitenotice would be better if it doesn’t need much work so that users could get some information about what’s happening.

> so that users could get some information about what’s happening.

For that, the domain could "just" redirect here. I don't know whether it would be advisable though.

Nikerabbit claimed this task.

We are back online, with some slowness. We have identified multiple action items based on this experience.

There was another similar outage today, though recovery was fast this time because there was no issue with the disks. Because this happened in the middle of the night, and because we haven't yet done most of the planned follow-ups, there wasn't much we could have done ourselves.

I forgot to add: none of our custom services came up automatically after the reboot, even though they had been enabled with systemctl enable, both by puppet and by me manually. Should be fixed by
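A post-reboot check along these lines could catch services that are enabled but not running. This is only a sketch: it relies on `systemctl is-enabled UNIT` and `systemctl is-active UNIT` printing a one-word state, the helper names are hypothetical, and the unit names passed in would have to be replaced with the real custom services, which are not listed in this task.

```python
import subprocess


def systemctl_state(verb, unit, run=subprocess.run):
    """Return the state word printed by `systemctl <verb> <unit>`.

    `verb` is expected to be "is-enabled" or "is-active". The `run` parameter
    exists only so the command runner can be swapped out in tests.
    """
    result = run(["systemctl", verb, unit], capture_output=True, text=True, check=False)
    # Empty output (for example, an unknown unit) is reported as "unknown".
    return result.stdout.strip() or "unknown"


def find_stopped_services(units, run=subprocess.run):
    """List (unit, enabled_state, active_state) for every unit that is not active."""
    problems = []
    for unit in units:
        enabled = systemctl_state("is-enabled", unit, run)
        active = systemctl_state("is-active", unit, run)
        if active != "active":
            problems.append((unit, enabled, active))
    return problems
```

Calling `find_stopped_services([...])` with the list of custom units shortly after boot returns an empty list when everything is up. An entry like `("example.service", "enabled", "inactive")` (hypothetical unit name) would reproduce the symptom described above: enabled, yet not started.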