translatewiki.net recently had a long downtime. According to our external monitor, uptimerobot.com, we were down starting Tue, May 16, 2017 at 3:00 PM and back up Wed, May 17, 2017 at 4:18 PM for a downtime of 25 hours, 17 minutes and 28 seconds.
Both translatewiki.net servers were down initially: web1 (the nginx server) and es (the Elasticsearch/development server). After 40 minutes, es was back up (Tue, May 16, 2017 at 3:43 PM), but web1 remained down. It turned out that web1 was no longer being given an IPv4 address by the DHCP server, and communication over IPv6 was not possible either.
All times UTC+2
Tuesday:
- 14:54: Netcup reports es and web1 are down ("Meldung über Ausfall Ihres vServers", i.e. "notification about an outage of your vServer").
- 15:00: uptimerobot.com reports es and web1 down.
- 15:43: uptimerobot.com reports es up.
- 18:09: @siebrand creates a ticket with Netcup through https://ccp.netcup.net/.
- 19:24: Support replies the server is up.
- 20:51: @siebrand replies: Yes, the server is up, but as said, we are not being issued an IPv4 address! We are also not able to communicate using IPv6 on the node.
- 21:16: Netcup support replies that the issue has been escalated to another group.
- 21:27: Netcup asks us to run diagnostics: check with MTR (my traceroute) or WinMTR whether packet loss occurs at any hop along the route, performing at least 500 pings, and boot the server into the rescue system; if that operation takes longer than 5 minutes, the cause is in the network.
- 21:31: @siebrand replies: This doesn't appear to be very helpful. We can clearly see there is no DHCPOFFER.
- 21:34: @siebrand provides screenshots of the root consoles with the output of "dhclient -v", which shows that es obtains a lease while web1 fails with "no DHCPOFFER" (see the command sketch after the Tuesday timeline).
- 21:43: T165539 ("translatewiki.net times out") is reported.
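For reference, the checks requested at 21:27 and the diagnostic run at 21:34 amount roughly to the commands below. This is only a sketch: the hostname and the interface name are placeholders, not our actual configuration.

  # Packet-loss check along the route with at least 500 cycles, as requested by Netcup.
  # "web1.translatewiki.net" is a placeholder hostname.
  mtr --report --report-cycles 500 web1.translatewiki.net

  # On the web1 root console: verbosely request a DHCP lease on the public
  # interface (interface name is an assumption) to see whether a DHCPOFFER arrives.
  dhclient -v eth0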
Wednesday:
- 09:43: @Nikerabbit replies: I tried to boot into the rescue system, but it also failed because it had no network connection.
- 10:10: Netcup support asks if they can boot into rescue mode.
- 10:13: @Nikerabbit replies: Sure, go ahead if it helps you resolve this issue.
- 12:43: Netcup replies: the issue has been solved. The server is now again getting DHCP leases.
- 13:08: @siebrand replies: Thank you for looking into this. We can access the server again. Can you please provide some details as to what was going on? At the moment we experience a very slow server still, which is atypical. Are there any residual issues after the hardware outage of yesterday?
- 13:30?: @Nikerabbit adds a custom error page to nginx instead of the default 50x error pages. It does not contain any information specific to this incident.
- 13:42: @siebrand replies: Something is still really wrong! dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=dsync: 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 783.949 s, 1.4 MB/s. Please update us on the situation.
- 13:49: Netcup replies: we are experiencing high I/O on the host system right now; the issue should be fixed during the day.
- 14:01: @siebrand replies: Isn't this a bit weird? Who's taking such a huge part of the hosts resources that our (and possibly other) nodes hardly function? What can be done against that, in absence of a better word, abuse?
- 14:15: Netcup replies: we actually are fixing the issue. Please be patient.
- 14:52: @siebrand replies: Can you please provide us with an update? Our environment has now effectively been down for almost a day. We can be patient, but providing details about what we should be waiting for will certainly help. Otherwise the only option we have is to complain and look for hosting elsewhere because of a lack of transparency.
- 15:00: Nginx on web1 is responding to web requests again, but still very slowly because of degraded disk I/O performance. It took about 2 hours to get to this level.
- 15:01: Netcup replies: the issue is still in progress. We provide all the information necessary, but we cannot give a status update e.g. every 15 minutes. Please be patient, the issue should be solved in a short amount of time.
- 16:17: @Nikerabbit restarts hhvm because it is still trying to process the long backlog of requests that have already timed out (see the command sketch after this timeline).
- 16:18: uptimerobot.com reports translatewiki.net is back up.
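For reference, the disk check from 13:42 and the recovery step from 16:17 correspond roughly to the commands below. This is a sketch: the dd invocation is the one quoted above, and restarting hhvm via systemctl is an assumption about how the service is managed on web1.

  # Rough write-throughput check on the root filesystem (from the 13:42 update).
  dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=dsync

  # Restart hhvm to drop the backlog of requests that had already timed out (16:17).
  systemctl restart hhvm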
Additionally:
- We found out that the passphrase for the backup recovery GPG key was known only to @Nikerabbit, and that he had forgotten it. The keys had been updated, but @siebrand had never been given the new keys or the passphrase. This meant that backups would not have been available. Later, @Nikerabbit found out where he had stored the passphrase. There was also confusion about which GPG keys were the right ones (old ones were still around), and the restoration script had not been updated for the current backup locations etc. In addition, even after these changes the restoration script did not work, because decryption failed with a cryptic error message (see the restore-drill sketch below).
- Initially we thought the issue was related to the hhvm and hhvm-development services repeatedly failing to start up. The service files had recently been updated, but we had not noticed that they did not work from a clean state after a reboot. The service files were updated again after the incident to resolve this (see the unit-check sketch below).
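One way to catch problems like the forgotten passphrase and the broken restoration script earlier would be a periodic restore drill along these lines. This is a sketch only: the backup path and archive format are placeholders, not our actual backup layout.

  # Decrypt the newest backup and list its contents without writing them to disk.
  # /backups/latest.tar.gz.gpg is a placeholder path; a gzipped tar archive is assumed.
  gpg --decrypt /backups/latest.tar.gz.gpg | tar -tzf - > /dev/null \
    && echo "restore test OK" \
    || echo "RESTORE TEST FAILED"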
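Similarly, the broken hhvm and hhvm-development unit files could have been caught before a real reboot with a static check plus a post-reboot status check, roughly as sketched below (assuming the unit files live in /etc/systemd/system).

  # Static sanity check of the unit files.
  systemd-analyze verify /etc/systemd/system/hhvm.service /etc/systemd/system/hhvm-development.service

  # After a test reboot, confirm both services actually came up.
  systemctl is-active hhvm hhvm-development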
We should hold a post-mortem on this outage and see if and where we can improve. Questions that may be relevant:
- What did we do right to fix this outage?
- What did we do wrong to fix this outage?
- Are there things we don't yet understand regarding this outage?
More specifically:
- What could we have done to reduce downtime?
- Could we have responded differently to Netcup support, so they could have better helped us?
- How can we ensure backups are secure and accessible to at least two server maintainers?
- How can we ensure that the console root password is known to at least two server maintainers? We found out that neither @siebrand nor @Nikerabbit knew the root password (or could find it quickly enough), so an additional server restart was needed to change the password.
- What is a reasonable and timely way of communicating downtime for translatewiki.net? We first announced it on Twitter (@translatewiki) at 12:09 on Wednesday (more than 20 hours into the downtime).
- Could we benefit from a cold stand-by?
- Should we improve our failover and redundancy in general?
- Should we communicate the Netcup-relevant outcomes of this session to them, and if so, what do we hope to get out of that?
This session is scheduled for Wikimedia Hackathon 2017 in room Powidl on Saturday from 09:15 to 10:00.