translatewiki.net recently had a long downtime. According to our external monitor, uptimerobot.com, we were down starting Tue, May 16, 2017 at 3:00 PM and back up Wed, May 17, 2017 at 4:18 PM for a downtime of 25 hours, 17 minutes and 28 seconds.
Both translatewiki.net servers were down initially: web1 (the nginx server) and es (the Elasticsearch/development server). After 40 minutes, es was back up (Tue, May 16, 2017 at 3:43 PM), but web1 remained down. It turned out that web1 was no longer being given an IPv4 address by the DHCP server, and communication over IPv6 was not possible either.
All times UTC+2
Tuesday:
- 14:54: Netcup reports es and web1 are down ("Meldung über Ausfall Ihres vServers", i.e. "notification about an outage of your vServer").
- 15:00: uptimerobot.com reports es and web1 down.
- 15:43: uptimerobot.com reports es up.
- 18:09: @siebrand creates a ticket with Netcup through https://ccp.netcup.net/.
- 19:24: Support replies the server is up.
- 20:51: @siebrand replies: Yes, the server is up, but as said, we are not being issued an IPv4 address! We are also not able to communicate using IPv6 on the node.
- 21:16: Netcup support replies that the issue has been escalated to another group.
- 21:27: Netcup asks us to run diagnostics: check with MTR (my traceroute) or WinMTR whether packet loss occurs at any hop along the route, performing at least 500 pings, and boot the server into the rescue system; if that operation takes longer than 5 minutes, the cause is in the network.
- 21:31: @siebrand replies: This doesn't appear to be very helpful. We can clearly see there is no DHCPOFFER.
- 21:34: @siebrand provides screenshots of the root consoles with the output of "dhclient -v", which shows that es obtains a lease while web1 fails with "no DHCPOFFER" (see the command sketch after the Tuesday timeline).
- 21:43: T165539 ("translatewiki.net times out") is reported.
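For reference, the checks requested at 21:27 and the diagnostic run at 21:34 amount roughly to the commands below. This is only a sketch: the hostname and the interface name are placeholders, not our actual configuration.

  # Packet-loss check along the route with at least 500 cycles, as requested by Netcup.
  # "web1.translatewiki.net" is a placeholder hostname.
  mtr --report --report-cycles 500 web1.translatewiki.net

  # On the web1 root console: verbosely request a DHCP lease on the public
  # interface (interface name is an assumption) to see whether a DHCPOFFER arrives.
  dhclient -v eth0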
Wednesday:
- 09:43: @Nikerabbit replies: I tried to boot into the rescue system, but it also failed because it had no network connection.
- 10:10: Netcup support asks if they can boot into rescue mode.
- 10:13: @Nikerabbit replies: Sure, go ahead if it helps you resolve this issue.
- 12:43: Netcup replies: the issue has been solved. The server is now again getting DHCP leases.
- 13:08: @siebrand replies: Thank you for looking into this. We can access the server again. Can you please provide some details as to what was going on? At the moment we experience a very slow server still, which is atypical. Are there any residual issues after the hardware outage of yesterday?
- 13:30?: @Nikerabbit adds a custom error page to nginx instead of the default 50x error pages. It does not contain any information specific to this incident.
- 13:42: @siebrand replies: Something is still really wrong! dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=dsync: 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 783.949 s, 1.4 MB/s. Please update us on the situation.
- 13:49: Netcup replies: we are experiencing high I/O on the host system right now; the issue should be fixed during the day.
- 14:01: @siebrand replies: Isn't this a bit weird? Who's taking such a huge part of the hosts resources that our (and possibly other) nodes hardly function? What can be done against that, in absence of a better word, abuse?
- 14:15: Netcup replies: we actually are fixing the issue. Please be patient.
- 14:52: @siebrand replies: Can you please provide us with an update? Our environment has now effectively been down for almost a day. We can be patient, but providing details about what we should be waiting for will certainly help. Otherwise the only option we have is to complain and look for hosting elsewhere because of a lack of transparency.
- 15:00: Nginx on web1 is responding to web requests again, but still very slowly because of degraded disk I/O performance. It took about 2 hours to get to this level.
- 15:01: Netcup replies: the issue is still in progress. We provide all the information necessary, but we cannot give a status update e.g. every 15 minutes. Please be patient, the issue should be solved in a short amount of time.
- 16:17: @Nikerabbit restarts hhvm because it is still trying to process the long backlog of requests that have already timed out (see the command sketch after this timeline).
- 16:18: uptimerobot.com reports translatewiki.net is back up.
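For reference, the disk check from 13:42 and the recovery step from 16:17 correspond roughly to the commands below. This is a sketch: the dd invocation is the one quoted above, and restarting hhvm via systemctl is an assumption about how the service is managed on web1.

  # Rough write-throughput check on the root filesystem (from the 13:42 update).
  dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=dsync

  # Restart hhvm to drop the backlog of requests that had already timed out (16:17).
  systemctl restart hhvm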
Additionally:
- We found out that the passphrase for the backup recovery GPG key was known only to @Nikerabbit, and that he had forgotten it. The keys had been updated, but @siebrand had never been given the new keys or the passphrase. This meant that backups would not have been available. Later, @Nikerabbit found out where he had stored the passphrase. There was also confusion about which GPG keys were the right ones (old ones were still around), and the restoration script had not been updated for the current backup locations etc. In addition, even after these changes the restoration script did not work, because decryption failed with a cryptic error message (see the restore-drill sketch below).
- Initially we thought the issue was related to the hhvm and hhvm-development services repeatedly failing to start up. The service files had recently been updated, but we had not noticed that they did not work from a clean state after a reboot. The service files were updated again after the incident to resolve this (see the unit-check sketch below).
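One way to catch problems like the forgotten passphrase and the broken restoration script earlier would be a periodic restore drill along these lines. This is a sketch only: the backup path and archive format are placeholders, not our actual backup layout.

  # Decrypt the newest backup and list its contents without writing them to disk.
  # /backups/latest.tar.gz.gpg is a placeholder path; a gzipped tar archive is assumed.
  gpg --decrypt /backups/latest.tar.gz.gpg | tar -tzf - > /dev/null \
    && echo "restore test OK" \
    || echo "RESTORE TEST FAILED"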
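Similarly, the broken hhvm and hhvm-development unit files could have been caught before a real reboot with a static check plus a post-reboot status check, roughly as sketched below (assuming the unit files live in /etc/systemd/system).

  # Static sanity check of the unit files.
  systemd-analyze verify /etc/systemd/system/hhvm.service /etc/systemd/system/hhvm-development.service

  # After a test reboot, confirm both services actually came up.
  systemctl is-active hhvm hhvm-development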
We should hold a post-mortem on this outage and see if and where we can improve. Questions that may be relevant:
- What did we do right to fix this outage?
- What did we do wrong to fix this outage?
- Are there things we don't yet understand regarding this outage?
More specifically:
- What could we have done to reduce downtime?
- Could we have responded differently to Netcup support, so they could have better helped us?
- How can we ensure backups are secure and accessible to at least two server maintainers?
- How can we ensure that the console root password is known to at least two server maintainers? We found out that neither @siebrand nor @Nikerabbit knew the root password (or could find it quickly enough), so an additional server restart was needed to change the password.
- What is a reasonable and timely way of communicating downtime for translatewiki.net? We first announced it on Twitter (@translatewiki) at 12:09 on Wednesday (more than 20 hours into the downtime).
- Could we benefit from a cold stand-by?
- Should we improve our failover and redundancy in general?
- Should we communicate the Netcup-relevant outcomes of this session to them, and if so, what do we hope to get out of that?
This session is scheduled for Wikimedia Hackathon 2017 in room Powidl on Saturday from 09:15 to 10:00.