Good morning everyone!
What a lovely day to catch fire! 🌈
The server fabula.wikimedia.it is offline due to a major incident in the SBG datacenter in Strasbourg, owned by our service provider. In particular, the datacenter SBG2 caught fire and was destroyed. Literally.
- http://travaux.ovh.net/?do=details&id=49484 - ticket about SBG
- http://travaux.ovh.net/?do=details&id=49471 - ticket about SBG1
- https://twitter.com/olesovhcom/status/1369478732247932929 - official tweet
- https://twitter.com/arthfl/status/1369604089034735620 (!)
This Task tracks info about the upstream outage and our Disaster Recovery. Spoiler: there will be no automagic recovery today.
The good part is that we should have a lot of cute snapshots safely stored in the datacenter SBG3 (which was not damaged), but at the time of writing we have no way to migrate these snapshots from that datacenter to another one because everything is frozen.
- 2021-03-10 09:00 CET requested restore of the latest snapshot in another datacenter (DE1)
- 2021-03-10 09:36 CET creation of another fallback VM (called intreccio) in another datacenter (thanks to @Nemo_bis)
- 2021-03-10 09:44 CET start pushing the latest backups we have on hand to intreccio, and VM preparation
- 2021-03-10 10:28 CET start migrating traffic of https://www.wikimedia.it/ to be served from intreccio (thanks to @M7)
- 2021-03-10 10:56 CET make https://www.wikimedia.it/ up 'n' running again thanks to an off-site backup dated 2021-03-02 made during T276206
- 2021-03-10 11:22 CET setup a generic catch-all failure notice in intreccio
- 2021-03-31 19:46 CET info from our service provider about our server fabula (yeeh!)
- 2021-03-31 22:30 CET deploy latest snapshot in another datacenter (SBG7)
- 2021-04-01 07:40 CET experiencing issues in the latest snapshot, start troubleshooting in emergency mode
- 2021-04-01 09:20 CET full-restore https://cinquepermille.wikimedia.it/ from latest snapshot (T278462)
- 2021-04-02 09:20 CET full-restore Matomo from latest snapshot
- 2021-04-07 11:50 CET backup analysis
In the end we imported data from other backups and restored other services, such as Framapad, by hand.
Backup analysis
The provider recovered some backups, but their creation dates are wrong, so someone has to manually analyze each one to discover its actual content and actual date. This process is slow: a restored snapshot does not boot automatically, because /etc/fstab still lists a filesystem that was lost during the incident and is no longer attached. That entry has to be removed from /etc/fstab in recovery mode first. Here is a summary.
Official backup name (wrong date) | Size | When Analyzed | Estimated date (~) |
---|---|---|---|
✅ Settimanale 31 Mar 2021 11:50:45 | 37.76 GiB | 2021-04-03 11:00 CET | 2021-03-06 05:50 CET |
✅ Settimanale 31 Mar 2021 11:50:45 | 37.64 GiB | 2021-04-06 19:10 CET | 2021-01-30 04:46 CET |
✅ Settimanale 31 Mar 2021 11:49:48 | 38.29 GiB | 2021-04-04 11:00 CET | 2021-02-20 05:16 CET |
✅ Settimanale 31 Mar 2021 11:47:47 | 37.47 GiB | 2021-04-07 10:00 CET | 2021-01-23 04:47 CET |
✅ Settimanale 31 Mar 2021 11:47:36 | 38.38 GiB | 2021-04-06 18:29 CET | 2021-02-27 04:47 CET |
✅ Settimanale 31 Mar 2021 11:46:47 | 43.32 GiB | 2021-04-07 10:45 CET | 2021-03-06 04:47 CET |
✅ Settimanale 31 Mar 2021 11:46:25 | 37.97 GiB | 2021-04-06 11:30 CET | 2021-02-13 04:44 CET |
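
The /etc/fstab fix needed before each restored snapshot can boot could be scripted roughly as follows. This is only a minimal sketch, not what was actually run during the recovery: the function name `disable_fstab_entry`, the mount point `/mnt/backup` and the UUIDs in the sample are all hypothetical, since the task does not say which filesystem was lost. It comments out the entry instead of deleting it, so the original line stays visible.

```python
def disable_fstab_entry(fstab_text: str, mount_point: str) -> str:
    """Comment out every active fstab entry mounted at `mount_point`."""
    out = []
    for line in fstab_text.splitlines():
        stripped = line.strip()
        fields = stripped.split()
        # An active entry is a non-empty, non-comment line whose second
        # field is the mount point.
        is_entry = bool(stripped) and not stripped.startswith("#") and len(fields) >= 2
        if is_entry and fields[1] == mount_point:
            out.append("# disabled (volume lost): " + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical sample content; the real /etc/fstab of fabula is not known here.
sample = """\
# /etc/fstab
UUID=1111-aaaa  /             ext4  defaults  0 1
UUID=2222-bbbb  /mnt/backup   ext4  defaults  0 2
"""
fixed = disable_fstab_entry(sample, "/mnt/backup")
print(fixed)
```

In recovery mode the same function would be applied to the text of the snapshot's own /etc/fstab and the result written back. A possible alternative, for volumes that are expected to come back later, is adding the `nofail` mount option to the entry so that boot continues when the device is absent.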