WARNING: For general safety, please keep children away from this Task. The //blasphemy-o-meter// installed in the IT department is going off the expected scale.
Good morning everyone!
What a lovely day to catch fire!
The server `fabula.wikimedia.it` is offline due to a major incident in the `SBG` datacenter in Strasbourg, owned by our service provider. Specifically, the `SBG2` datacenter caught fire and was destroyed. Literally.
* http://travaux.ovh.net/?do=details&id=49484 - ticket about `SBG`
* http://travaux.ovh.net/?do=details&id=49471 - ticket about `SBG1`
* https://twitter.com/olesovhcom/status/1369478732247932929 - official tweet
* https://twitter.com/arthfl/status/1369604089034735620 (!)
This Task tracks info about the upstream outage and our Disaster Recovery. Spoiler: there will be no automagic recovery today.
The good news is that we should have a lot of cute snapshots safely stored in the datacenter `SBG3` (which was not damaged), but at the time of writing we have no way to migrate these snapshots to another datacenter because everything is frozen.
[X] `2021-03-10 09:00 CET` requested restore of the latest snapshot in another datacenter (`DE1`)
[X] `2021-03-10 09:36 CET` creation of another fallback VM (called `intreccio`) in another datacenter (thanks to @Nemo_bis)
[X] `2021-03-10 09:44 CET` start pushing the latest backups we have on hand into `intreccio` and preparing the VM
[X] `2021-03-10 10:28 CET` start migrating traffic of https://www.wikimedia.it/ to be served from `intreccio` (thanks to @M7)
[X] `2021-03-10 10:56 CET` make https://www.wikimedia.it/ up 'n' running again thanks to an off-site backup dated `2021-03-02` made during T276206
[X] `2021-03-10 11:22 CET` set up a generic catch-all failure notice on `intreccio`
[X] `2021-03-31 19:46 CET` info from our service provider about our server `fabula` (yeeh!)
[X] `2021-03-31 22:30 CET` deploy latest snapshot in another datacenter (`SBG7`)
[X] `2021-04-01 07:40 CET` experiencing issues in the latest snapshot, start troubleshooting in emergency mode
[X] `2021-04-01 09:20 CET` full-restore https://cinquepermille.wikimedia.it/ from latest snapshot (T278462)
[X] `2021-04-02 09:20 CET` full-restore Matomo from latest snapshot
[ ] `2021-04-08 09:20 CET` backup analysis
[ ] `2021-04-09 09:20 CET` full-restore other services from latest snapshot
[ ] `CANCELED ` ~~restore incoming traffic to `fabula` (DNS rollback)~~
* this will not be done and we will just go with `intreccio.wikimedia.it` as the production server
== Backup analysis ==
The provider recovered some backups, but their creation dates are wrong, so someone has to manually analyze each one to discover its actual content and actual date. This process is not quick: a restored snapshot does not boot on its own, because its `/etc/fstab` still references a filesystem that was lost during the incident and is no longer attached. That entry has to be removed from `/etc/fstab` in recovery mode first. Here is a summary.
NOTE: The estimated date comes from looking at `ls -lrt /var/log`. Nothing special.
| Official backup name (wrong date) | Size | When analyzed | Estimated date (~) |
|------------------------------------|-----------|---------------------------|------------------|
| ✅ Settimanale 31 Mar 2021 11:50:45 | 37.76 GiB | `2021-04-03 11:00 CET` | `6 Feb 2021 05:50` |
| Settimanale 31 Mar 2021 11:50:45 | 37.64 GiB | | |
| ✅ Settimanale 31 Mar 2021 11:49:48 | 38.29 GiB | `2021-04-04 11:00 CET` | `20 Feb 2021 05:16` |
| Settimanale 31 Mar 2021 11:47:47 | 37.47 GiB | `2021-04-06 09:00 CET` | |
| Settimanale 31 Mar 2021 11:47:36 | 38.38 GiB | | |
| Settimanale 31 Mar 2021 11:46:47 | 43.32 GiB | | |
| ✅ Settimanale 31 Mar 2021 11:46:25 | 37.97 GiB | `2021-04-05 11:30 CET` | |
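
For whoever picks up the remaining snapshots, the two manual steps described in this section (disabling the lost filesystem in `/etc/fstab`, then reading the newest timestamp in `/var/log`) could be scripted roughly as below. This is only a sketch, not what was actually run: the function names are made up, the mount point `/mnt/data` in the usage comment is a placeholder because the lost filesystem is not named in this Task, and the entry is commented out rather than deleted so that the change stays reversible.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def comment_out_mount(fstab_text: str, mount_point: str) -> str:
    """Return fstab_text with every active entry for mount_point commented out."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # fstab fields: device, mount point, type, options, dump, pass
        is_active = bool(fields) and not fields[0].startswith("#")
        if is_active and len(fields) >= 2 and fields[1] == mount_point:
            out.append("# " + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def newest_log_mtime(log_dir: str) -> Optional[datetime]:
    """Return the newest modification time among the entries of log_dir (UTC).

    Like `ls -lrt`, this looks only at the top-level entries and does not
    follow symlinks. Returns None if the directory is empty.
    """
    mtimes = [p.lstat().st_mtime for p in Path(log_dir).iterdir()]
    if not mtimes:
        return None
    return datetime.fromtimestamp(max(mtimes), tz=timezone.utc)


# Hypothetical usage on a snapshot mounted under /mnt/snapshot:
#   fstab = Path("/mnt/snapshot/etc/fstab")
#   fstab.write_text(comment_out_mount(fstab.read_text(), "/mnt/data"))
#   print(newest_log_mtime("/mnt/snapshot/var/log"))
```

The newest log timestamp is only a lower bound for when the snapshot was taken, which is why the table marks the estimated date with `~`.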