
Server fabula outage, www.wikimedia.it offline (provider incident)
Open, Unbreak Now!, Public

Assigned To
Authored By
valerio.bozzolan
Mar 10 2021, 7:23 AM

Description

WARNING: For general safety, please keep children away from this Task. The blasphemy-o-meter installed in the IT department is going off the expected scale.

Good morning everyone!

What a lovely day to catch fire! 🌈

The server fabula.wikimedia.it is offline due to a major incident in the SBG datacenter in Strasbourg, owned by our service provider. In particular, the datacenter SBG2 caught fire and was destroyed. Literally.

This Task tracks info about the upstream outage and our Disaster Recovery. Spoiler: there will be no automagic recovery today.

The good part is that we should have a lot of cute snapshots safely stored in the datacenter SBG3 (which was not damaged), but at the time of writing we have no way to migrate these snapshots from that datacenter to another one because everything is frozen.

  • 2021-03-10 09:00 CET requested restore of the latest snapshot in another datacenter (DE1)
  • 2021-03-10 09:36 CET creation of another fallback VM (called intreccio) in another datacenter (thanks to @Nemo_bis)
  • 2021-03-10 09:44 CET start pushing the latest backups we have on hand to intreccio and preparing the VM
  • 2021-03-10 10:28 CET start migrating traffic of https://www.wikimedia.it/ to be served from intreccio (thanks to @M7)
  • 2021-03-10 10:56 CET make https://www.wikimedia.it/ up 'n' running again thanks to an off-site backup dated 2021-03-02, made during T276206
  • 2021-03-10 11:22 CET set up a generic catch-all failure notice on intreccio
  • 2021-03-31 19:46 CET info from our service provider about our server fabula (yeeh!)
  • 2021-03-31 22:30 CET deploy latest snapshot in another datacenter (SBG7)
  • 2021-04-01 07:40 CET experiencing issues in the latest snapshot, start troubleshooting in emergency mode
  • 2021-04-01 09:20 CET full-restore https://cinquepermille.wikimedia.it/ from latest snapshot (T278462)
  • 2021-04-02 09:20 CET full-restore Matomo from latest snapshot
  • 2021-04-07 11:50 CET backup analysis
  • 2021-04-09 09:20 CET full-restore other services from latest snapshot
  • CANCELED restore incoming traffic to fabula (DNS rollback)
    • ↑ this will not be done and we will just go with intreccio.wikimedia.it as production server

Backup analysis

The provider recovered some backups, summarized below, but their creation dates are wrong, so someone has to manually analyze each one to discover its actual content and actual date. This process is slow because a restored snapshot does not boot on its own: a filesystem that was lost during the incident (and is no longer attached) must first be removed from /etc/fstab in recovery mode.
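The /etc/fstab fix mentioned above can be sketched in Python. This is only an illustration, not the procedure actually used on fabula: the function name and the mount point /mnt/data are hypothetical, since the task does not say which filesystem was lost. The sketch comments the entry out instead of deleting it, so the original line stays visible.

```python
def disable_fstab_entry(fstab_text: str, missing_mount_point: str) -> str:
    """Comment out every fstab entry mounted at missing_mount_point."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Skip blank lines and lines that are already comments.
        is_entry = bool(fields) and not fields[0].startswith("#")
        # In fstab the second field is the mount point.
        if is_entry and len(fields) >= 2 and fields[1] == missing_mount_point:
            out.append("# disabled (volume lost in the incident): " + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical fstab content, for illustration only.
    sample = (
        "UUID=aaaa-1111 /         ext4 defaults 0 1\n"
        "UUID=bbbb-2222 /mnt/data ext4 defaults 0 2\n"
    )
    print(disable_fstab_entry(sample, "/mnt/data"), end="")
```

In recovery mode the result would have to be written back to /etc/fstab on the restored root filesystem before rebooting.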

NOTE: The estimated date is obtained by looking at ls -lrt /var/log. Nothing special.
Official backup name (wrong date) | Size | When analyzed | Estimated date (~)
✅ Settimanale 31 Mar 2021 11:50:45 | 37.76 GiB | 2021-04-03 11:00 CET | 2021-03-06 05:50 CET
✅ Settimanale 31 Mar 2021 11:50:45 | 37.64 GiB | 2021-04-06 19:10 CET | 2021-01-30 04:46 CET
✅ Settimanale 31 Mar 2021 11:49:48 | 38.29 GiB | 2021-04-04 11:00 CET | 2021-02-20 05:16 CET
✅ Settimanale 31 Mar 2021 11:47:47 | 37.47 GiB | 2021-04-07 10:00 CET | 2021-01-23 04:47 CET
✅ Settimanale 31 Mar 2021 11:47:36 | 38.38 GiB | 2021-04-06 18:29 CET | 2021-02-27 04:47 CET
✅ Settimanale 31 Mar 2021 11:46:47 | 43.32 GiB | 2021-04-07 10:45 CET | 2021-03-06 04:47 CET
✅ Settimanale 31 Mar 2021 11:46:25 | 37.97 GiB | 2021-04-06 11:30 CET | 2021-02-13 04:44 CET
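
The date estimate used for the table (the last entry shown by ls -lrt /var/log) could be approximated in Python as follows. This is a rough sketch under assumptions: the function name is made up, and unlike ls it only considers regular files directly inside the directory, ignoring subdirectories and symlinks.

```python
import os
from datetime import datetime, timezone


def newest_mtime(log_dir: str) -> datetime:
    """Return the most recent modification time among files in log_dir."""
    newest = None
    for entry in os.scandir(log_dir):
        # Only regular files; subdirectories and symlinks are skipped.
        if entry.is_file(follow_symlinks=False):
            mtime = entry.stat(follow_symlinks=False).st_mtime
            if newest is None or mtime > newest:
                newest = mtime
    if newest is None:
        raise ValueError(f"no regular files in {log_dir}")
    # Returned in UTC; convert to CET when comparing with the table.
    return datetime.fromtimestamp(newest, tz=timezone.utc)
```

On a restored snapshot this would be called on its log directory, for example newest_mtime("/var/log") from inside the booted VM. The result is only as good as the assumption that logging kept running until the snapshot was taken.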

Event Timeline

Note that the "Public Cloud" panel of our service provider has been crashing for a couple of hours, which makes it impossible to create any new service.

(This image is not under CC-BY-SA or GPL)

503 Service Unavailable

I will retry ASAP.

Hi @valerio.bozzolan, from this tweet

https://twitter.com/olesovhcom/status/1369535787570724864

it seems that all of SBG will be down at least for the day.

Thank you @Joe,

Thanks to @Nemo_bis, who was able to allocate a new virtual machine, I've migrated the website from one datacenter to another. It is now online:

https://www.wikimedia.it/
https://wikimedia.it/
http://wikimedia.it/

Now I'm trying to restore other services from our tracked configuration:

rWIIN wikimedia-it-wmit-infrastructure

It will be a long day. asd

valerio.bozzolan triaged this task as Unbreak Now! priority.
valerio.bozzolan updated the task description.

Upstream update some seconds ago.

Status of Strasbourg Datacenter
SBG1 : Network Room is OK - 4 rooms destroyed - 8 Rooms OK
SBG2 : Destroyed
SBG3 : UPS Down - Check server still in progress
SBG4 : No physical impact

No restart today for SBG1, SBG3 and SBG4

Plan for the next 2 weeks:
1) Rebuilding 20KV for SBG3
2) Rebuilding 240V in SBG1/SBG4
3) Verifying DWDM/routers/switches in the network room A (SBG1). Checking the fibers Paris/Frankfurt
4) Rebuilding the network room B (SBG5), checking fiber Paris/Frankfurt
http://travaux.ovh.net/?do=details&id=49484

The good part is that we should have a lot of cute snapshots safely stored in the datacenter SBG3

That's my understanding too per https://www.ovh.com/manager/public-cloud/#/pci/projects/ <redacted> /billing/history/2021/2 . So we should be able to recover "everything" by 2021-03-19 at the latest, per http://travaux.ovh.net/?do=details&id=49484 :

The restoration of the site's power supply of SBG1 and SBG4 is estimated for Monday, March 15. A recovery for SBG3 is estimated for Friday, March 19

Just my opinion, but I think a few of the services and microsites on the machine can wait 9 days. In the meantime we may want to treat the main WordPress instance as read-only, to make it easier to later pick either side as the new master.

In the meantime we may want to treat the main WordPress instance as read-only, to make it easier to later pick either side as the new master.

I thought so too, but in the meantime it seems the staff has some important work in the queue (e.g. for the bando wiki-docente 2021). We can also help them as follows:

  1. allow updates to www.wikimedia.it
  2. wait for upstream news about our snapshot in SBG3
  3. re-import that complete snapshot once ready, but overriding it with the newer data of www.wikimedia.it

If I understand https://www.ovh.com/fr/images/sbg/index-en.html correctly, we'll know more about the location(s) of the snapshots in the coming 24-48 hours.

If I understand https://www.ovh.com/fr/images/sbg/index-en.html correctly, we'll know more about the location(s) of the snapshots in the coming 24-48 hours.

I understood the same thing, but what does "room" mean? Is it the SBGx number? If so, since SBG2 is destroyed, our data are not lost as long as our backup is not in SBG1, room 4.
But if the backups of SBG2 are in another room of SBG2...

I'm sharing here yesterday's update:

From http://travaux.ovh.net/?do=details&id=49484 (Tuesday, 16 March 2021, 18:09PM)

SBG-3 Situation:

  • Servers undamaged

Electrical restart:

  • Temporarily repowered on 12th March and will be restored permanently on 16th March

Network restart:

  • New network room to be deployed and powered on 16th March
  • Internal network to be redeployed on 17th March
  • Server restart : Provisional ETA: Monday, 22 March for gradual restart

Update

SBG1 datacenter is currently closed and secured.
SBG2 datacenter is closed and secured.
SBG4 datacenter is closed and secured as a precautionary measure.

The recovery scenario will be defined during the weekend for SBG1 and SBG4 data centers, always focusing on security.

SBG3 is partially restored:
Teams continue to work on the SBG3 data center, as of today, respecting security measures.

Update

SBG3 Public Cloud Instance 86%*
The restoration of services is carried out according to a restart schedule room by room, aisle by aisle and rack by rack. However, server cleanup is required, and this will determine when certain racks are put back into service. Find out more about our cleaning process via this link (here).

#iosperiamochemelacavo
(only for Italian people, sorry)

Update

Public Cloud Instance - 90%*
Services not yet returned to customers will be available again on Monday 29/03 and Tuesday 30/03.