⚓ T277008 Server fabula outage, www.wikimedia.it offline (provider incident)

Official backup name (wrong date)	Size	When Analyzed	Estimated date (~)
✅ Settimanale 31 Mar 2021 11:50:45	37.76 GiB	`2021-04-03 11:00 CET`	`2021-03-06 05:50 CET`
✅ Settimanale 31 Mar 2021 11:50:45	37.64 GiB	`2021-04-06 19:10 CET`	`2021-01-30 04:46 CET`
✅ Settimanale 31 Mar 2021 11:49:48	38.29 GiB	`2021-04-04 11:00 CET`	`2021-02-20 05:16 CET`
✅ Settimanale 31 Mar 2021 11:47:47	37.47 GiB	`2021-04-07 10:00 CET`	`2021-01-23 04:47 CET`
✅ Settimanale 31 Mar 2021 11:47:36	38.38 GiB	`2021-04-06 18:29 CET`	`2021-02-27 04:47 CET`
✅ Settimanale 31 Mar 2021 11:46:47	43.32 GiB	`2021-04-07 10:45 CET`	`2021-03-06 04:47 CET`
✅ Settimanale 31 Mar 2021 11:46:25	37.97 GiB	`2021-04-06 11:30 CET`	`2021-02-13 04:44 CET`

Status	Assigned	Task
Resolved	valerio.bozzolan	T277008 Server fabula outage, www.wikimedia.it offline (provider incident)
Resolved	valerio.bozzolan	T277144 Disaster Recovery of WMI-LimeSurvey
Resolved	valerio.bozzolan	T278462 Recover https://cinquepermille.wikimedia.it after last outage (before 1 April 2021)
Declined	valerio.bozzolan	T289169 Website at libertadigitali.wikimedia.it offline since last outage

valerio.bozzolan created this task. Mar 10 2021, 7:23 AM

Joe subscribed. Mar 10 2021, 7:30 AM

valerio.bozzolan updated the task description. Mar 10 2021, 7:34 AM

valerio.bozzolan updated the task description. Mar 10 2021, 8:14 AM

Note that the "Public Cloud" panel of our service provider has been crashing for a couple of hours, making it impossible to create any new service.

(This image is not under CC-BY-SA or GPL)

503 Service Unavailable

I will retry ASAP.

Hi @valerio.bozzolan, from this tweet

https://twitter.com/olesovhcom/status/1369535787570724864

it seems that all of SBG will be down at least for the day.

valerio.bozzolan updated the task description. Mar 10 2021, 9:17 AM

valerio.bozzolan updated the task description. Mar 10 2021, 9:48 AM

valerio.bozzolan moved this task from Backlog to In progress on the WMIT-Infrastructure board. Mar 10 2021, 9:56 AM

Thank you @Joe,

Thanks to @Nemo_bis, who was able to allocate a new virtual machine, I've migrated the website from one datacenter to another. It is now online:

https://www.wikimedia.it/
https://wikimedia.it/
http://wikimedia.it/
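
A quick way to re-check these entry points after a migration could look like the sketch below. It is only an illustration, not the procedure used here: the function names are made up, and it assumes that every scheme/host combination should answer, including `http://www.wikimedia.it/`, which is not listed above.

```python
from itertools import product
from urllib.error import URLError
from urllib.request import urlopen

# Assumption: both schemes should work for both host names.
SCHEMES = ("https", "http")
HOSTS = ("www.wikimedia.it", "wikimedia.it")


def url_variants(schemes=SCHEMES, hosts=HOSTS):
    """Return every scheme/host combination as a URL with a trailing slash."""
    return [f"{scheme}://{host}/" for scheme, host in product(schemes, hosts)]


def is_online(url, timeout=10):
    """Return True if the URL answers with a status below 400 (redirects are followed)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (URLError, OSError):
        # HTTP errors, DNS failures, refused connections and timeouts all count as offline.
        return False


def check_all(fetch=is_online):
    """Map each URL variant to the result of the given check function."""
    return {url: fetch(url) for url in url_variants()}
```

Calling `check_all()` performs real HTTP requests and returns a dictionary from URL to `True`/`False`; passing a different `fetch` function allows a dry run without touching the network.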

Now I'm trying to restore other services from our tracked configuration:

rWIIN wikimedia-it-wmit-infrastructure

It will be a long day. asd

valerio.bozzolan added a subscriber: Nemo_bis. Mar 10 2021, 10:02 AM

valerio.bozzolan claimed this task. Mar 10 2021, 10:23 AM

valerio.bozzolan triaged this task as Unbreak Now! priority.

valerio.bozzolan updated the task description.

valerio.bozzolan updated the task description. Mar 10 2021, 11:16 AM

valerio.bozzolan awarded a token. Mar 10 2021, 11:29 AM

Ferdi2005 awarded a token. Mar 10 2021, 11:34 AM

lucamauri awarded a token. Mar 10 2021, 11:36 AM

Joe awarded a token. Mar 10 2021, 11:56 AM

Upstream update some seconds ago.

Status of Strasbourg Datacenter
SBG1 : Network Room is OK - 4 room destroyed - 8 Rooms OK
SBG2 : Destroyed
SBG3 : UPS Down - Check server still in progress
SBG4 : No physical impact

No restart today for SBG1,SBG3 and SBG4

Plan for the next 2 weeks:
1)Rebuilding 20KV for SBG3
2)Rebuilding 240V in SBG1/SBG4
3)Verifying DWDM/routers/switchs in the network room A (SBG1).
Checking the fibers Paris/Frankfurt
4)Rebuilding the network room B (SBG5) cheking fiber Paris/Frankfurt
― http://travaux.ovh.net/?do=details&id=49484

YacineBoussoufa awarded a token. Mar 10 2021, 1:07 PM

Base subscribed. Mar 10 2021, 2:16 PM

Nintendofan885 subscribed. Mar 10 2021, 2:16 PM

Addshore subscribed. Mar 10 2021, 3:17 PM

Sannita awarded a token. Mar 10 2021, 4:44 PM

valerio.bozzolan added a commit: rWIIN8972d8af7563: publish configuration for the new server 'intreccio' after the last outage. Mar 10 2021, 4:49 PM

valerio.bozzolan updated the task description. Mar 10 2021, 5:19 PM

valerio.bozzolan added a subscriber: M7.

The good part is that we should have a lot of cute snapshots safely stored in the datacenter SBG3

That's my understanding too per https://www.ovh.com/manager/public-cloud/#/pci/projects/ <redacted> /billing/history/2021/2 . So we should be able to recover "everything" by 2021-03-19 at the latest, per http://travaux.ovh.net/?do=details&id=49484 :

The restoration of the site's power supply of SBG1 and SBG4 is estimated for Monday, March 15. A recovery for SBG3 is estimated for Friday, March 19

Just my opinion, but I think a few of the services and microsites on the machine can wait 9 days. In the meantime we may want to treat the main WordPress instance as read-only, to make it easier to later pick either side as the new master.

In T277008#6902236, @Nemo_bis wrote:

In the meantime we may want to treat the main WordPress instance as read-only, to make it easier to later pick either side as the new master.

I thought so too, but in the meantime it seems the staff has some important work in the queue (e.g. for the "bando wiki-docente 2021" call). We can also help them as follows:

allow updates to www.wikimedia.it
wait for upstream news about our snapshot in SBG3
re-import that complete snapshot once ready, but overriding the specific data of www.wikimedia.it
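
One possible reading of this plan is sketched below: start from the complete snapshot, then let the data edited after the outage win for www.wikimedia.it only. This is a toy model under stated assumptions. The dictionaries, the function name `merge_restore` and the placeholder site `other-site.example` are hypothetical and do not describe the real restore procedure.

```python
def merge_restore(snapshot, live, keep_live=("www.wikimedia.it",)):
    """Start from the snapshot, then override the listed sites with live data.

    Assumption: each site's data can be restored independently of the others.
    """
    restored = dict(snapshot)  # copy, so the snapshot itself is left untouched
    for site in keep_live:
        if site in live:
            restored[site] = live[site]
    return restored


# Hypothetical placeholders standing in for real site data.
snapshot = {
    "www.wikimedia.it": "content as of the snapshot",
    "other-site.example": "content as of the snapshot",
}
live = {"www.wikimedia.it": "content edited after the outage"}

restored = merge_restore(snapshot, live)
```

In this sketch everything comes back from the snapshot except the sites named in `keep_live`, for which the newer data is kept.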

valerio.bozzolan mentioned this in T277144: Disaster Recovery of WMI-LimeSurvey. Mar 11 2021, 9:23 AM

GioRan awarded a token. Mar 11 2021, 9:27 AM

valerio.bozzolan added a commit: rWIIN40fe28a059db: recover some legacy VirtualHosts in server intreccio after last outage. Mar 11 2021, 12:41 PM

valerio.bozzolan mentioned this in T276229: Fix fabula.wikimedia.it:/mnt/archivio partition - no space available. Mar 11 2021, 5:16 PM

If I understand https://www.ovh.com/fr/images/sbg/index-en.html correctly, we'll know more about the location(s) of the snapshots in the coming 24-48 hours.

valerio.bozzolan closed subtask T277144: Disaster Recovery of WMI-LimeSurvey as Resolved. Mar 12 2021, 7:38 AM

taavi subscribed. Mar 12 2021, 7:46 AM

valerio.bozzolan mentioned this in T277269: Create LimeSurvey for "Bando WMI 2021 -Diventa wiki-docente-". Mar 12 2021, 7:46 AM

In T277008#6905929, @Nemo_bis wrote:

If I understand https://www.ovh.com/fr/images/sbg/index-en.html correctly, we'll know more about the location(s) of the snapshots in the coming 24-48 hours.

I understood the same thing, but what does "room" mean? Is it the number in SBGx? (If so, with SBG2 destroyed, our data is not lost as long as our backup is not in SBG1, Room 4.)
But if the backups of SBG2 are in other rooms of SBG2...

I'm sharing here yesterday's update:

From http://travaux.ovh.net/?do=details&id=49484 (Tuesday, 16 March 2021, 18:09)

SBG-3 Situation:

Servers undamaged

Electrical restart:

Temporarily repowered on 12th March and will be restored permanently on 16th March

Network restart:

New network room to be deployed and powered on 16th March

Internal network to be redeployed on 17th March

Server restart : Provisional ETA: Monday, 22 March for gradual restart

Update

http://travaux.ovh.net/?do=details&id=49484

SBG1 datacenter is currently closed and secured.
SBG2 datacenter is closed and secured.
SBG4 datacenter is closed and secured as a precautionary measure.

The recovery scenario will be defined during the weekend for SBG1 and SBG4 data centers, always focusing on security.

SBG3 is partially restored:
Teams continue to work on the SBG3 data center, as of today, respecting security measures.

valerio.bozzolan mentioned this in T278462: Recover https://cinquepermille.wikimedia.it after last outage (before 1 April 2021). Mar 25 2021, 5:44 PM

Update

http://travaux.ovh.net/?do=details&id=49484

SBG3 Public Cloud Instance 86%*
The restoration of services is carried out according to a restart schedule room by room, aisle by aisle and rack by rack. However, server cleanup is required, and this will determine when certain racks are put back into to service. Find out more about our cleaning process via this link (here).

#iosperiamochemelacavo
(only for Italian people, sorry)

Update

http://travaux.ovh.net/?do=details&id=49484

Public Cloud Instance - 90%*
Services not yet returned to customers will be available again on Monday 29/03 and Tuesday 30/03.

valerio.bozzolan updated the task description. Apr 1 2021, 5:47 AM

valerio.bozzolan closed subtask T278462: Recover https://cinquepermille.wikimedia.it after last outage (before 1 April 2021) as Resolved. Apr 1 2021, 7:13 AM

RhinosF1 subscribed. Apr 1 2021, 7:27 AM

valerio.bozzolan updated the task description. Apr 1 2021, 8:12 AM

valerio.bozzolan updated the task description. Apr 2 2021, 7:23 AM

valerio.bozzolan updated the task description. Apr 5 2021, 9:45 AM

valerio.bozzolan updated the task description. Apr 6 2021, 1:27 PM

valerio.bozzolan updated the task description. Apr 6 2021, 4:24 PM

valerio.bozzolan updated the task description. Apr 6 2021, 5:11 PM

valerio.bozzolan updated the task description. Apr 7 2021, 8:00 AM

valerio.bozzolan updated the task description. Apr 7 2021, 8:50 AM

Daimona awarded a token. Apr 14 2021, 4:03 PM

valerio.bozzolan added a commit: rWIINdb3e9fc45710: recover Matomo back from hell. Apr 24 2021, 10:20 AM

tstarling lowered the priority of this task from Unbreak Now! to High. May 20 2021, 4:41 AM

What's left to do here really?

valerio.bozzolan closed this task as Resolved. Jul 19 2021, 9:38 AM

valerio.bozzolan updated the task description.

In T277008#7153070, @Nemo_bis wrote:

What's left to do here really?

Unfortunately I don't have a complete overview, but I'm quite sure we were missing just WMI Framadate. Everything else seems fine to me, and we have not received any more complaints for many weeks.

About WMI Framadate, here is the related updated documentation:

https://wiki.wikimedia.it/wiki/Framadate/Technical_documentation

valerio.bozzolan moved this task from In progress to 🏛️ Organiz/Infra on the WMIT-Infrastructure board. Jul 19 2021, 9:56 AM

valerio.bozzolan added a commit: rWIINef48c75818c4: restore Framadate after last outage. Jul 19 2021, 10:25 AM

valerio.bozzolan added a commit: rWIIN610136ecf170: shutdown fire alarm. Nov 25 2021, 4:54 PM

valerio.bozzolan moved this task from 🏛️ Organiz/Infra to 🌐 Web on the WMIT-Infrastructure board. Jul 26 2022, 1:11 PM

valerio.bozzolan closed subtask T289169: Website at libertadigitali.wikimedia.it offline since last outage as Declined. Mar 20 2023, 9:10 AM

valerio.bozzolan added a commit: rWIIN213a338d1fac: publish very legacy configurations from server intreccio for the glory of…. May 9 2023, 9:12 PM

rWIIN wikimedia-it-wmit-infrastructure
	rWIIN213a338d1fac publish very legacy configurations from server intreccio for the glory of…
	rWIIN610136ecf170 shutdown fire alarm
	rWIINef48c75818c4 restore Framadate after last outage
	rWIINdb3e9fc45710 recover Matomo back from hell
	rWIIN40fe28a059db recover some legacy VirtualHosts in server intreccio after last outage
	rWIIN8972d8af7563 publish configuration for the new server 'intreccio' after the last outage
