
/srv/mediawiki-staging broken on both scap masters
Closed, Resolved · Public

Description

Since a few minutes after reimaging tin, the /srv/mediawiki-staging directory on both scap masters is utterly broken.

  1. wikiversions.php is not present
  2. All directories like php-1.27.0-wmf.10/ are empty

It seems that for some reason the /srv/mediawiki-staging directory was cloned from tin to mira before the rsync in the other direction was performed.

We have no idea how that happened at the moment, but the fact is we need to fix this.

Event Timeline

Joe raised the priority of this task to Unbreak Now!.
Joe updated the task description.
Joe subscribed.

Looking at the audit log on mira, it seems that for some unknown reason

/usr/local/bin/scap-master-sync tin

was run and that wiped out the staging areas. This happened *before* I could try to rsync the directories.

I guess what is needed is to re-create them and to apply the patches.
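
For context, a rough sketch of what scap-master-sync presumably amounts to; this is an assumption consistent with the behaviour observed here and with scap-master-sync being invoked as the "rsync command" in the scap logs quoted below, not a copy of the real script:

# Hypothetical shape of /usr/local/bin/scap-master-sync <peer>.
# Assumption: it mirrors the staging tree from the named peer with --delete,
# which is why running it against a freshly reimaged (near-empty) tin
# removed everything that only existed on mira.
peer="$1"
rsync --archive --delete "${peer}::common/" /srv/mediawiki-staging/  # rsync module name is a guess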

The current situation is:

| Host | Dir | Date | Content |
| tin | /srv/mediawiki-staging/php-1.27.0-wmf.10/ | Feb 2 09:38 | Gone |
| tin | /srv/mediawiki/php-1.27.0-wmf.10/ | Jan 19 20:20 | Present |
| mira | /srv/mediawiki-staging/php-1.27.0-wmf.10/ | Feb 2 09:38 | Gone |
| mira | /srv/mediawiki/php-1.27.0-wmf.10/ | Feb 2 09:38 | Gone |

I.e. we have effectively lost all the code, caches, etc. from the staging areas. The app servers still have a copy, though (albeit without any git directories).


I have copied the scap logs from 8:00 to 13:00 UTC to a spreadsheet viewable by WMF people: https://docs.google.com/a/wikimedia.org/spreadsheets/d/1cDpZgVVuz0ODxH0UyAyBPvMkCnAeL9AQtBad9pApHRo/edit?usp=sharing

The SAL entries:

01:30 ebernhardson@mira: scap failed: CalledProcessError Command '/srv/deployment/scap/scap/bin/refreshCdbJsonFiles --directory="/srv/mediawiki-staging/php-1.27.0-wmf.10/cache/l10n" --threads=10 ' returned non-zero exit status 255 (duration: 03m 31s)
01:30 ebernhardson@mira: Finished scap: Add Cookie statement link to footer of all WMF wikis per legal (duration: 19m 42s)
06:58 _joe: reimagine tin.eqiad.wmnet
07:46 <jynus>: https://phabricator.wikimedia.org/rOMWC2ea9167221d11eb1880e4d26eae64a85cb9b2697 and https://phabricator.wikimedia.org/rOMWCa55d2bf8cd3a2853fac35d5b8239b8e8c2fe6a0f merged but not deployed
08:02 <jynus@mira>: Synchronized wmf-config/db-eqiad.php: Pool db1018; Depool db1021 (duration: 00m 20s)
09:12 <jynus@mira>: Synchronized wmf-config/db-eqiad.php: Depool db1036, repool db1021 (duration: 00m 21s)

On both tin and mira, stat on /srv/mediawiki-staging/php-1.27.0-wmf.10 reports Modify 2016-02-02 09:38:26.

Events around 9:38:

2016-02-02T09:12:10.000Z mira INFO scap.announce: Synchronized wmf-config/db-eqiad.php: Depool db1036, repool db1021 (duration: 00m 21s)
2016-02-02T09:41:52.000Z mira INFO timer: Started sync-masters
2016-02-02T09:41:53.000Z tin INFO sync_master: Copying to tin.eqiad.wmnet from mira.codfw.wmnet
2016-02-02T09:41:53.000Z tin DEBUG sync_master: Running rsync command: sudo -n -- /usr/local/bin/scap-master-sync mira.codfw.wmnet
2016-02-02T09:41:53.000Z tin INFO sync_master.timer: Started rsync master
2016-02-02T09:41:54.000Z tin INFO sync_master.timer: Finished rsync master (duration: 00m 00s)
2016-02-02T09:41:54.000Z tin INFO sync_master.timer: Started rebuild CDB staging files
2016-02-02T09:41:54.000Z tin WARNING merge_cdb_updates: Directory /srv/mediawiki-staging/php-1.27.0-wmf.10/cache/l10n/upstream is empty
2016-02-02T09:41:54.000Z tin INFO sync_master.timer: Finished rebuild CDB staging files (duration: 00m 00s)

The staging dir is gone on tin, most probably because the one on mira was already gone and got deleted by the sync-masters.

I created a tarball of the last known valid version of the deployed code (ironically, on tin) and another on one appserver, so that data that was not persisted anywhere else, like ./private, can be recovered.
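
For the record, the backup was along these lines (a sketch; the exact tar invocation is an assumption, but the file name matches the one mentioned further down):

# On tin: archive the last known good copy of the deployed code.
tar -czf /srv/mediawiki-dir-tin-20160202.tar.gz -C /srv mediawiki
# Same idea on one appserver, to capture data (e.g. ./private) that was not
# persisted anywhere else.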

To ensure no one can deploy by accident, I stopped the rsync daemons on both tin and mira, and disabled puppet too.
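
Roughly the commands involved (a sketch, assuming the stock Debian rsync service and the standard puppet agent CLI, not copied from the actual shell history):

# On both tin and mira:
service rsync stop
puppet agent --disable "mediawiki-staging broken; keep rsync stopped"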

Earliest files on mira:/srv/mediawiki-staging are from 8:39am

Let's look at the history on either tin or mira (since they got synced):

hashar@tin:/srv/mediawiki-staging$ git reflog --date=iso
914f42f HEAD@{2016-02-02 10:44:00 +0000}: rebase finished: returning to refs/heads/master
914f42f HEAD@{2016-02-02 10:44:00 +0000}: rebase: checkout refs/remotes/origin/master
4812897 HEAD@{2016-02-02 09:50:32 +0000}: rebase finished: returning to refs/heads/master
4812897 HEAD@{2016-02-02 09:50:32 +0000}: rebase: checkout refs/remotes/origin/master
40442d1 HEAD@{2016-02-02 09:40:30 +0000}: rebase finished: returning to refs/heads/master
40442d1 HEAD@{2016-02-02 09:40:30 +0000}: rebase: checkout refs/remotes/origin/master
5a4fbe1 HEAD@{2016-02-02 08:39:14 +0000}: clone: from https://gerrit.wikimedia.org/r/p/operations/mediawiki-config.git
hashar@tin:/srv/mediawiki-staging$ git reflog show --date=iso 'HEAD@{2016-02-02 08:39:14 +0000}' --format=fuller
commit 5a4fbe1
Reflog: HEAD@{2016-02-02 08:39:14 +0000} (mwdeploy <mwdeploy@tin.eqiad.wmnet>)
Reflog message: clone: from https://gerrit.wikimedia.org/r/p/operations/mediawiki-config.git

That would be puppet creating the staging area on tin. Later, the empty staging area ended up being copied to mira, effectively deleting all the staging areas :-/

Yes @hashar, that's exactly what happened.

To better clarify the timeline:

  • I reimaged tin around 08:00 UTC;
  • puppet created a git clone of operations/mediawiki-config in /srv/mediawiki-staging at 08:39;
  • I fixed a bug in the puppet manifests so that hhvm and the other support packages were installed, which took me some time;
  • I re-added tin to the list of the scap masters, so that the next sync-file from mira would have rsynced the code to tin correctly, around 09:24;
  • before I could run a no-op sync-file, /usr/local/bin/scap-master-sync tin was run on mira. This effectively synced the blank checkout on tin to mira, wiping out everything that was previously staged. This happened at 09:37.

I noticed the breakage afterwards, when I tried to run sync-common on a codfw appserver and found it broken.

The migration procedure itself was correct; the unfortunate incident happened because of a human error. The process of syncing the scap masters lacks any failsafe at the moment.
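
As an illustration of the kind of failsafe that is missing, here is a hypothetical guard that could run before any master sync (not part of the current scap-master-sync; the file check is just one possible sanity test):

# Hypothetical pre-sync guard: refuse to mirror from a peer whose staging
# area looks freshly provisioned (wikiversions.php missing was exactly the
# symptom in this incident).
peer="$1"
if ! ssh "$peer" test -f /srv/mediawiki-staging/wikiversions.php; then
    echo "refusing to sync: ${peer}:/srv/mediawiki-staging looks empty" >&2
    exit 1
fi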

CCing the whole Release-Engineering-Team.

TL;DR: we have lost the staging areas (code / caches / settings) from BOTH deployment servers and have to rebuild them.

tin.eqiad.wmnet has been reimaged as planned, to be ported to Jessie. Puppet provisions a basic staging area, which obviously lacks private settings, caches, etc.

At 09:37 the command scap-master-sync tin was run on mira. Since tin was still being provisioned, mira synced itself with an essentially empty staging area. That effectively removed all copies we had on the cluster.

@Joe did the diagnosis while I dug into the logs (see previous comments).

None of the deployment servers can be used. To prevent the whole site from being dead for hours, rsync and puppet are disabled on both hosts.

We have a backup of the mediawiki dir on tin at /srv/mediawiki-dir-tin-20160202.tar.gz. It lacks anything that is ignored by rsync (i.e. the .git dirs) but has the caches and private settings.

We will want to cancel today's deployment train and SWAT deploys.

We have restored 1.27.0-wmf.8, 1.27.0-wmf.9, and 1.27.0-wmf.10, and regenerated the l10n cache.
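
For reference, a sketch of the CDB/JSON part of that regeneration for one restored branch, using the same tool and flags as the failed-scap SAL entry quoted above:

# The l10n cache itself is rebuilt first via MediaWiki's
# rebuildLocalisationCache.php maintenance script; the CDB/JSON files are
# then refreshed with the tool seen in the SAL entry above:
/srv/deployment/scap/scap/bin/refreshCdbJsonFiles \
    --directory="/srv/mediawiki-staging/php-1.27.0-wmf.10/cache/l10n" --threads=10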

mw1017 has been synced.

We are now trying to figure out which private files might be needed.

Servers are being synced in small batches. We are proceeding with canary servers first and so far it is going fine.
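
In practice that amounts to something like the loop below (a sketch; host names and batching are illustrative, and sync-common is the per-host pull already mentioned above):

# Canary appservers first (mw1017 is the canary mentioned above, the second
# host name is purely illustrative).
for host in mw1017.eqiad.wmnet mw1099.eqiad.wmnet; do
    ssh "$host" sync-common
done
# Then the rest of the fleet in small batches, checking each batch before
# moving on.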

So, all apaches are back in sync with the master now. A couple of actionables here, off the top of my head:

  1. Backups (done in child T125527)
  2. We need to audit (and limit) the untracked stuff in staging. This took quite a bit of time and careful action to avoid wiping.
  3. Submodules for mw-config need to be initialized at provision time (gerrit 267929). This led to a lot of confusion.
  4. Scap should probably depool proxies and repool them when scap operations complete (see the sketch after this list). This was part of our restore behavior but makes general sense to avoid taxing those nodes.
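
A minimal sketch of that depool/repool wrapping, assuming conftool's confctl is used to flip the pooled state; the host name, selector, and confctl syntax are assumptions (current conftool syntax may differ from what was available at the time):

# Depool a scap rsync proxy (hypothetical host) before it fans out a sync,
# repool it once the scap operation completes.
host="mw1010.eqiad.wmnet"
confctl select "name=${host}" set/pooled=no
sync-file wmf-config/db-eqiad.php "example sync"
confctl select "name=${host}" set/pooled=yes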

Should all go into the official wikitech post-mortem.

Other than the incident report, we are now "back to normal": SWAT deploys are going out and we plan to do the train (1.27.0-wmf.12) today. Assigning to Chad for the incident report, but he'll obviously have others (@Joe, etc.) help him.

Keeping open until the report is posted, but for the purposes of pushing out wmf.12, this is done.

https://wikitech.wikimedia.org/wiki/Incident_documentation/20160202-deployment-server-loss