
/srv/mediawiki-staging broken on both scap masters
Closed, Resolved · Public

Description

Since a few minutes after reimaging tin, the /srv/mediawiki-staging directory on both scap masters is utterly broken.

  1. wikiversions.php is not present
  2. All directories like php-1.27.0-wmf.10/ are empty

It seems that for some reason the /srv/mediawiki-staging directory was cloned from tin to mira before the rsync in the other direction was performed.

We have no idea how that happened at the moment, but the fact is we need to fix this.

Event Timeline

Joe raised the priority of this task to Unbreak Now!.
Joe updated the task description.
Joe subscribed.

Looking at the audit log on mira, it seems that for some unknown reason

/usr/local/bin/scap-master-sync tin

was run and that wiped out the staging areas. This happened *before* I could try to rsync the directories.

I guess what is needed is to re-create them and to apply the patches.
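
For context, a rough sketch of what scap-master-sync presumably amounts to; this is an assumption consistent with the behaviour observed here and with scap-master-sync being invoked as the "rsync command" in the scap logs quoted below, not a copy of the real script:

# Hypothetical shape of /usr/local/bin/scap-master-sync <peer>.
# Assumption: it mirrors the staging tree from the named peer with --delete,
# which is why running it against a freshly reimaged (near-empty) tin
# removed everything that only existed on mira.
peer="$1"
rsync --archive --delete "${peer}::common/" /srv/mediawiki-staging/  # rsync module name is a guess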

The current situation is:

| Host | Dir | Date | Content |
| tin | /srv/mediawiki-staging/php-1.27.0-wmf.10/ | Feb 2 09:38 | Gone |
| tin | /srv/mediawiki/php-1.27.0-wmf.10/ | Jan 19 20:20 | Present |
| mira | /srv/mediawiki-staging/php-1.27.0-wmf.10/ | Feb 2 09:38 | Gone |
| mira | /srv/mediawiki/php-1.27.0-wmf.10/ | Feb 2 09:38 | Gone |

I.e. we have effectively lost all the code, caches, etc. from the staging areas. The app servers still have a copy, though (albeit without any git directories).


I have copied the scap logs from 8:00 to 13:00 UTC to a spreadsheet viewable by WMF people: https://docs.google.com/a/wikimedia.org/spreadsheets/d/1cDpZgVVuz0ODxH0UyAyBPvMkCnAeL9AQtBad9pApHRo/edit?usp=sharing

The SAL entries:

01:30 ebernhardson@mira: scap failed: CalledProcessError Command '/srv/deployment/scap/scap/bin/refreshCdbJsonFiles --directory="/srv/mediawiki-staging/php-1.27.0-wmf.10/cache/l10n" --threads=10 ' returned non-zero exit status 255 (duration: 03m 31s)
01:30 ebernhardson@mira: Finished scap: Add Cookie statement link to footer of all WMF wikis per legal (duration: 19m 42s)
06:58 _joe: reimagine tin.eqiad.wmnet
07:46 <jynus>: https://phabricator.wikimedia.org/rOMWC2ea9167221d11eb1880e4d26eae64a85cb9b2697 and https://phabricator.wikimedia.org/rOMWCa55d2bf8cd3a2853fac35d5b8239b8e8c2fe6a0f merged but not deployed
08:02 <jynus@mira>: Synchronized wmf-config/db-eqiad.php: Pool db1018; Depool db1021 (duration: 00m 20s)
09:12 <jynus@mira>: Synchronized wmf-config/db-eqiad.php: Depool db1036, repool db1021 (duration: 00m 21s)

On both tin and mira, stat on /srv/mediawiki-staging/php-1.27.0-wmf.10 reports Modify 2016-02-02 09:38:26.

Events around 9:38:

2016-02-02T09:12:10.000Z mira INFO scap.announce: Synchronized wmf-config/db-eqiad.php: Depool db1036, repool db1021 (duration: 00m 21s)
2016-02-02T09:41:52.000Z mira INFO timer: Started sync-masters
2016-02-02T09:41:53.000Z tin INFO sync_master: Copying to tin.eqiad.wmnet from mira.codfw.wmnet
2016-02-02T09:41:53.000Z tin DEBUG sync_master: Running rsync command: sudo -n -- /usr/local/bin/scap-master-sync mira.codfw.wmnet
2016-02-02T09:41:53.000Z tin INFO sync_master.timer: Started rsync master
2016-02-02T09:41:54.000Z tin INFO sync_master.timer: Finished rsync master (duration: 00m 00s)
2016-02-02T09:41:54.000Z tin INFO sync_master.timer: Started rebuild CDB staging files
2016-02-02T09:41:54.000Z tin WARNING merge_cdb_updates: Directory /srv/mediawiki-staging/php-1.27.0-wmf.10/cache/l10n/upstream is empty
2016-02-02T09:41:54.000Z tin INFO sync_master.timer: Finished rebuild CDB staging files (duration: 00m 00s)

The staging dir is gone on tin, most probably because the one on mira was already gone and got deleted by the sync-masters.

I created a tarball of the last known valid version of the deployed code (ironically, on tin) and another on one appserver, so that data that was not persisted anywhere else, like ./private, can be recovered.
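
For the record, the backup was along these lines (a sketch; the exact tar invocation is an assumption, but the file name matches the one mentioned further down):

# On tin: archive the last known good copy of the deployed code.
tar -czf /srv/mediawiki-dir-tin-20160202.tar.gz -C /srv mediawiki
# Same idea on one appserver, to capture data (e.g. ./private) that was not
# persisted anywhere else.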

To ensure no one can deploy by accident, I stopped the rsync daemons on both tin and mira, and disabled puppet too.
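
Roughly the commands involved (a sketch, assuming the stock Debian rsync service and the standard puppet agent CLI, not copied from the actual shell history):

# On both tin and mira:
service rsync stop
puppet agent --disable "mediawiki-staging broken; keep rsync stopped"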

Earliest files on mira:/srv/mediawiki-staging are from 8:39am

Let's look at the history on either tin or mira (since they got synced):

hashar@tin:/srv/mediawiki-staging$ git reflog --date=iso
914f42f HEAD@{2016-02-02 10:44:00 +0000}: rebase finished: returning to refs/heads/master
914f42f HEAD@{2016-02-02 10:44:00 +0000}: rebase: checkout refs/remotes/origin/master
4812897 HEAD@{2016-02-02 09:50:32 +0000}: rebase finished: returning to refs/heads/master
4812897 HEAD@{2016-02-02 09:50:32 +0000}: rebase: checkout refs/remotes/origin/master
40442d1 HEAD@{2016-02-02 09:40:30 +0000}: rebase finished: returning to refs/heads/master
40442d1 HEAD@{2016-02-02 09:40:30 +0000}: rebase: checkout refs/remotes/origin/master
5a4fbe1 HEAD@{2016-02-02 08:39:14 +0000}: clone: from https://gerrit.wikimedia.org/r/p/operations/mediawiki-config.git
hashar@tin:/srv/mediawiki-staging$ git reflog show --date=iso 'HEAD@{2016-02-02 08:39:14 +0000}' --format=fuller
commit 5a4fbe1
Reflog: HEAD@{2016-02-02 08:39:14 +0000} (mwdeploy <mwdeploy@tin.eqiad.wmnet>)
Reflog message: clone: from https://gerrit.wikimedia.org/r/p/operations/mediawiki-config.git

That would be puppet creating the staging area on tin. Later, the empty staging area ended up being copied to mira, effectively deleting all the staging areas :-/

Yes @hashar, that's exactly what happened.

To better clarify the timeline:

  • I reimaged tin around 08:00 UTC;
  • puppet created a git clone of operations/mediawiki-config in /srv/mediawiki-staging at 08:39;
  • I fixed a bug in the puppet manifests so that hhvm and the other support packages were installed, which took me some time;
  • I re-added tin to the list of the scap masters, so that the next sync-file from mira would have rsynced the code to tin correctly, around 09:24;
  • before I could run a no-op sync-file, /usr/local/bin/scap-master-sync tin was run on mira. This effectively synced the blank checkout on tin to mira, wiping out everything that was previously staged. This happened at 09:37.

I noticed the breakage afterwards, when I tried to run sync-common on a codfw appserver and found it broken.

The migration procedure itself was correct; the unfortunate incident happened because of a human error. The process of syncing the scap masters lacks any failsafe at the moment.
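
As an illustration of the kind of failsafe that is missing, here is a hypothetical guard that could run before any master sync (not part of the current scap-master-sync; the file check is just one possible sanity test):

# Hypothetical pre-sync guard: refuse to mirror from a peer whose staging
# area looks freshly provisioned (wikiversions.php missing was exactly the
# symptom in this incident).
peer="$1"
if ! ssh "$peer" test -f /srv/mediawiki-staging/wikiversions.php; then
    echo "refusing to sync: ${peer}:/srv/mediawiki-staging looks empty" >&2
    exit 1
fi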

CCing the whole Release-Engineering-Team.

TL;DR: we have lost the staging areas (code / caches / settings) from BOTH deployment servers and have to rebuild them.

tin.eqiad.wmnet has been reimaged as planned, to be ported to Jessie. Puppet provisions a basic staging area, which obviously lacks private settings, caches, etc.

At 09:37 the command scap-master-sync tin was run on mira. Since tin was still being provisioned, mira synced itself with an essentially empty staging area. That effectively removed all copies we had on the cluster.

@Joe did the diagnosis while I dug into the logs (see previous comments).

None of the deployment servers can be used. To prevent the whole site from being dead for hours, rsync and puppet are disabled on both hosts.

We have a backup of the mediawiki dir on tin at /srv/mediawiki-dir-tin-20160202.tar.gz. It lacks anything that is ignored by rsync (i.e. the .git dirs) but has the caches and private settings.

We will want to cancel today's deployment train and SWAT deploys.

We have restored 1.27.0-wmf.8, 1.27.0-wmf.9, and 1.27.0-wmf.10, and regenerated the l10n cache.
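
For reference, a sketch of the CDB/JSON part of that regeneration for one restored branch, using the same tool and flags as the failed-scap SAL entry quoted above:

# The l10n cache itself is rebuilt first via MediaWiki's
# rebuildLocalisationCache.php maintenance script; the CDB/JSON files are
# then refreshed with the tool seen in the SAL entry above:
/srv/deployment/scap/scap/bin/refreshCdbJsonFiles \
    --directory="/srv/mediawiki-staging/php-1.27.0-wmf.10/cache/l10n" --threads=10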

mw1017 has been synced.

We are now trying to figure out which private files might be needed.

Servers are being synced in small batches. We are proceeding with canary servers first and so far it is going fine.
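
In practice that amounts to something like the loop below (a sketch; host names and batching are illustrative, and sync-common is the per-host pull already mentioned above):

# Canary appservers first (mw1017 is the canary mentioned above, the second
# host name is purely illustrative).
for host in mw1017.eqiad.wmnet mw1099.eqiad.wmnet; do
    ssh "$host" sync-common
done
# Then the rest of the fleet in small batches, checking each batch before
# moving on.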

So, all apaches are back in sync with the master now. A couple of actionables here, off the top of my head:

  1. Backups (done in child T125527)
  2. We need to audit (and limit) the untracked stuff in staging. This took quite a bit of time and careful action to avoid wiping.
  3. Submodules for mw-config need to be initialized at provision time (gerrit 267929). This led to a lot of confusion.
  4. Scap should probably depool proxies and repool them when scap operations complete (see the sketch after this list). This was part of our restore behavior but makes general sense to avoid taxing those nodes.
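
A minimal sketch of that depool/repool wrapping, assuming conftool's confctl is used to flip the pooled state; the host name, selector, and confctl syntax are assumptions (current conftool syntax may differ from what was available at the time):

# Depool a scap rsync proxy (hypothetical host) before it fans out a sync,
# repool it once the scap operation completes.
host="mw1010.eqiad.wmnet"
confctl select "name=${host}" set/pooled=no
sync-file wmf-config/db-eqiad.php "example sync"
confctl select "name=${host}" set/pooled=yes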

Should all go into the official wikitech post-mortem.

Other than the incident report, we are now "back to normal": SWAT deploys are going out and we plan to do the train (1.27.0-wmf.12) today. Assigning to Chad for the incident report, but he'll obviously have others (@Joe, etc.) help him.

Keeping open until the report is posted, but for the purposes of pushing out wmf.12, this is done.

https://wikitech.wikimedia.org/wiki/Incident_documentation/20160202-deployment-server-loss