
Don't continue scap if sync to all proxies failed
Open, Medium, Public

Description

reedy@tin:/srv/mediawiki-staging$ sync-file wmf-config/db-eqiad.php Depool db1028, return ES servers back from maintenance
           ___ ____
         ⎛   ⎛ ,----
          \  //==--'
     _//|,.·//==--'    ____________________________
    _OO≣=-  ︶ ᴹw ⎞_§ ______  ___\ ___\ ,\__ \/ __ \
   (∞)_, )  (     |  ______/__  \/ /__ / /_/ / /_/ /
     ¨--¨|| |- (  / ______\____/ \___/ \__^_/  .__/
         ««_/  «_/ jgs/bd808                /_/

No syntax errors detected in /srv/mediawiki-staging/wmf-config/db-eqiad.php
09:32:19 Started sync-proxies
09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1010.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1161.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1070.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1097.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1033.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1201.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1216.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw2041.codfw.wmnet returned [255]: Permission denied (publickey).

09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw2119.codfw.wmnet returned [255]: Permission denied (publickey).

09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw2080.codfw.wmnet returned [255]: Permission denied (publickey).

09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw2001.codfw.wmnet returned [255]: Permission denied (publickey).

09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw2187.codfw.wmnet returned [255]: Permission denied (publickey).

sync-proxies: 100% (ok: 0; fail: 12; left: 0)                                   
09:32:20 12 proxies had sync errors
09:32:20 Finished sync-proxies (duration: 00m 00s)
09:32:20 Started sync-apaches
09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php', 'mw1010.eqiad.wmnet', 'mw1033.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1097.eqiad.wmnet', 'mw1216.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet', 'mw2001.codfw.wmnet', 'mw2041.codfw.wmnet', 'mw2080.codfw.wmnet', 'mw2119.codfw.wmnet', 'mw2187.codfw.wmnet'] on mw1198.eqiad.wmnet returned [255]: Permission denied (publickey).

Event Timeline

Reedy raised the priority of this task from to Needs Triage.
Reedy updated the task description. (Show Details)
Reedy subscribed.

Or at any stage, really? (sync-proxies, sync-apaches, scap-rebuild-cdbs, and sync_wikiversions)

Making scap halt in general for sync failures on a small number of servers would be bad in my opinion. Except in the case of a new branch being pushed without any accompanying wikiversions bump, we would end up with mixed code on the cluster that is not fully deployed (e.g. new code but old messages). Given the relative frequency {{CN}} with which one or two servers fail for various soft reasons, this seems bad. The types of soft failures I'm thinking of are:

  • Server depooled for hardware issues but still in the list of MW hosts
  • Dumps server out of disk space
  • Server in the non-hot DC failing for whatever reason

I agree that failures to sync to the proxies should either:

  • Halt the whole process and require intervention to remove the failed proxy from the pool

OR

  • Automatically drop the failed proxy from the pool of proxies that are sent to the MW servers

The damage done by using an out-of-date rsync proxy somehow eluded my conscious attention until @Krenair, having recently seen such a failure, asked me what would happen.

We also have the capability to code in more advanced halting criteria (fail if N of M hosts fail at stage X) if that is really needed. Scap currently doesn't really have any metadata about which hosts are "required" and which are "best effort", which could be used to make stricter decisions. In the case of proxies we should probably define some reasonable limit, e.g. halt the process if 25% or more of proxies fail, in order to avoid badly congested (i.e. really slow) syncs even if we go with the automatic removal option. (/me remembers running a 1.5h scap where everything in eqiad was fetching from tin)
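A rough sketch of what that combined policy might look like, purely for illustration: drop failed proxies from the pool, but halt if too many failed. This is not scap's actual code; the function name `usable_proxies`, the exception `TooManyProxyFailures`, and the shape of the `results` mapping (hostname to exit status) are all assumptions made up for this example. The 25% default is just the number floated above.

```python
from typing import Dict, List


class TooManyProxyFailures(Exception):
    """Hypothetical exception: too many proxies failed to continue safely."""


def usable_proxies(results: Dict[str, int],
                   max_fail_fraction: float = 0.25) -> List[str]:
    """Return the proxies that synced cleanly, or raise if too many failed.

    `results` maps proxy hostname -> exit status of the sync command
    (0 means success). This mapping is an assumed shape, not scap's API.
    """
    if not results:
        raise TooManyProxyFailures("no proxies were attempted")

    ok = sorted(host for host, status in results.items() if status == 0)
    failed = sorted(host for host, status in results.items() if status != 0)
    fraction = len(failed) / len(results)

    # Halt when every proxy failed, or when the failed share reaches the limit.
    if not ok or fraction >= max_fail_fraction:
        raise TooManyProxyFailures(
            "%d of %d proxies failed: %s"
            % (len(failed), len(results), ", ".join(failed))
        )

    # Otherwise carry on with only the proxies known to be up to date.
    return ok
```

With the log from the description (12 of 12 proxies returning 255), this would raise instead of moving on to sync-apaches. With a single failed proxy out of 12, it would return the 11 healthy ones so the MW servers never fetch from the stale proxy.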

It would also be possible to make scap/sync-* behave more like Trebuchet and require (or allow) user interaction before proceeding from stage to stage. With the current sync workflow, however, this would probably be more disruptive than helpful: code/config changes are generally live on each host as soon as that host finishes its rsync, so adding a confirmation step would increase the mean time that things are half done across the server fleet.

I yield to @bd808

Like I originally titled it, if all fail, it's certainly a noop, and might as well be aborted. :)

As per Bryan, if it's only one or two proxies that are failing, having them not used for syncing sounds good, with some notification that intervention is required. If it's, say, over 50% failing, maybe that's a good point to not continue either, as something bigger is seemingly afoot.

mmodell moved this task from Needs triage to Debt on the Scap board.
mmodell subscribed.