Don't continue scap if sync to all proxies failed
Open, Medium, Public

Assigned To

None

Authored By

	Reedy
	Aug 29 2015, 9:34 AM

Description

reedy@tin:/srv/mediawiki-staging$ sync-file wmf-config/db-eqiad.php Depool db1028, return ES servers back from maintenance
           ___ ____
         ⎛   ⎛ ,----
          \  //==--'
     _//|,.·//==--'    ____________________________
    _OO≣=-  ︶ ᴹw ⎞_§ ______  ___\ ___\ ,\__ \/ __ \
   (∞)_, )  (     |  ______/__  \/ /__ / /_/ / /_/ /
     ¨--¨|| |- (  / ______\____/ \___/ \__^_/  .__/
         ««_/  «_/ jgs/bd808                /_/

No syntax errors detected in /srv/mediawiki-staging/wmf-config/db-eqiad.php
09:32:19 Started sync-proxies
09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1010.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1161.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1070.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1097.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1033.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1201.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:19 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw1216.eqiad.wmnet returned [255]: Permission denied (publickey).

09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw2041.codfw.wmnet returned [255]: Permission denied (publickey).

09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw2119.codfw.wmnet returned [255]: Permission denied (publickey).

09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw2080.codfw.wmnet returned [255]: Permission denied (publickey).

09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw2001.codfw.wmnet returned [255]: Permission denied (publickey).

09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php'] on mw2187.codfw.wmnet returned [255]: Permission denied (publickey).

sync-proxies: 100% (ok: 0; fail: 12; left: 0)                                   
09:32:20 12 proxies had sync errors
09:32:20 Finished sync-proxies (duration: 00m 00s)
09:32:20 Started sync-apaches
09:32:20 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php', 'mw1010.eqiad.wmnet', 'mw1033.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1097.eqiad.wmnet', 'mw1216.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet', 'mw2001.codfw.wmnet', 'mw2041.codfw.wmnet', 'mw2080.codfw.wmnet', 'mw2119.codfw.wmnet', 'mw2187.codfw.wmnet'] on mw1198.eqiad.wmnet returned [255]: Permission denied (publickey).

Related Objects

Mentioned In: T111062: Scap should abort early when Keyholder is not armed
T110794: SCAP fails with Permission denied (publickey)
T110793: scap shouldn't log completion (it should log fail!)

Event Timeline

Reedy created this task. Aug 29 2015, 9:34 AM

Reedy raised the priority of this task from to Needs Triage.

Reedy updated the task description.

Reedy added projects: Release-Engineering-Team, Deployments.

Reedy subscribed.

Restricted Application added a subscriber: Aklapper. Aug 29 2015, 9:34 AM

Reedy mentioned this in T110793: scap shouldn't log completion (it should log fail!). Aug 29 2015, 9:36 AM

jcrespo mentioned this in T110794: SCAP fails with Permission denied (publickey). Aug 29 2015, 9:47 AM

Or at any stage, really? (sync-proxies, sync-apaches, scap-rebuild-cdbs, and sync_wikiversions)

Reedy mentioned this in T111062: Scap should abort early when Keyholder is not armed. Sep 1 2015, 3:25 PM

In T110791#1589568, @greg wrote:

Or at any stage, really? (sync-proxies, sync-apaches, scap-rebuild-cdbs, and sync_wikiversions)

Making scap halt in general for sync failures on a small number of servers would be bad, in my opinion. Except when a new branch is being pushed without an accompanying wikiversions bump, we would end up with mixed code on the cluster that is not fully deployed (e.g. new code but old messages). Given the relative frequency {{CN}} with which one or two servers fail for various soft reasons, this seems bad. The types of soft failures I'm thinking of are:

Server depooled for hardware issues but still in the list of MW hosts
Dumps server out of disk space
Server in non-hot DC failure due to whatever

I agree that failures to sync to the proxies should either:

Halt the whole process and require intervention to remove the failed proxy from the pool

Automatically drop the failed proxy from the pool of proxies that are sent to the MW servers
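The second option could look roughly like the sketch below. It assumes the sync-proxies step can report a per-host success flag; `ProxyResult`, `NoUsableProxies`, and `usable_proxies` are hypothetical names for illustration, not existing scap code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProxyResult:
    """Outcome of syncing one proxy (hypothetical structure, not scap's)."""
    host: str
    ok: bool


class NoUsableProxies(RuntimeError):
    """Raised when every proxy failed, so later stages would be a no-op."""


def usable_proxies(results: List[ProxyResult]) -> List[str]:
    """Return only the proxies that synced, dropping the failed ones.

    Raises NoUsableProxies if nothing synced, which covers the case in
    the task title: all proxies failed, so the run should stop.
    """
    good = [r.host for r in results if r.ok]
    if not good:
        raise NoUsableProxies(
            "sync failed on all %d proxies; aborting" % len(results)
        )
    return good
```

With this shape, the list handed to the MW servers in the sync-apaches stage would contain only hosts known to hold the new files, and a run like the one in the description (ok: 0; fail: 12) would stop before sync-apaches starts.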

The damage of using an out-of-date rsync proxy somehow eluded my conscious attention until @Krenair asked me what would happen after seeing such a failure recently.

We also have the capability to code in more advanced halting criteria (fail if N of M hosts fail at stage X) if that is really needed. Scap currently has no metadata about which hosts are "required" and which are "best effort" that could be used to make stricter decisions. In the case of proxies we should probably define some reasonable limit, such as halting the process if 25% or more of proxies fail, in order to avoid badly congested (i.e. really slow) syncs even if we go with the automatic removal option. (/me remembers running a 1.5h scap where everything in eqiad was fetching from tin)
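As a rough illustration of the "fail if N of M hosts fail" idea, here is a minimal sketch. The function name `should_halt` and its parameters are made up for this example; the 25% default is the figure suggested above, not a value scap uses.

```python
def should_halt(failed: int, total: int, threshold: float = 0.25) -> bool:
    """Return True if the share of failed hosts is at or above the threshold.

    Hypothetical helper: `failed` and `total` are host counts for one
    stage (for example sync-proxies); `threshold` is a fraction from 0 to 1.
    """
    if total <= 0:
        raise ValueError("total must be a positive host count")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    return failed / total >= threshold
```

For the run in the description, `should_halt(12, 12)` is true. A stricter or looser limit per stage would only need a different `threshold` argument.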

It would also be possible to make scap/sync-* behave more like Trebuchet and require (or allow) user interaction before proceeding from stage to stage. With the current sync workflow, however, this would probably be more disruptive than helpful: code/config changes are generally live on each host as soon as that host finishes its rsync, so adding a confirmation step would increase the mean time that things are half done across the server fleet.

I yield to @bd808

In T110791#1595614, @greg wrote:

I yield to @bd808

Like I originally titled it, if all fail, it's certainly a noop, and might as well be aborted. :)

As per Bryan, if it's only one or two proxies that are failing, having them not used for syncing sounds good, with some notification that intervention is required. If it's, say, over 50% failing, maybe that's a good point to not continue either, as something bigger is seemingly afoot.

greg removed a project: Release-Engineering-Team. Sep 24 2015, 1:27 AM

greg set Security to None.

greg edited projects, added scap2; removed Deployments. Feb 9 2016, 11:33 PM

Restricted Application added a subscriber: StudiesWorld. Feb 9 2016, 11:33 PM

• mmodell edited projects, added Scap; removed scap2. Feb 10 2017, 6:22 PM

• mmodell triaged this task as Medium priority. Feb 1 2018, 12:15 AM

• mmodell moved this task from Needs triage to Debt on the Scap board.

• mmodell subscribed.
