Improve behavior around global Scap lock + communicate changes
Closed, Resolved, Public

Description

Creating the global file /var/lock/scap-global-lock no longer blocks Scap deployments or self-updates.

Other teams still seem to expect the behavior above. In addition, a cursory review of the TimeoutLock code shows that the global lock is still treated as a special case, but its designed behavior is not immediately clear.

As it turns out, manual creation of the file was superseded by the command scap lock. We should update any references in the docs/wikis and let interested parties know.

Additionally, we should allow locking for periods longer than 1h and provide a mechanism to forcefully release the global lock.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
block_execution: do not block all scap commands on host | repos/releng/scap!117 | jnuche | targets-in-blocked-mode | master
lock: refactored + improved user feedback | repos/releng/scap!114 | jnuche | refactor-timeoutlock | master
config: new value to disable Scap execution | repos/releng/scap!113 | jnuche | disable-scap-conf | master
rework scap locks | repos/releng/scap!112 | jnuche | rework-scap-locks | master

Event Timeline

jnuche renamed this task from Global lock file not being honored by Scap to Improve behavior around global Scal lock + communicate changes. Mar 2 2023, 11:10 AM
jnuche updated the task description.
jnuche added a subscriber: dancy.

The Puppet class profile::mediawiki::deployment::server needs to be updated too.

Thanks @taavi!

Feedback on scap lock from @Krinkle

dancy: the main reason I avoided the command so far is that it logs to SAL which feels overkill for merely following the (imho useful) practice of locking for the 10min around staging and testing a change, to avoid incomplete or untested deploys. I feel that logging them is putting an undue burden on the channel and logs, feels like too big a deal compared to eg locking because of external reasons. The same way that eg a regular deploy also doesn't log a lock during the first 10min of a sync despite presumably having a lock by then

@Krinkle You can use --no-log-message for those cases where an announcement is not warranted.

Documentation updated. I will update it again once a scap unlock mechanism is implemented.

dancy renamed this task from Improve behavior around global Scal lock + communicate changes to Improve behavior around global Scap lock + communicate changes. Mar 23 2023, 2:52 PM

Change 904502 had a related patch set uploaded (by Jaime Nuche; author: Jaime Nuche):

[operations/puppet@production] scap: block Scap execution on inactive deployment hosts

https://gerrit.wikimedia.org/r/904502

Mentioned in SAL (#wikimedia-operations) [2023-04-04T22:26:58Z] <mutante> deploying change to block scap execution on inactive deployment server via gerrit:904502 T330756

Change 904502 merged by Dzahn:

[operations/puppet@production] scap: block Scap execution on inactive deployment hosts

https://gerrit.wikimedia.org/r/904502

Change 905741 had a related patch set uploaded (by Hashar; author: Elukey):

[operations/puppet@production] Revert "scap: block Scap execution on inactive deployment hosts"

https://gerrit.wikimedia.org/r/905741

Change 905741 merged by Elukey:

[operations/puppet@production] Revert "scap: block Scap execution on inactive deployment hosts"

https://gerrit.wikimedia.org/r/905741

Recording the conversation from #wikimedia-operations on IRC. The Puppet patch was reverted because Scap could not deploy anything and failed with the following error message:

07:07:51 Started sync-apaches
07:08:06 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw2300.codfw.wmnet', 'mw1420.eqiad.wmnet', 'mw2289.codfw.wmnet', 'mw1486.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'mw1398.eqiad.wmnet', 'mw2259.codfw.wmnet', 'mw1366.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'mw1404.eqiad.wmnet'] (ran as mwdeploy@deploy1002.eqiad.wmnet) returned [1]: Aborting: Scap is disabled on this host. If you really need to run Scap here, you can override by passing "-Dblock_execution:False" to the call

07:08:29 sync-apaches: 100% (in-flight: 0; ok: 377; fail: 1; left: 0)
07:08:29 Per-host sync duration: average 4.1s, median 3.5s
07:08:29 rsync transfer: average 475,556 bytes/host, total 179,760,423 bytes
07:08:29 1 apaches had sync errors
07:08:29 Finished sync-apaches (duration: 00m 37s)
07:08:29 Started scap-cdb-rebuild
07:08:34 sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuild (ran as mwdeploy@deploy1002.eqiad.wmnet) returned [1]: Aborting: Scap is disabled on this host. If you really need to run Scap here, you can override by passing "-Dblock_execution:False" to the call

07:08:35 scap-cdb-rebuild: 100% (in-flight: 0; ok: 393; fail: 1; left: 0)
07:08:35 1 hosts had scap-cdb-rebuild errors
07:08:35 Finished scap-cdb-rebuild (duration: 00m 06s)
07:08:35 Started sync_wikiversions
07:08:40 sync_wikiversions: 100% (in-flight: 0; ok: 394; fail: 0; left: 0)
07:08:40 Finished sync_wikiversions (duration: 00m 05s)
07:08:40 Started php-fpm-restarts
07:08:40 Running '/usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807' on 325 host(s)
07:11:24 php-fpm-restart: 100% (in-flight: 0; ok: 325; fail: 0; left: 0)
07:11:24 Finished php-fpm-restarts (duration: 02m 43s)
07:11:24 Running /usr/local/bin/mwscript purgeMessageBlobStore.php
07:11:25 Finished scap: Backport for [[gerrit:904952|Remove akwiki from CX config]] (duration: 07m 22s)
07:11:25 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=kartik', 'Backport for [[gerrit:904952|Remove akwiki from CX config]]']' returned non-zero exit status 1.
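
The log suggests that the block applied to every Scap command on the host, including the ones a deployment server runs when it is itself a sync target (scap pull and scap cdb-rebuild on deploy1002). A minimal sketch of that distinction follows. It is not Scap's actual code: the function name, the two command sets, and the deployments_only switch are hypothetical, and the classification of commands is an assumption based only on the commands visible in the log.

```python
# Hypothetical classification, based only on commands seen in the log above.
DEPLOYMENT_COMMANDS = {"sync-world", "backport"}   # started by an operator
TARGET_COMMANDS = {"pull", "cdb-rebuild"}          # run on hosts being synced


def should_abort(command, block_execution, deployments_only=True, overrides=None):
    """Return True if `command` should refuse to run on this host.

    block_execution  -- value of the host's config flag
    deployments_only -- if False, block every command (the behavior that
                        appears to have broken the backport in the log)
    overrides        -- per-call overrides, e.g. {"block_execution": "False"},
                        mirroring the "-Dblock_execution:False" hint
    """
    overrides = overrides or {}
    if overrides.get("block_execution") == "False":
        block_execution = False
    if not block_execution:
        return False
    if deployments_only:
        return command in DEPLOYMENT_COMMANDS
    return True


if __name__ == "__main__":
    for cmd in sorted(DEPLOYMENT_COMMANDS | TARGET_COMMANDS):
        print(cmd, should_abort(cmd, block_execution=True))
```

With deployments_only=False the sketch aborts pull on a blocked host, which matches the failure recorded above. With the default it lets target-side commands through while still refusing operator-initiated deployments.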

Change 908212 had a related patch set uploaded (by Jaime Nuche; author: Jaime Nuche):

[operations/puppet@production] scap: block Scap deployments on inactive deployment hosts

https://gerrit.wikimedia.org/r/908212

Change 908212 merged by Clément Goubert:

[operations/puppet@production] scap: block Scap deployments on inactive deployment hosts

https://gerrit.wikimedia.org/r/908212

jnuche added a subscriber: TheresNoTime.

Some improvements added:

  • An operator trying to acquire a lock that is already held will be shown details of the lock, including the owner, the reason for the lock and the acquisition time
  • It is now possible to forcefully remove a global scap lock if necessary using scap lock --unlock-all <reason>. The holder of the global lock will be shown details of the user unlocking and the reason provided
  • Locks are no longer restricted to a maximum duration of one hour
  • Scap deployment-related commands can now be disabled using scap.cfg. Deployments on the secondary deployment server (deploy1002.eqiad.wmnet at the time of writing) are now disabled this way
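
The first two improvements can be illustrated with a small lock-file sketch. This is not Scap's TimeoutLock implementation: the JSON lock format, the function names, and the LockHeld exception are all hypothetical, and the sketch only shows the idea of recording owner, reason and acquisition time so they can be reported to a blocked operator or to the holder of a forcefully removed lock.

```python
import json
import os
import tempfile
import time


class LockHeld(Exception):
    """Raised when the lock file already exists; carries the holder's details."""

    def __init__(self, info):
        super().__init__(
            "locked by {owner} since {acquired}: {reason}".format(**info)
        )
        self.info = info


def acquire(path, owner, reason):
    """Create the lock file atomically, or report who currently holds it."""
    info = {
        "owner": owner,
        "reason": reason,
        "acquired": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        with open(path) as f:
            raise LockHeld(json.load(f)) from None
    with os.fdopen(fd, "w") as f:
        json.dump(info, f)
    return info


def force_unlock(path, user, reason):
    """Remove the lock regardless of owner; return a notice for the holder."""
    with open(path) as f:
        holder = json.load(f)
    os.remove(path)
    return "{}: your lock was removed by {} ({})".format(
        holder["owner"], user, reason
    )


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "scap-global-lock")
    acquire(lock_path, "alice", "testing a change")
    try:
        acquire(lock_path, "bob", "backport")
    except LockHeld as err:
        print(err)
    print(force_unlock(lock_path, "bob", "stale lock"))
```

A real implementation would also need to handle a reader racing the writer between file creation and the JSON write, which the sketch ignores.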

I've updated the following docs to reflect the changes:
https://wikitech.wikimedia.org/wiki/Switch_Datacenter/DeploymentServer
https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_0_-_preparation
https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_9_-_Post_read-only

@TheresNoTime I think you were also interested in these changes