
Stop triggering `beta-scap-sync-world` on `beta-mediawiki-config-update-eqiad` completion
Open, Needs Triage, Public

Description

The last few "sets" of beta-mediawiki-config-update-eqiad jobs have got stuck and needed manual action (i.e. repeatedly cancelling all other pending beta deployment jobs until the backlog of beta-mediawiki-config-update-eqiad jobs has completed).

To note, beta-scap-sync-world gets stuck waiting on beta-mediawiki-config-update-eqiad with the error:

#57646
(pending—Waiting for next available executor on ‘deployment-deploy03’; ‘contint1001’ doesn’t have label ‘BetaClusterBastion’; ‘contint2001’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1023’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1024’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1025’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1026’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1027’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1028’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1029’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1030’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1031’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1032’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1033’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1034’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1035’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1036’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1037’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1038’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1039’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-pkgbuilder-1001’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-pkgbuilder-1002’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-puppet-docker-1003’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-qemu-1003’ doesn’t have label ‘BetaClusterBastion’; ‘integration-castor05’ doesn’t have label ‘BetaClusterBastion’; ‘pcc-worker1001.puppet-diffs.eqiad1.wikimedia.cloud’ doesn’t have label ‘BetaClusterBastion’; ‘pcc-worker1002.puppet-diffs.eqiad1.wikimedia.cloud’ doesn’t have label ‘BetaClusterBastion’; 
‘pcc-worker1003.puppet-diffs.eqiad1.wikimedia.cloud’ doesn’t have label ‘BetaClusterBastion’)

nb. just sat and watched a set of deploys (what else do you do at 11pm?): this seems to occur when a beta-mediawiki-config-update-eqiad job is running and a beta-code-update-eqiad job is triggered via timer. Either there's no lockfile to prevent the two from running at the same time, or it's being ignored?
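For illustration only, here is a minimal sketch of the kind of lockfile guard speculated about above. It is not how the beta jobs are actually implemented: the function name `try_run` and the lock path are hypothetical, and it assumes a Linux host where `flock` advisory locks are available. A job that finds the lock already held skips its run instead of queuing behind the other job.

```python
import fcntl
import os


def try_run(lock_path, job):
    """Run job() only if an exclusive lock on lock_path can be taken.

    Returns True if the job ran, False if another holder had the lock.
    (Hypothetical helper, not part of any existing deployment tooling.)
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        job()
        return True
    finally:
        # Closing the descriptor also releases the lock if we held it.
        os.close(fd)
```

Whether skipping or waiting is the right behaviour depends on the job; skipping only makes sense if a later timer-driven run would pick up the same changes.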

This has recurred on and off, almost always with a beta-mediawiki-config-update-eqiad job starting off the deadlock. Since beta-code-update-eqiad (and thus beta-scap-sync-world) runs every 10 minutes, can we try not triggering beta-scap-sync-world on beta-mediawiki-config-update-eqiad completion and instead wait for the timer? Having both running/queuing at least appears to contribute to the deadlock.

(You can always manually trigger beta-scap-sync-world if needed.)

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2022-08-04T10:01:13Z] <TheresNoTime> clearing out stuck beta deployment jobs T314378 T72597

Change 820405 had a related patch set uploaded (by Majavah; author: Majavah):

[integration/config@master] zuul: stop triggering beta-mediawiki-config-update-eqiad jobs

https://gerrit.wikimedia.org/r/820405

Change 820406 had a related patch set uploaded (by Majavah; author: Majavah):

[integration/config@master] jjb: remove beta-mediawiki-config-update-eqiad job

https://gerrit.wikimedia.org/r/820406

Change 820405 merged by jenkins-bot:

[integration/config@master] zuul: stop triggering beta-mediawiki-config-update-eqiad jobs

https://gerrit.wikimedia.org/r/820405

Change 820406 merged by jenkins-bot:

[integration/config@master] jjb: remove beta-mediawiki-config-update-eqiad job

https://gerrit.wikimedia.org/r/820406