
Move nightly image build from releases-jenkins to deployment.eqiad.wmnet
Closed, Resolved · Public · Feature

Description

Adopt the behavior described in the design agreements that came out of T379683: [FY24-25 WE6.2.6] Create design document for Pretrain (née Group -1) deployment.

A new wmf/next image will be created once per day starting at 01:00 UTC. This image will be deployed via an automated process triggered by a scheduled task running from the active deployment server Monday through Thursday at 02:00 UTC.

For now, the nightly branch cut process stays on deployment-jenkins. The only job that is moving is the image build and publish for that branch.
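
As a rough illustration of the deployment schedule described above (Monday through Thursday at 02:00 UTC), the following Python sketch computes the next deployment slot from a given time. The function name and constants are hypothetical; the real schedule is enforced by a scheduled task on the deployment server, not by code like this.

```python
from datetime import datetime, timedelta, timezone

# Assumptions taken from the task description: deploys happen at 02:00 UTC,
# Monday through Thursday. The names below are illustrative only.
DEPLOY_HOUR_UTC = 2
DEPLOY_WEEKDAYS = {0, 1, 2, 3}  # datetime.weekday(): Monday=0 ... Thursday=3


def next_deploy_time(now: datetime) -> datetime:
    """Return the first deployment slot at or after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(
        hour=DEPLOY_HOUR_UTC, minute=0, second=0, microsecond=0
    )
    # If today's slot has already passed, start looking from tomorrow.
    if candidate < now:
        candidate += timedelta(days=1)
    # Skip Friday, Saturday and Sunday.
    while candidate.weekday() not in DEPLOY_WEEKDAYS:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    friday = datetime(2025, 7, 11, 3, 0, tzinfo=timezone.utc)
    print(next_deploy_time(friday).isoformat())
```

Given a Friday morning, the sketch skips the weekend and lands on the following Monday's 02:00 UTC slot.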

Event Timeline

bd808 triaged this task as High priority. Jul 7 2025, 8:32 PM
bd808 changed the subtype of this task from "Task" to "Feature Request".

@dduvall There's a hazard in the way the MediaWiki branch and publish WMF single-version image job handles the wmf/next branches. It first deletes the branches, then creates fresh ones in the relevant repos. It then creates a commit to set up .gitmodules, etc., pushes it to Gerrit, and waits for CI to finish. If CI fails (which happens about 10% of the time), we're left with an unusable wmf/next branch until the next run. If scap prep next runs before MediaWiki branch and publish WMF single-version image runs again, it will fail like so:

dancy@deploy1003:/srv/mediawiki-staging$ scap prep next
21:56:24 Started scap prep next
21:56:24 Copying patches from /srv/patches/1.45.0-wmf.9 to /srv/patches/next
21:56:24 Clone https://gerrit.wikimedia.org/r/mediawiki/core (wmf/next branch) in /srv/mediawiki-staging/php-next
21:56:41 https://gerrit.wikimedia.org/r/mediawiki/core checked out at commit a9e4ca532000bcf0f190da549db728e71b1c943c
21:56:41 Finished scap prep next (duration: 00m 16s)
/srv/mediawiki-staging/php-next/.gitmodules does not exist. Did the train branch commit get merged?

dancy merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/888

utils.py: Make select_latest_patches treat /srv/patches/next as latest
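
To illustrate the idea in that commit title, here is a minimal sketch of picking the latest patches directory while treating `next` as newer than any numbered version. This is not scap's actual `select_latest_patches` code; the function names, the regex, and the assumption that directories are named like `1.45.0-wmf.9` are illustrative guesses based on the log output above.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed directory naming, e.g. "1.45.0-wmf.9" (seen in the scap log above).
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-wmf\.(\d+)$")


def sort_key(name: str) -> Optional[Tuple[int, Tuple[int, ...]]]:
    """Return a sortable key, or None for names that are not versions."""
    if name == "next":
        # Rank "next" above every numbered version.
        return (1, ())
    match = VERSION_RE.match(name)
    if match is None:
        return None
    return (0, tuple(int(part) for part in match.groups()))


def select_latest(names: Iterable[str]) -> Optional[str]:
    """Pick the newest directory name; "next" wins if present."""
    keyed = []
    for name in names:
        key = sort_key(name)
        if key is not None:
            keyed.append((key, name))
    return max(keyed)[1] if keyed else None


if __name__ == "__main__":
    print(select_latest(["1.45.0-wmf.9", "1.45.0-wmf.10"]))
    print(select_latest(["1.45.0-wmf.9", "next"]))
```

Comparing numeric tuples rather than strings keeps `wmf.10` ahead of `wmf.9`, which a plain string sort would get wrong.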

Mentioned in SAL (#wikimedia-operations) [2025-07-16T19:14:51Z] <dancy@deploy1003> Started scap build-images: Testing T398873

Mentioned in SAL (#wikimedia-operations) [2025-07-16T19:19:25Z] <dancy@deploy1003> Finished scap build-images: Testing T398873 (duration: 04m 34s)

"If CI fails (which happens about 10% of the time), we're left with an unusable wmf/next branch until the next run"

The several most recent failures look to be related to flaky selenium tests.

Perhaps we could configure Zuul to skip these tests for the initial branch commit? (They should not be skipped for backports to the branch.)

@dduvall I'd like to see MediaWiki branch and publish WMF single-version image changed so that instead of destroying and recreating the wmf/next branch each time it runs, it updates wmf/next if it already exists. This means being able to handle added/dropped extensions.

Wouldn't that still require an update to the .gitmodules in mediawiki/core? Meaning a new change submitted to Gerrit and a CI run (and potential failure).

Yes, but in the meantime scap prep next would continue to work correctly (assuming at least one prior successful branch cut was merged). As it stands, if MediaWiki branch and publish WMF single-version image fails, a subsequent scap prep next will fail.

That makes sense. I was thinking more about the root cause, but making scap prep next more tolerant of failure seems just as important.
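
The update-in-place approach discussed above needs to work out which extensions were added or dropped since the last successful branch cut. A minimal sketch of that planning step, assuming the job can list the submodule paths on the existing wmf/next branch and the paths wanted for the new cut; all names here (`BranchUpdatePlan`, `plan_branch_update`, the example paths) are hypothetical and not taken from the real job:

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BranchUpdatePlan:
    to_add: List[str]     # wanted now, missing from the existing branch
    to_drop: List[str]    # on the existing branch, no longer wanted
    to_update: List[str]  # present in both; only the pointer needs bumping


def plan_branch_update(
    existing: Iterable[str], wanted: Iterable[str]
) -> BranchUpdatePlan:
    """Compare submodule paths without deleting the existing branch."""
    existing_set, wanted_set = set(existing), set(wanted)
    return BranchUpdatePlan(
        to_add=sorted(wanted_set - existing_set),
        to_drop=sorted(existing_set - wanted_set),
        to_update=sorted(existing_set & wanted_set),
    )


if __name__ == "__main__":
    plan = plan_branch_update(
        existing=["extensions/Alpha", "extensions/Beta"],
        wanted=["extensions/Beta", "extensions/Gamma"],
    )
    print(plan)
```

Because nothing is deleted up front, a failed CI run on the new .gitmodules change would leave the previous wmf/next state in place, which is the property the comments above are asking for.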

Change #1172678 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] deployment_server: Add pretrain systemd timer

https://gerrit.wikimedia.org/r/1172678

Change #1172678 merged by Clément Goubert:

[operations/puppet@production] deployment_server: Add pretrain systemd timer

https://gerrit.wikimedia.org/r/1172678

Change #1173446 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] data.yaml: Allow release-engineering to administer pretrain timer

https://gerrit.wikimedia.org/r/1173446

Mentioned in SAL (#wikimedia-operations) [2025-07-28T19:46:42Z] <dancy@deploy1003> Started deploy [releng/jenkins-deploy@b89eed0] (releasing): Disabling the MediaWiki publish WMF single-version image job (T398873)

Mentioned in SAL (#wikimedia-operations) [2025-07-28T19:47:30Z] <dancy@deploy1003> Finished deploy [releng/jenkins-deploy@b89eed0] (releasing): Disabling the MediaWiki publish WMF single-version image job (T398873) (duration: 01m 11s)

Change #1173446 merged by Clément Goubert:

[operations/puppet@production] data.yaml: Allow release-engineering to administer pretrain timer

https://gerrit.wikimedia.org/r/1173446

Change #1173991 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] pretrain: Use bash to execute multiple commands

https://gerrit.wikimedia.org/r/1173991

Change #1173991 merged by Dzahn:

[operations/puppet@production] pretrain: Use bash to execute multiple commands

https://gerrit.wikimedia.org/r/1173991

Change #1174051 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] pretrain: Use && instead of ';' to separate commands

https://gerrit.wikimedia.org/r/1174051

Change #1174051 merged by Dzahn:

[operations/puppet@production] pretrain: Use && instead of ';' to separate commands

https://gerrit.wikimedia.org/r/1174051
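
The two pretrain changes above concern how a timer runs more than one command. As a small illustration of why `&&` is the safer separator, the sketch below runs two stand-in commands through bash from Python. The commands `false` and `echo deploy` are placeholders, not the commands the timer actually runs, and the sketch assumes `bash` is available on the PATH.

```python
import subprocess
from typing import Tuple


def run(script: str) -> Tuple[int, str]:
    """Run a script with `bash -c` and return (exit code, stdout)."""
    proc = subprocess.run(
        ["bash", "-c", script], capture_output=True, text=True
    )
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    # With ';' the second command runs even though the first one failed,
    # and the overall exit code is that of the last command.
    print(run("false; echo deploy"))
    # With '&&' the second command is skipped and the failure is reported.
    print(run("false && echo deploy"))
```

With `;`, a failed first step is hidden behind a successful second step; with `&&`, the chain stops and the non-zero exit status reaches the caller.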

Change #1230952 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] pretrain: Run one hour later, at 02:00UTC

https://gerrit.wikimedia.org/r/1230952

Change #1230952 merged by Dzahn:

[operations/puppet@production] pretrain: Run one hour later, at 02:00UTC

https://gerrit.wikimedia.org/r/1230952

Mentioned in SAL (#wikimedia-operations) [2026-01-26T13:07:53Z] <mutante> pre-train sync shifted to one hour later T398873