Automated Tuesday Train via a timer
Closed, Resolved | Public | 5 Estimated Story Points

Description

Let's make a timer that automates the Tuesday "stage-train" tasks.

Acceptance Criteria

  • Agree on a weekly time for the automated deployment. Selected: 20:00 US/Pacific, an hour after the branch cut job runs (which should finish in about 30 minutes)
  • Patch the deployment calendar to add a new window
  • User exists for running timer (T303857)
  • Timer exists on deployment servers
  • Timer determines the wmf.X branch/version for the week. Information from https://train-blockers.toolforge.org/api.php (which pulls from Phabricator) is used
    • The right thing happens if the current week's train task is declined. scap stage-train will terminate if the task status is not "open".
  • Timer executes scap stage-train --yes auto and it runs to completion without human intervention and without errors under normal conditions.
  • Timer can be re-executed if it previously failed.

Alerting on failure is T310396
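The selected 20:00 US/Pacific slot lands on a different UTC hour depending on daylight saving time, which matters if the timer schedule is expressed in UTC. The following is a minimal sketch, not the actual timer configuration: the helper name `pacific_to_utc` and the fixed-offset `PDT`/`PST` zones are illustrative assumptions, and the dates are example Mondays chosen for the demonstration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets for illustration only (assumption: no tz database lookup).
PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific Daylight Time
PST = timezone(timedelta(hours=-8), "PST")  # Pacific Standard Time


def pacific_to_utc(year, month, day, hour, tz):
    """Convert a Pacific wall-clock hour on the given date to UTC."""
    local = datetime(year, month, day, hour, 0, tzinfo=tz)
    return local.astimezone(timezone.utc)


# Monday evening in summer (PDT) and in winter (PST).
summer = pacific_to_utc(2022, 8, 29, 20, PDT)
winter = pacific_to_utc(2022, 12, 5, 20, PST)

print(summer.strftime("%Y-%m-%d %H:%M UTC"))
print(winter.strftime("%Y-%m-%d %H:%M UTC"))
```

Under these assumptions, Monday 20:00 PDT is Tuesday 03:00 UTC, while Monday 20:00 PST is Tuesday 04:00 UTC, so a schedule pinned to a UTC hour would shift by one hour in Pacific local time across the DST change.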

Event Timeline

Change 805464 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Allow deployers to sudo -u mwpresync

https://gerrit.wikimedia.org/r/805464

Change 807972 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] Boilerplate for automatic MediaWiki deployment

https://gerrit.wikimedia.org/r/807972

For the logic behind "Timer determines the wmf.X branch/version for the week" I propose that we let the automated train branch job figure that out (as it does currently) and simply look for the latest wmf/* branch and check that it isn't already in wikiversions.json.
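A minimal sketch of that check, under stated assumptions: branch names look like `wmf/1.39.0-wmf.27`, and wikiversions.json maps wiki names to values such as `php-1.39.0-wmf.27`. The function names `latest_wmf_version` and `version_to_stage` are hypothetical and are not part of scap.

```python
import re

# Assumed branch naming: wmf/<major>.<minor>.<patch>-wmf.<n>
WMF_BRANCH = re.compile(r"^wmf/(\d+)\.(\d+)\.(\d+)-wmf\.(\d+)$")


def latest_wmf_version(branches):
    """Return the highest version among wmf/* branch names, or None."""
    best = None
    for name in branches:
        match = WMF_BRANCH.match(name)
        if match is None:
            continue  # skip master and anything that is not a wmf/* branch
        # Compare numerically so that wmf.27 sorts after wmf.9.
        key = tuple(int(part) for part in match.groups())
        if best is None or key > best[0]:
            best = (key, name[len("wmf/"):])
    return best[1] if best else None


def version_to_stage(branches, wikiversions):
    """Return the latest wmf version if no wiki uses it yet, else None."""
    latest = latest_wmf_version(branches)
    if latest is None:
        return None
    in_use = set()
    for value in wikiversions.values():
        # Assumed value format: "php-<version>".
        in_use.add(value[len("php-"):] if value.startswith("php-") else value)
    return None if latest in in_use else latest


branches = ["master", "wmf/1.39.0-wmf.9", "wmf/1.39.0-wmf.26", "wmf/1.39.0-wmf.27"]
print(version_to_stage(branches, {"testwiki": "php-1.39.0-wmf.26"}))
print(version_to_stage(branches, {"testwiki": "php-1.39.0-wmf.27"}))
```

The first call returns the new version because no wiki runs it yet; the second returns `None` because the latest branch is already in use, so there would be nothing to stage.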

Alternately, the branch cut job could write the branch name and/or version to an artifact (say MW_VERSION) and we could poll https://releases-jenkins.wikimedia.org/job/Automatic%20branch%20cut/lastSuccessfulBuild/artifact/MW_VERSION for changes

> Alternately, the branch cut job could write the branch name and/or version to an artifact (say MW_VERSION) and we could poll https://releases-jenkins.wikimedia.org/job/Automatic%20branch%20cut/lastSuccessfulBuild/artifact/MW_VERSION for changes

I like this idea.
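The polling idea could look roughly like the sketch below. It assumes the proposed MW_VERSION artifact exists and contains a single version string; the task only suggests it ("say MW_VERSION"). The names `fetch_mw_version` and `wait_for_new_version`, the attempt count, and the delay are all illustrative assumptions.

```python
import time
import urllib.request

# URL from the proposal above; the MW_VERSION artifact itself is hypothetical.
MW_VERSION_URL = (
    "https://releases-jenkins.wikimedia.org/job/Automatic%20branch%20cut/"
    "lastSuccessfulBuild/artifact/MW_VERSION"
)


def fetch_mw_version(url=MW_VERSION_URL, timeout=10):
    """Download the artifact and return its contents as a stripped string."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def wait_for_new_version(last_seen, fetch=fetch_mw_version,
                         attempts=30, delay=60.0, sleep=time.sleep):
    """Poll until the artifact differs from last_seen; None if it never does."""
    for attempt in range(attempts):
        current = fetch()
        if current and current != last_seen:
            return current
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next poll
    return None
```

`fetch` and `sleep` are injectable so the loop can be exercised without network access, for example by passing a function that returns canned values.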

Change 809220 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Add auto mode to stage-train

https://gerrit.wikimedia.org/r/809220

Change 809297 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Setup .gitconfig for mwpresync system user

https://gerrit.wikimedia.org/r/809297

Change 809297 merged by Alexandros Kosiaris:

[operations/puppet@production] Setup .gitconfig for mwpresync system user

https://gerrit.wikimedia.org/r/809297

Change 809712 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Allow mwbuilder group to access mwdeploy key

https://gerrit.wikimedia.org/r/809712

Change 809712 merged by Alexandros Kosiaris:

[operations/puppet@production] Allow mwbuilder group to access mwdeploy key

https://gerrit.wikimedia.org/r/809712

Change 810069 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Move serializing_lock_file to a setgid directory

https://gerrit.wikimedia.org/r/810069

Change 810397 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Move serializing_lock_file to a setgid directory

https://gerrit.wikimedia.org/r/810397

Change 810397 merged by jenkins-bot:

[mediawiki/tools/scap@master] Move serializing_lock_file to a setgid directory

https://gerrit.wikimedia.org/r/810397

Change 810069 abandoned by Ahmon Dancy:

[mediawiki/tools/scap@master] Add directory mode check to TimeoutLock

Reason:

Decided not to go this route.

https://gerrit.wikimedia.org/r/810069

Change 809220 merged by jenkins-bot:

[mediawiki/tools/scap@master] Add auto mode to stage-train

https://gerrit.wikimedia.org/r/809220

Change 815329 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap.cfg.erb: Set gerrit_push_user: trainbranchbot

https://gerrit.wikimedia.org/r/815329

Change 815329 merged by RLazarus:

[operations/puppet@production] scap.cfg.erb: Set gerrit_push_user: trainbranchbot

https://gerrit.wikimedia.org/r/815329

Change 816221 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] deployment_server: add gerrit host key for mwpresync pushing to gerrit

https://gerrit.wikimedia.org/r/816221

Change 816715 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:gerrit: Export sshkey for gerrit shared services

https://gerrit.wikimedia.org/r/816715

Change 816221 abandoned by Dzahn:

[operations/puppet@production] deployment_server: add gerrit host key for mwpresync pushing to gerrit

Reason:

per John's explanation - would lead to a double edit on each puppet run

https://gerrit.wikimedia.org/r/816221

Change 816715 merged by Jbond:

[operations/puppet@production] P:gerrit: Export sshkey for gerrit shared services

https://gerrit.wikimedia.org/r/816715

Change 819180 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Add systemd timer to run scap stage-train on Tuesday morning

https://gerrit.wikimedia.org/r/819180

Change 819506 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:gerrit: add ipaddress to host_aliases

https://gerrit.wikimedia.org/r/819506

Change 819506 merged by Jbond:

[operations/puppet@production] P:gerrit: add ipaddress to host_aliases

https://gerrit.wikimedia.org/r/819506

Change 823209 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[labs/private@master] Add placeholder for scap's phabricator API token

https://gerrit.wikimedia.org/r/823209

Change 823209 merged by Dzahn:

[labs/private@master] Add placeholder for scap's phabricator API token

https://gerrit.wikimedia.org/r/823209

Change 824288 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Use train-blockers.toolforge for scap stage-train auto information

https://gerrit.wikimedia.org/r/824288

Change 824288 merged by jenkins-bot:

[mediawiki/tools/scap@master] Use train-blockers.toolforge for scap stage-train auto information

https://gerrit.wikimedia.org/r/824288

hashar subscribed.

Change 826361 had a related patch set uploaded (by Thcipriani; author: Thcipriani):

[mediawiki/tools/release@master] Calendar: Automated testwiki deploy

https://gerrit.wikimedia.org/r/826361

Change 819180 merged by Dzahn:

[operations/puppet@production] Add systemd timer to run scap stage-train on Tuesday morning

https://gerrit.wikimedia.org/r/819180

Change 826361 merged by jenkins-bot:

[mediawiki/tools/release@master] Calendar: Automated testwiki deploy

https://gerrit.wikimedia.org/r/826361

dancy triaged this task as Medium priority. Aug 25 2022, 8:41 PM
dancy updated the task description.

The timer/service recently added to deploy1002 failed and alerted in Icinga as "CRITICAL - degraded: The following units failed: train-presync.service" (but I don't think anyone gets notified via email; I just saw it by coincidence).

When checking the status manually it said:

● train-presync.service - Perform beginning-of-week train operations
   Loaded: loaded (/lib/systemd/system/train-presync.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2022-08-30 03:00:22 UTC; 1 day 18h ago
  Process: 16841 ExecStart=/usr/local/bin/systemd-timer-mail-wrapper -T root@deploy1002.eqiad.wmnet /usr/bin/scap stage-train --yes auto (code=exited, status=70)
 Main PID: 16841 (code=exited, status=70)

Aug 30 03:00:22 deploy1002 scap[16841]:     raise child_exception_type(errno_num, err_msg, err_filename)
Aug 30 03:00:22 deploy1002 scap[16841]: FileNotFoundError: [Errno 2] No such file or directory: '/srv/mediawiki-staging/php-1.39.0-wmf.27/extensions/GrowthExperiments': '/s
Aug 30 03:00:22 deploy1002 scap[16841]: 03:00:22 apply-patches failed: <FileNotFoundError> [Errno 2] No such file or directory: '/srv/mediawiki-staging/php-1.39.0-wmf.27/ex
Aug 30 03:00:22 deploy1002 scap[16841]: Applying patch /srv/patches/1.39.0-wmf.27/core/01-T309894.patch in /srv/mediawiki-staging/php-1.39.0-wmf.27
Aug 30 03:00:22 deploy1002 scap[16841]: Applying patch /srv/patches/1.39.0-wmf.27/core/02-T307278.patch in /srv/mediawiki-staging/php-1.39.0-wmf.27
Aug 30 03:00:22 deploy1002 scap[16841]: Applying patch /srv/patches/1.39.0-wmf.27/extensions/GrowthExperiments/01-T313205.patch in /srv/mediawiki-staging/php-1.39.0-wmf.27
Aug 30 03:00:22 deploy1002 scap[16841]: [183B blob data]
Aug 30 03:00:22 deploy1002 scap[16841]: [1B blob data]
Aug 30 03:00:22 deploy1002 systemd[1]: train-presync.service: Main process exited, code=exited, status=70/SOFTWARE
Aug 30 03:00:22 deploy1002 systemd[1]: train-presync.service: Failed with result 'exit-code'.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=deploy1002&service=Check+systemd+state indicates the status

Should email notifications be added for releng-team (via the systemd timer itself, if not via Icinga)?

Hi @Dzahn, thanks for checking in. The service is configured for email:

systemd::timer::job { 'train-presync':
    ensure                  => $primary_deploy_ensure,
    description             => 'Perform beginning-of-week train operations',
    user                    => 'mwpresync',
    command                 => '/usr/bin/scap stage-train --yes auto',
    send_mail               => true,
    send_mail_only_on_error => false,
    environment             => {'MAILTO' => 'releng@lists.wikimedia.org'},
    interval                => {'start' => 'OnCalendar', 'interval' => $auto_deploy_interval},
}

xref: https://gerrit.wikimedia.org/r/c/operations/puppet/+/819180/9/modules/profile/manifests/mediawiki/deployment/server.pp#179

An email was sent to releng@lists.wikimedia.org and was captured in the moderator queue. @thcipriani released it and set things up to allow it to pass through freely next time around.

@dancy Ah, you are right, it's already configured. Thanks for the extra details. Sounds good :)

Change 807972 abandoned by Hashar:

[operations/puppet@production] Boilerplate for automatic MediaWiki deployment

Reason:

most probably obsolete now

https://gerrit.wikimedia.org/r/807972

dancy updated the task description.
dancy set Final Story Points to 11.