
train presync failed
Closed, ResolvedPublic

Description

The systemd timer in charge of preparing the MediaWiki train has failed.

FAIL: train-presync
Systemd timer ran the following command:

    /usr/bin/scap stage-train -Dfull_image_build:True --yes auto

Its return value was 70 and emitted the following output:

<...>
04:36:00 Started sync-prod-k8s
04:36:13 K8s deployment progress:   0% (ok: 0; fail: 0; left: 2532)             
04:36:44 K8s deployment progress:   1% (ok: 27; fail: 0; left: 2505)            
04:37:36 K8s deployment progress:   2% (ok: 53; fail: 0; left: 2479)            
04:38:14 K8s deployment progress:   2% (ok: 62; fail: 0; left: 2470)            
04:38:44 K8s deployment progress:  12% (ok: 316; fail: 0; left: 2216)           
04:39:15 K8s deployment progress:  16% (ok: 428; fail: 0; left: 2104)           
04:39:45 K8s deployment progress:  20% (ok: 520; fail: 0; left: 2012)           
04:40:16 K8s deployment progress:  26% (ok: 660; fail: 0; left: 1872)           
04:40:46 K8s deployment progress:  35% (ok: 892; fail: 0; left: 1640)           
04:41:16 K8s deployment progress:  43% (ok: 1097; fail: 0; left: 1435)          
04:41:47 K8s deployment progress:  50% (ok: 1283; fail: 0; left: 1249)          
04:42:17 K8s deployment progress:  57% (ok: 1459; fail: 0; left: 1073)          
04:42:47 K8s deployment progress:  65% (ok: 1666; fail: 0; left: 866)           
04:43:17 K8s deployment progress:  72% (ok: 1828; fail: 0; left: 704)           
04:43:48 K8s deployment progress:  77% (ok: 1954; fail: 0; left: 578)           
04:44:18 K8s deployment progress:  84% (ok: 2151; fail: 0; left: 381)           
04:44:49 K8s deployment progress:  89% (ok: 2268; fail: 0; left: 264)           
04:45:20 K8s deployment progress:  91% (ok: 2307; fail: 0; left: 225)           
04:45:58 K8s deployment progress:  92% (ok: 2334; fail: 0; left: 198)           
04:46:30 K8s deployment progress:  92% (ok: 2351; fail: 0; left: 181)           
04:47:01 K8s deployment progress:  90% (ok: 2289; fail: 0; left: 243)           
04:47:31 K8s deployment progress:  88% (ok: 2229; fail: 0; left: 303)         
04:47:31 K8s deployment progress:  88% (ok: 2229; fail: 0; left: 303)           
04:47:37 Command '['helmfile', '-e', 'codfw', '--selector', 'name=next', 'apply']' returned non-zero exit status 1.
04:47:37 Stdout/stderr follows:
<...>
04:48:02 K8s deployment progress:  86% (ok: 2186; fail: 0; left: 346)           
04:48:03 Command '['helmfile', '-e', 'codfw', '--selector', 'name=next', 'apply']' returned non-zero exit status 1.

04:48:03 K8s deployment progress:  86% (ok: 2186; fail: 0; left: 346)           
04:48:03 K8s deployment to stage production failed: K8s Deployment had the following errors:
 codfw: Deployment of mw-web-next failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=next', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-next failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=next', 'apply']' returned non-zero exit status 1.

So it looks like deploying to mw-api-ext-next failed due to a 10-minute timeout.
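
To see how far the rollout got before the ok count started dropping, the progress lines can be parsed mechanically. This is a minimal sketch in Python, assuming the line format matches the output quoted above; `parse_progress` and `summarize` are hypothetical helper names, not part of scap, and the sample only includes a few of the lines from the log.

```python
import re
from datetime import datetime

# Pattern for the progress lines as they appear in the quoted output.
PROGRESS = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) K8s deployment progress:\s+(\d+)% "
    r"\(ok: (\d+); fail: (\d+); left: (\d+)\)"
)


def parse_progress(lines):
    """Return (time, ok, left) tuples, one per progress line; other lines are skipped."""
    samples = []
    for line in lines:
        m = PROGRESS.match(line)
        if m:
            t = datetime.strptime(m.group(1), "%H:%M:%S")
            samples.append((t, int(m.group(3)), int(m.group(5))))
    return samples


def summarize(samples):
    """Report the peak ok count and timings relative to the first sample."""
    first_t = samples[0][0]
    peak_t, peak_ok, _ = max(samples, key=lambda s: s[1])
    last_t, last_ok, _ = samples[-1]
    return {
        "peak_ok": peak_ok,
        "peak_after_s": int((peak_t - first_t).total_seconds()),
        "last_ok": last_ok,
        "elapsed_s": int((last_t - first_t).total_seconds()),
    }


# A few lines copied from the log above.
log = [
    "04:36:13 K8s deployment progress:   0% (ok: 0; fail: 0; left: 2532)",
    "04:46:30 K8s deployment progress:  92% (ok: 2351; fail: 0; left: 181)",
    "04:47:31 K8s deployment progress:  88% (ok: 2229; fail: 0; left: 303)",
    "04:47:37 Stdout/stderr follows:",
    "04:48:03 K8s deployment progress:  86% (ok: 2186; fail: 0; left: 346)",
]
print(summarize(parse_progress(log)))
```

On these sample lines the ok count peaks roughly ten minutes after the first progress report and then regresses, which is consistent with a timeout, though it does not prove one on its own.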

Event Timeline

hashar triaged this task as Unbreak Now! priority. Mar 4 2025, 7:40 AM
$ systemctl status train-presync.service
● train-presync.service - Perform beginning-of-week train operations
     Loaded: loaded (/lib/systemd/system/train-presync.service; static)
     Active: inactive (dead) since Tue 2025-03-04 04:53:29 UTC; 2h 49min ago
TriggeredBy: ● train-presync.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 2174241 ExecStart=/usr/local/bin/systemd-timer-mail-wrapper --subject train-presync --mail-to releng@lists.wikimedia.org /usr/bin/scap stage-train -Dfull_image_build:True --yes auto (code=exited, status=70)
   Main PID: 2174241 (code=exited, status=70)

Short of manually copy-pasting the command line and reproducing the unit's environment, one would need to run systemctl start train-presync.

Oh I can do: scap stage-train -Dfull_image_build:True --yes auto

hashar claimed this task.
hashar added a subscriber: dcausse.

I cancelled the scap stage-train since the backport window was starting. @dcausse did a backport and as a result ended up pushing the train, which unfortunately takes a while.

From the information @jnuche gave me, when we run scap stage-train -Dfull_image_build:True --yes auto, the full image build puts the code in a single layer. That avoids piling up layers on top of each other and reaching a limit of 127 layers for an image.

I diffed the layers of the php7.4 images generated at 4am and 8am and there were various changes (P74049 - requires WMF-NDA). They affected only 27 files, or 672 kilobytes.

When the image has been successfully built but the deployment failed at a later stage, we should output the command to retry without full_image_build:True, so that only the delta is piled up on top of the existing image.
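
As an illustration of that suggestion, here is a minimal sketch in Python of deriving the retry command from the one that failed. `retry_command` is a hypothetical helper, not existing scap code, and it only handles the joined `-Dname:value` form used in this task.

```python
import shlex


def retry_command(argv, define="full_image_build"):
    """Return argv without any -D<define>:<value> option (joined form only)."""
    prefix = f"-D{define}:"
    return [arg for arg in argv if not arg.startswith(prefix)]


# The command the timer ran, as quoted in this task.
failed = ["scap", "stage-train", "-Dfull_image_build:True", "--yes", "auto"]

print("To retry without a full image build, run:")
print("  " + shlex.join(retry_command(failed)))
```

With the command from this task, the suggested retry would be `scap stage-train --yes auto`.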

Change #1124462 had a related patch set uploaded (by Hashar; author: Ahmon Dancy):

[operations/puppet@production] deployment server: Don't pass -Dfull_image_build:True to scap stage-train

https://gerrit.wikimedia.org/r/1124462

Change #1124462 merged by Clément Goubert:

[operations/puppet@production] deployment server: Don't pass -Dfull_image_build:True to scap stage-train

https://gerrit.wikimedia.org/r/1124462

The root cause is that the Kubernetes deployment reached a ten-minute timeout and aborted. Those timeouts have been noticed by @Scott_French, who is working on the PHP 8.1 deployment (T383845). He has made a patch to raise the timeout: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1125265

Change #1130947 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] Allow releng to resume train related systemd timers

https://gerrit.wikimedia.org/r/1130947

Change #1130947 merged by Alexandros Kosiaris:

[operations/puppet@production] Allow releng to resume train related systemd timers

https://gerrit.wikimedia.org/r/1130947

The change allowing Release-Engineering-Team members to start train-presync and train-clean and to view their logs has been merged and deployed.