
CI is overwhelmed and lots of jobs are failing randomly (2025-04-29)
Closed, Resolved · Public

Description

Title. Looking at the agent stats (retouched slightly to fit more panels):

[Attachment: image.png (agent stats panels), 962 KB]

It seems that something is eating up resources. In turn, this causes lots of jobs to fail with various unrelated error messages, for example:

https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/11711/console
20:06:02 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/mediawiki/extensions/WikiEditor
20:06:02 Traceback (most recent call last):
20:06:02   File "/usr/local/lib/python3.9/dist-packages/zuul/merger/merger.py", line 51, in __init__
20:06:02     self._ensure_cloned()
20:06:02   File "/usr/local/lib/python3.9/dist-packages/zuul/merger/merger.py", line 63, in _ensure_cloned
20:06:02     git.Repo.clone_from(self.remote_url, self.local_path)
20:06:02   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 1071, in clone_from
20:06:02     return cls._clone(git, url, to_path, GitCmdObjectDB, progress, multi_options,
20:06:02   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 1001, in _clone
20:06:02     finalize_process(proc, stderr=stderr)
20:06:02   File "/usr/lib/python3/dist-packages/git/util.py", line 329, in finalize_process
20:06:02     proc.wait(**kwargs)
20:06:02   File "/usr/lib/python3/dist-packages/git/cmd.py", line 455, in wait
20:06:02     raise GitCommandError(self.args, status, errstr)
20:06:02 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
20:06:02   cmdline: git clone -v -- https://gerrit.wikimedia.org/r/mediawiki/extensions/WikiEditor /workspace/src/extensions/WikiEditor
20:06:02   stderr: 'Cloning into '/workspace/src/extensions/WikiEditor'...
20:06:02 POST git-upload-pack (142 bytes)
20:06:02 POST git-upload-pack (gzip 1995 to 954 bytes)
20:06:02 error: index-pack died of signal 9
20:06:02 fatal: index-pack failed
20:06:02 '
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-noselenium/91935/console
19:04:06 * npm install in /workspace/src, in 46.203 s
19:04:06 DEBUG:quibble.cmd:No quibble.yaml in /workspace/src/extensions/CampaignEvents
19:04:06 INFO:backend.MySQL:Terminating MySQL
19:04:08 Traceback (most recent call last):
19:04:08   File "/usr/local/bin/quibble", line 8, in <module>
19:04:08     sys.exit(main())
19:04:08   File "/usr/local/lib/python3.7/dist-packages/quibble/cmd.py", line 995, in main
19:04:08     dry_run=args.dry_run,
19:04:08   File "/usr/local/lib/python3.7/dist-packages/quibble/cmd.py", line 643, in execute
19:04:08     raise called_process_error
19:04:08   File "/usr/local/lib/python3.7/dist-packages/quibble/cmd.py", line 629, in execute
19:04:08     quibble.commands.execute_command(command)
19:04:08   File "/usr/local/lib/python3.7/dist-packages/quibble/commands.py", line 48, in execute_command
19:04:08     command.execute()
19:04:08   File "/usr/local/lib/python3.7/dist-packages/quibble/commands.py", line 1313, in execute
19:04:08     raise error
19:04:08 subprocess.CalledProcessError: Command '['npm', 'ci']' died with <Signals.SIGKILL: 9>.
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-selenium/88019/console
19:19:42 Execution of 4 workers started at 2025-04-29T19:19:42.631Z
19:19:42 
19:19:44 [0-0] RUNNING in chrome - /tests/selenium/specs/editEventRegistration.js
19:20:57 [0-0] RETRYING in chrome - /tests/selenium/specs/editEventRegistration.js
19:20:59 [0-0] RUNNING in chrome - /tests/selenium/specs/editEventRegistration.js
19:32:27 [0-0] 2025-04-29T19:32:25.477Z ERROR @wdio/local-runner: Failed launching test session: Error: Protocol error (Runtime.evaluate): Session closed. Most likely the page has been closed.
19:32:27 [0-0]     at CDPSession.send (/workspace/src/extensions/CampaignEvents/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:218:35)
19:32:27 [0-0]     at ExecutionContext._evaluateInternal (/workspace/src/extensions/CampaignEvents/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:166:18)
19:32:27 [0-0]     at ExecutionContext.evaluate (/workspace/src/extensions/CampaignEvents/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:110:27)
19:32:27 [0-0]     at DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:230:36)
19:32:27 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
19:32:27 [0-0]     at async Browser.wrappedCommand (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:123:13)
19:32:27 [0-0]     at async Browser.wrapCommandFn (/workspace/src/extensions/CampaignEvents/node_modules/@wdio/utils/build/shim.js:137:29)
19:32:27 [0-0]     at async Runner.endSession (/workspace/src/extensions/CampaignEvents/node_modules/@wdio/runner/build/index.js:348:9)
19:32:27 [0-0]     at async Runner.run (/workspace/src/extensions/CampaignEvents/node_modules/@wdio/runner/build/index.js:164:13)
19:32:27 [0-0] FAILED in chrome - /tests/selenium/specs/editEventRegistration.js (1 retries)
19:35:20 [0-1] RUNNING in chrome - /tests/selenium/specs/enableEventRegistration.js
19:36:27 [0-1] 2025-04-29T19:36:23.928Z ERROR @wdio/runner: Error: connect ECONNREFUSED 127.0.0.1:38059
19:36:27 [0-1]     at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1611:16)
19:36:29 [0-1] RETRYING in chrome - /tests/selenium/specs/enableEventRegistration.js
19:39:49 [0-1] RUNNING in chrome - /tests/selenium/specs/enableEventRegistration.js
19:40:55 [0-1] Error in "Enable Event Registration @daily."before all" hook for "is configured correctly""
19:40:55 Error: Timeout of 60000ms exceeded. The execution in the test "Enable Event Registration @daily "before all" hook for "is configured correctly"" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CampaignEvents/tests/selenium/specs/enableEventRegistration.js)
19:40:55     at createTimeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/errors.js:498:15)
19:40:55     at Runnable._timeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:429:10)
19:40:55     at Timeout.<anonymous> (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:244:24)
19:40:55     at listOnTimeout (node:internal/timers:581:17)
19:40:55     at process.processTimers (node:internal/timers:519:7)
19:42:54 [0-1] FAILED in chrome - /tests/selenium/specs/enableEventRegistration.js (1 retries)
19:42:58 [0-2] RUNNING in chrome - /tests/selenium/specs/eventPage.js
19:46:08 [0-2] Error in "Event page.can have one user register publicly"
19:46:18 Error: Timeout of 60000ms exceeded. The execution in the test "Event page can have one user register publicly" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CampaignEvents/tests/selenium/specs/eventPage.js)
19:46:18     at createTimeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/errors.js:498:15)
19:46:18     at Runnable._timeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:429:10)
19:46:18     at Timeout.<anonymous> (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:244:24)
19:46:18     at listOnTimeout (node:internal/timers:581:17)
19:46:18     at process.processTimers (node:internal/timers:519:7)
19:49:38 [0-2] Error in "Event page.can have one user register privately"
19:49:38 Error: Timeout of 60000ms exceeded. The execution in the test "Event page can have one user register privately" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CampaignEvents/tests/selenium/specs/eventPage.js)
19:49:38     at createTimeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/errors.js:498:15)
19:49:38     at Runnable._timeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:429:10)
19:49:38     at Timeout.<anonymous> (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:244:24)
19:49:38     at listOnTimeout (node:internal/timers:581:17)
19:49:38     at process.processTimers (node:internal/timers:519:7)
19:50:48 [0-2] Error in "Event page.can have a user cancel registration"
19:50:48 Error: Timeout of 60000ms exceeded. The execution in the test "Event page can have a user cancel registration" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CampaignEvents/tests/selenium/specs/eventPage.js)
19:50:48     at createTimeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/errors.js:498:15)
19:50:48     at Runnable._timeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:429:10)
19:50:48     at Timeout.<anonymous> (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:244:24)
19:50:48     at listOnTimeout (node:internal/timers:581:17)
19:50:48     at process.processTimers (node:internal/timers:519:7)
19:53:06 Build timed out (after 60 minutes). Marking the build as failed.
19:53:06 Build was aborted
https://integration.wikimedia.org/ci/job/mwext-php74-phan/92805/console
19:08:01 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 1890 / 9860 ( 19%) 504MB
19:08:10 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 1944 / 9860 ( 19%) 520MB
19:08:32 ░░░░░░░░░░░░░░░░░░░░░░░░░/run.sh: line 10:     8 Killed                  /run-generic.sh "$SOURCE_ROOT/$THING_SUBNAME" "$@"
19:08:33 Build step 'Execute shell' marked build as failure

Some of the failures are in the gate-and-submit pipeline and cause the queue to be reset over and over again, effectively preventing merges.

Details

Other Assignee
Daimona

Event Timeline

From #wikimedia-releng:

<thcipriani> yes...there are a massive number of ffmpeg processes running on integration-agent-docker-1048
<thcipriani> load average: 232.20, 234.51, 231.36

So, I killed both running jobs for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721790, which runs selenium tests 100 times to find flaky tests. Normally, that kind of patch would not be a problem for the CI infrastructure. However, this particular patch is dependent on r1002578, which bumps wdio to version 8 (T324766). Apparently, the new version of wdio breaks our screen recording code and leaves the ffmpeg processes hanging?

Mentioned in SAL (#wikimedia-releng) [2025-04-29T20:28:19Z] <mutante> integration-agent-docker-1048.integration - killall -9 ffpmeg - T392963

Mentioned in SAL (#wikimedia-releng) [2025-04-29T21:04:51Z] <mutante> integration-agent-docker-1051.integration - killall -9 ffmpeg - T392963

FTR, this is being discussed in IRC: https://wm-bot.wmcloud.org/browser/index.php?start=04%2F29%2F2025&end=04%2F29%2F2025&display=%23wikimedia-releng

Things to look into:

21:14 < Daimona> So I'm assuming the other agents I listed above are also flooded by ffmpeg?
21:14 < Daimona> And those would be: 1040, 1041, 1044, 1047, 1062, 1063, 1064
...
..
21:21 < Daimona> But even if an agent happens to be running 3 selenium jobs (3 being our current max concurrency), and each of those has parallel selenium enabled with 4 threads, there should never be more than 12 ffmpeg processes at any given time
...
21:22 < mutante> 1062 - was very busy. killed ffmpeg. now 5 processes
21:23 < mutante> 1048 - number of ffmpeg proces = 48. killing 
21:26 < mutante> 1063 - 55 ffmpeg procs - killall'ed
21:28 < mutante> 1064 - 52 ffmpeg procs - killall'ed

Change #1139943 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/core@master] selenium: set explicit timeout when spawning ffmpeg

https://gerrit.wikimedia.org/r/1139943
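The actual patch is to MediaWiki's JavaScript selenium setup; the underlying idea can be sketched in shell with coreutils `timeout`, using `sleep` as a stand-in for the real ffmpeg invocation (both stand-ins are illustrative, not the actual change):

```shell
# Idea behind the fix: never spawn the screen recorder without a hard
# lifetime cap, so a leaked ffmpeg cannot outlive its build.
# `sleep 60` stands in for the ffmpeg invocation.
timeout --signal=TERM --kill-after=10 2 sleep 60
echo "recorder exit status: $?"   # 124 means the timeout fired
```

With a cap like this in place, even a recorder whose parent dies gets reaped by `timeout` rather than lingering on the agent.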

Based on what we know now: the shell loops triggering 100x selenium runs used in r721790 outlived the container for some reason. They accumulated and kept spawning ffmpeg processes, which are also executed outside the container. This already gives us a couple action items.

I didn't figure this out, but it's not so important now.

  • If it is correct: why does a single agent bring everything down?

I think it's not a single agent, but rather a bunch of agents that were affected. All new builds on those agents failed.

Still valid.

We saw ffmpeg processes come back after being killed, then figured out that there were shell loops (sh -c for i in ..., as shown below) running on the instances. In most cases there was one loop, but at least one instance had two running at the same time.

Then killed these loops and node processes on a couple more instances.

It's unclear whether we can tell from the command lines alone whether it's "wdio 8" or not, but it looked like this (on at least one instance, with two loops and double the node processes):

nobody   1964455  0.0  0.0   2484   100 ?        S    16:39   0:00 sh -c for i in $(seq 1 100); do wdio ./tests/selenium/wdio.conf.js; done
nobody   2197070  2.7  0.5 1288868 130420 ?      Sl   22:27   0:01 node /workspace/src/node_modules/.bin/wdio ./tests/selenium/wdio.conf.js
nobody   2197111  3.4  0.5 1293564 140108 ?      Sl   22:27   0:02 /usr/bin/node /workspace/src/node_modules/@wdio/local-runner/build/run.js ./tests/selenium/wdio.conf.js
nobody   2197114  3.0  0.5 1289972 130044 ?      Sl   22:27   0:01 /usr/bin/node /workspace/src/node_modules/@wdio/local-runner/build/run.js ./tests/selenium/wdio.conf.js

> We saw ffmpeg processes come back after being killed.. then figure out there were shell loops (sh -c for i in ...) as shown below on instances. In most cases one but at least one of them had 2 running at the same time.

We run selenium tests for both php81 and php74 so that would explain why there would be two shell loops on one instance.

> Then killed these loops and node processes on a couple more instances.

> It's unclear if we can see just from the command lines if it's using "wdio 8" or not but it looked like this:

nobody   1964455  0.0  0.0   2484   100 ?        S    16:39   0:00 sh -c for i in $(seq 1 100); do wdio ./tests/selenium/wdio.conf.js; done
nobody   2197070  2.7  0.5 1288868 130420 ?      Sl   22:27   0:01 node /workspace/src/node_modules/.bin/wdio ./tests/selenium/wdio.conf.js
nobody   2197111  3.4  0.5 1293564 140108 ?      Sl   22:27   0:02 /usr/bin/node /workspace/src/node_modules/@wdio/local-runner/build/run.js ./tests/selenium/wdio.conf.js
nobody   2197114  3.0  0.5 1289972 130044 ?      Sl   22:27   0:01 /usr/bin/node /workspace/src/node_modules/@wdio/local-runner/build/run.js ./tests/selenium/wdio.conf.js

wdio is still at version 7 everywhere it's merged: see codesearch.

So anything running wdio 8 would have to come from an unmerged patch.

The only way I know of to trace these back to a patch is to find the workspace mounted in the container and look at the change checked out at the top of that workspace's src directory.

So, in this case: use ps faux (or something else) to find the docker container that initiated pid 1964455 (from the output above), then ls -lA /proc/<parent-pid>/cwd to find that workspace directory. Then git -C <workspace-directory>/src show HEAD should get you the patch.

There is probably an easier way to do that, but that's the way I know how to do it.
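That procedure can be sketched as a few commands. PID 1964455 is the example from the ps output above; the current shell's own PID stands in here so the sketch is runnable anywhere:

```shell
# PID of the stray `sh -c for i in ...` loop; $$ used as a runnable stand-in.
PID=$$

# 1. Walk up the parent chain toward the docker/containerd process:
ps -o pid=,ppid=,comm= -p "$PID"

# 2. The process's working directory is the workspace mounted into the container:
WORKSPACE=$(readlink "/proc/$PID/cwd")
echo "workspace: $WORKSPACE"

# 3. On a real agent, the change under test would then be:
#    git -C "$WORKSPACE/src" show --no-patch HEAD
```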

thcipriani assigned this task to Dzahn.
thcipriani updated Other Assignee, added: Daimona.

I added a comment on the task that spawned this issue, which should point folks in the right direction for preventing it from happening in the near future: T324766#10778643

Looks like aggregate CI load is back to normal.

Mischief managed. Thanks all!

> I didn't figure this out, but it's not so important now.

There's not a good way to do this from the web UI, IIRC that's a deliberate 403 due to some kind of abuse thing that I've lost all memory of.

Here's a oneliner that should do it: curl -s "https://integration.wikimedia.org/ci/api/json?tree=jobs\[name,url,builds\[id,url,timestamp,duration,builtOn\]\]" | jq -r '.jobs[] | .builds[] | select(.builtOn != null) | "\(.builtOn)\t\(.url)"' | sort

> Based on what we know now: the shell loops triggering 100x selenium runs used in r721790 outlived the container for some reason. They accumulated and kept spawning ffmpeg processes, which are also executed outside the container. This already gives us a couple action items.

There are multiple issues happening, and the short answer is that Docker is a client/server application.

When a signal is sent to the client (docker run), some old versions of Docker did not relay it to the daemon (I reported this upstream with an extensive analysis, and it eventually got solved at some point). If a KILL signal is sent to the client, the process is terminated immediately by the kernel and the signal is not relayed.

The container might have bash running, which happily ignores the TERM signal and does not relay it onward. Beyond that, any running code can choose to catch the TERM signal and ignore it, so signals are not a reliable way to terminate the processes.

When Jenkins aborts a job, it sends a TERM signal to the process group (which in our case contains only docker run) and, after some seconds (15 IIRC), a KILL signal to the whole process group. But that certainly leaves containers running; we would need Jenkins to be aware of the containers and terminate them using the client (docker stop / docker kill).
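A minimal sketch of what such a wrapper could look like; the container name, image, and grace period are hypothetical, not the current CI entry point:

```shell
#!/bin/bash
# Instead of exposing a bare `docker run` to Jenkins, run the container
# detached and trap the TERM Jenkins sends on abort, translating it into
# `docker stop` so the daemon itself tears the container down.
NAME="ci-build-$$"

stop_container() {
    # `docker stop` sends TERM inside the container, then KILL after 15s.
    docker stop --time 15 "$NAME"
}
trap stop_container TERM INT

run_build() {
    docker run --detach --name "$NAME" my-ci-image /run.sh "$@"
    docker wait "$NAME"   # block here; signals still reach this shell
}
```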

Anyway, all of that is moot if we switch to one off virtual machines that are entirely reclaimed at the end of each build. That is the model we had for some time (with Nodepool) and I hope to adopt again as we upgrade Zuul/CI :)

@Daimona have you filed the two action items you mention? Or maybe it is on another task I might have missed. Maybe docker run can be given limits on the number of processes, the memory, or the number of CPUs.

> I didn't figure this out, but it's not so important now.

> There's not a good way to do this from the web UI, IIRC that's a deliberate 403 due to some kind of abuse thing that I've lost all memory of.

> Here's a oneliner that should do it: curl -s "https://integration.wikimedia.org/ci/api/json?tree=jobs\[name,url,builds\[id,url,timestamp,duration,builtOn\]\]" | jq -r '.jobs[] | .builds[] | select(.builtOn != null) | "\(.builtOn)\t\(.url)"' | sort

We deny requests to the history of builds. The reason is that the build history of jobs is kept as XML files, stored per job, one file per build. When Jenkins wants to retrieve the history of a specific job, it crawls the directory for that job. If one asks for all builds across jobs, Jenkins must process the whole build history, which might be hundreds of thousands of XML files to read. So the Jenkins controller crawls to its death. More or less; I haven't tried in a while, and maybe it is less of an issue if builds are lazy-loaded on start and new builds are kept in some memory cache.

Thanks both! This makes sense, and I'm under the impression that this was recently discussed elsewhere, but sadly I can't find that conversation.

> @Daimona have you filed the two action items you mention? Or maybe it is on another task I might have missed.

I didn't :) One would be for the MW selenium setup to make sure that ffmpeg is terminated; this is now tracked in T392974, which Peter filed. The other action item I had in mind was to figure out why the shell loop (and thus the node processes, and thus ffmpeg) consistently outlived the container. I think you already gave a good explanation of that! So, my question would be:

> we would need Jenkins to be aware of the containers and terminate them using the client (docker stop / docker kill).

> Anyway, all of that is moot if we switch to one off virtual machines that are entirely reclaimed at the end of each build. That is the model we had for some time (with Nodepool) and I hope to adopt again as we upgrade Zuul/CI :)

Is there a task to track this (either making jenkins aware of containers or switching to one-off VMs)?

> Maybe docker run can be given limits to the number of process/memory/number of CPUs.

That sounds sensible, in principle.
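For illustration, limits like these could be added to the docker run invocation; the values and the image name are placeholders, not a tested configuration:

```shell
# Hypothetical resource caps for a CI container: a leaked shell loop would hit
# the pids limit, and a hungry build would be OOM-killed inside the container
# instead of starving the whole agent.
DOCKER_ARGS=(
    --pids-limit 512      # cap the process tree (runaway ffmpeg/node)
    --memory 12g          # hard memory ceiling for the container
    --cpus 4              # at most 4 CPUs' worth of time
)
echo docker run "${DOCKER_ARGS[@]}" my-ci-image /run.sh
```

All three flags are standard `docker run` resource constraints; picking good values would need measurement against real job profiles.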

Do you feel like there's anything that can be done about these? Maybe some low-hanging fruit?

> Is there a task to track this (either making jenkins aware of containers or switching to one-off VMs)?

When the container is terminated, all the contained processes should have terminated too. But if the container is never stopped, I imagine they stay behind. Then I am not sure what happened with ffmpeg :-/ The future is to upgrade Zuul and change the execution model (Continuous-Integration-Infrastructure (Zuul upgrade)), potentially using one-off VMs; starting builds with a clean environment and removing it entirely is definitely nicer.

Overall, I would not spend much time on the current Zuul/Jenkins system given the aim is to replace it :)