
PHPUnit tests randomly time out in CI
Open, Needs Triage, Public

Description

Twice today, I saw a PHPUnit build time out for no obvious reason:

During the former build, I SSHed into the worker node, and its CPU seemed to be quite busy, with the PHPUnit process above 100% CPU usage IIRC. This feels like a different issue than T389536, though maybe it should still be considered part of T388416?

Event Timeline

This build might be in the same situation, no new console output since 14:09 UTC.

Edit: nevermind, it finished while I was looking into it. Guess this build was just a case of PHPUnit parallel group timing imbalance then… (cf. T392892)

This build might have been in the same situation before it was aborted (41 minutes without output).

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php81/23837/console shows the same basic pattern. The "interesting" part of the job log is:

15:14:12 Time: 01:48.436, Memory: 527.01 MB
15:14:12 
15:14:12 OK, but incomplete, skipped, or risky tests!
15:14:12 Tests: 1546, Assertions: 1444, Skipped: 102.
15:14:12 
15:14:12 
15:14:12 You should really speed up these slow tests (>100ms)...
15:14:12  1. 863ms to run ParserIntegrationTest::testParse with data set "citeParserTests.txt: References: 7b. Multiple references tags some with errors should be correctly handled"
15:14:12  2. 675ms to run ParserIntegrationTest::testParse with data set "citeParserTests.txt: Template nested in extension tag in template"
15:14:12  3. 669ms to run ParserIntegrationTest::testParse with data set "luaParserTests.txt: Scribunto: newParserValue"
15:14:12  4. 641ms to run ParserIntegrationTest::testParse with data set "luaParserTests.txt: Scribunto: Test unstripNowiki behavior"
15:14:12  5. 611ms to run ParserIntegrationTest::testParse with data set "luaParserTests.txt: Scribunto: no such function"
15:14:12  6. 594ms to run ParserIntegrationTest::testParse with data set "citeParserTests.txt: Hoisting multiple references should keep references in order"
15:14:12  7. 566ms to run ParserIntegrationTest::testParse with data set "luaParserTests.txt: Scribunto: named numeric parameters"
15:14:12  8. 563ms to run ParserIntegrationTest::testParse with data set "luaParserTests.txt: Scribunto: expandTemplate"
15:14:12  9. 547ms to run ParserIntegrationTest::testParse with data set "luaParserTests.txt: Scribunto: redirect to hello world"
15:14:12  10. 546ms to run ParserIntegrationTest::testParse with data set "parserTests.txt: deprecated enclose option adds tracking category"
15:14:12 ...and there are 179 more above your threshold hidden from view
15:14:12 Worker exited with status 0
15:55:20 Build was aborted
15:55:20 Aborted by BryanDavis
15:55:20 [PostBuildScript] - [INFO] Executing post build scripts.
15:55:20 [quibble-vendor-mysql-php81] $ /bin/bash -xe /tmp/jenkins4883856748561274423.sh
15:55:20 + find log/ -name 'mw-debug-*.log' -exec gzip '{}' +
15:55:27 gzip: log/mw-debug-cli.split-group-6.log: file size changed while zipping

This shows the phpunit tests finishing at 15:14:12 UTC. Then we see nothing happen in the logs for ~41 minutes until I cancelled the job via the Jenkins UI. I was prompted to do this by @lcawte and @Reedy noticing that the gate-and-submit queue was being blocked by that apparently deadlocked job.
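A silent gap like this can be spotted mechanically from a saved console log. A rough sketch, assuming the `HH:MM:SS` timestamp prefix seen in the excerpt above (a hypothetical helper, not an existing CI tool):

```shell
# Demo log reproducing the two timestamped lines from the excerpt above.
cat > console.log <<'EOF'
15:14:12 Worker exited with status 0
15:55:20 Build was aborted
EOF

# Report any gap longer than 10 minutes between consecutive timestamps.
gaps=$(awk '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
    split(substr($0, 1, 8), t, ":")
    s = t[1] * 3600 + t[2] * 60 + t[3]
    if (prev != "" && s - prev > 600)
        printf "gap of %d s before: %s\n", s - prev, $0
    prev = s
}' console.log)
echo "$gaps"
```

For the two lines above this reports a gap of 2468 s (about 41 minutes), matching the dead time described here. It would miss gaps that wrap past midnight.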

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php81/23840/console

15:22:32 Worker exited with status 0
16:10:22 Build timed out (after 60 minutes). Marking the build as failed.
16:10:22 Build was aborted

Looking at the previous WikibaseMediaInfo patch set that passed (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/1175870), it seems that some jobs simply take longer: some finished before the 60-minute timeout and some after.

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php81/23812/console : SUCCESS in 37m 45s
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php82/9853/console : SUCCESS in 32m 12s
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83/9312/console : SUCCESS in 53m 44s

Something in CI makes these jobs generate wait time, deadlocks, or similar on the last split group.
Without splitting, there would be progress markers from the single running thread. With the splitting there are no progress markers, so the job looks like it is hanging, but often the tests simply take 43 minutes, about 40 minutes longer than all the other splits.

An extreme example:
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1175524

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php81/23759/console : SUCCESS in 1h 00m 45s
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php82/9834/console : SUCCESS in 31m 20s
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83/9293/console : SUCCESS in 25
12:24:47 Started by user unknown or anonymous
...
13:24:44 Time: 51:24.613, Memory: 6.83 GB
13:24:44 
13:24:44 OK, but incomplete, skipped, or risky tests!
13:24:44 Tests: 10573, Assertions: 48275, Skipped: 42.
...
13:24:44 INFO:quibble.commands:<<< Finish: PHPUnit extensions suite (with database) parallel run (Composer), in 3122.459 s

Passed just barely within the timeout limit ...

> This shows the phpunit tests finishing at 15:14:12 UTC. Then we see nothing happen in the logs for ~41 minutes until I cancelled the job via the Jenkins UI.

It shows one of the PHPUnit split groups finishing. If you scroll up, you can see that a total of eight split groups were started, and this was the seventh to finish, so the last one was likely still running. (Whether it was making any forward progress is another question, I just wanted to clear up this potential misunderstanding.)

https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/31892/

I don't know what changed today that is causing this to happen so often, but it is really out of control at this point.

For wmf-quibble-selenium-php81 specifically, I tried to figure out how common it was for builds to be aborted and when this started:

$ find /srv/jenkins/builds/wmf-quibble-selenium-php81/ -type f -name log -exec bash -c 'grep -q "Build was aborted" "{}" && printf "%s: %s\n" "$(dirname {})" "$(stat --format %y {})"' \; | awk '{print $2}' | sort | uniq -c | sort -rn
     94 2025-07-30
     69 2025-08-05
     66 2025-07-31
     43 2025-08-01
     22 2025-08-04
     13 2025-08-03
     12 2025-08-02

Build timeouts are rare:

thcipriani@contint1002:/srv/jenkins/builds/wmf-quibble-selenium-php81$ find . -type f -name log -exec bash -c 'grep -q "Build timed out" "{}" && printf "%s: %s\n" "{}" "$(stat --format %y {})"' \; | tee ~/wmf-quibble-selenium-php81-timeout-2025-08-05.log 
./30956/log: 2025-07-31 08:55:04.179512963 +0000
./31503/log: 2025-08-03 20:02:13.194122640 +0000
./31222/log: 2025-08-01 04:22:24.887853631 +0000
./31892/log: 2025-08-05 19:49:05.814634642 +0000
./31427/log: 2025-08-02 09:56:01.070387109 +0000

...and that's it.

CI was backed up today, but I'm still trying to determine whether this is a unique issue or whether this is T388416.

Here are the jobs that have timed out over the past 7 days:

$ awk '{print $1}' build-timeouts-2025-08-05.log | sed 's#[0-9]\+\/log:##' | tr -d './' | sort | uniq -c | sort -rn
8 quibble-vendor-mysql-php81
5 wmf-quibble-selenium-php81
4 operations-puppet-catalog-compiler-puppet7-test
2 quibble-vendor-mysql-php81-selenium
1 quibble-composer-mysql-php81
1 mwgate-node20

Here are the timeouts per day:

$ awk '{print $2}' build-timeouts-2025-08-05.log | sort | uniq -c | sort -rn
6 2025-08-05
5 2025-07-31
4 2025-07-30
2 2025-08-03
2 2025-08-01
1 2025-08-04
1 2025-08-02

Here are the timeouts today:

 $ grep 2025-08-05 build-timeouts-2025-08-05.log 
./wmf-quibble-selenium-php81/31892/log: 2025-08-05 19:49:05.814634642 +0000
./quibble-vendor-mysql-php81/23768/log: 2025-08-05 12:20:11.530146011 +0000
./quibble-vendor-mysql-php81/23756/log: 2025-08-05 11:05:47.635468917 +0000
./quibble-vendor-mysql-php81/23762/log: 2025-08-05 11:49:12.141367998 +0000
./quibble-vendor-mysql-php81/23840/log: 2025-08-05 16:10:55.197268038 +0000
./operations-puppet-catalog-compiler-puppet7-test/4647/log: 2025-08-05 16:18:24.660357252 +0000

Thoughts:

  • 21 jobs over 7 days
  • 61% of these timeouts were either quibble-vendor-mysql-php81 or wmf-quibble-selenium-php81
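For what it's worth, the 61% figure checks out (8 quibble-vendor-mysql-php81 plus 5 wmf-quibble-selenium-php81 out of 21 total):

```shell
# Sanity check of the share of timeouts from the two php81 jobs.
total=21
php81=$((8 + 5))
pct=$((php81 * 100 / total))   # integer division, truncates 61.9 to 61
echo "${pct}% of timeouts came from the two php81 jobs"
```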

Spot checking a couple jobs from today:

  • quibble-vendor-mysql-php81/23756 and quibble-vendor-mysql-php81/23762
    • both ran on integration-agent-1050 at the same time
    • Load was pretty gnarly around that time.

Load explains all of today's builds except wmf-quibble-selenium-php81; load was fine when that one ran.

We're hitting a heck of a lot of rebuilds due to failing gate pipeline things today:

$ for file in /var/log/zuul/debug.*; do printf '%s: %s\n' "$file" "$(grep failing "$file" | grep gate | grep ahead | wc -l)"; done
/var/log/zuul/debug.log: 187        <-- TODAY
/var/log/zuul/debug.log.2025-07-25: 11
/var/log/zuul/debug.log.2025-07-26: 43
/var/log/zuul/debug.log.2025-07-27: 26
/var/log/zuul/debug.log.2025-07-28: 64
/var/log/zuul/debug.log.2025-07-29: 68
/var/log/zuul/debug.log.2025-07-30: 47
/var/log/zuul/debug.log.2025-07-31: 90
/var/log/zuul/debug.log.2025-08-01: 35
/var/log/zuul/debug.log.2025-08-02: 8
/var/log/zuul/debug.log.2025-08-03: 1
/var/log/zuul/debug.log.2025-08-04: 8

Theory: Failing tests cause many rebuilds. Many rebuilds back up the pipeline. Many things in the pipeline means many builds in parallel. And many builds in parallel mean lots of opportunity to have 3 builds on the same node. And if those nodes are unlucky, they'll run many intensive jobs on the same node at the same time, causing build timeouts. Except for selenium timeouts. Today, that one wasn't load.

That sounds plausible to me at the moment – it does feel like these builds are “just” experiencing excessive load rather than some sort of deadlock.

I haven’t encountered this issue again since last week, but this build today might have been the same issue – though the integration-agent-docker-1049 load during that time doesn’t look too gnarly to me. But it feels like we have a growing gate-and-submit queue again.


Other affected builds:

It could be the load on CI, but it is also a result of the split, which seems to produce very unbalanced test suites that run in parallel.

Examples:

Run on wikibase

https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php81/23116
split_group_0:  577 files  Time: 02:57.765, Memory: 546.00 MB
split_group_1:  448 files  Time: 02:54.005, Memory: 450.00 MB
split_group_2:  576 files  Time: 02:48.893, Memory: 580.82 MB
split_group_3:  672 files  Time: 02:26.448, Memory: 721.00 MB
split_group_4:  304 files  Time: 03:39.867, Memory: 937.00 MB
split_group_5:  432 files  Time: 02:12.206, Memory: 354.00 MB
split_group_6:    9 files  Time: 09:07.209, Memory: 634.00 MB
split_group_7:    1 file   Time: 03:04.292, Memory: 539.01 MB

Run on core

https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php81/10798
split_group_0:  561 files  Time: 02:49.066, Memory: 522.00 MB
split_group_1:  464 files  Time: 03:04.128, Memory: 482.00 MB
split_group_2:  528 files  Time: 02:46.426, Memory: 576.82 MB
split_group_3:  720 files  Time: 02:31.576, Memory: 723.00 MB
split_group_4:  256 files  Time: 03:32.342, Memory: 923.00 MB
split_group_5:  480 files  Time: 02:19.663, Memory: 370.00 MB
split_group_6:    9 files  Time: 08:10.547, Memory: 632.00 MB
split_group_7:    1 file   Time: 03:09.745, Memory: 539.01 MB

Bad run from EntitySchema

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php82/10510/console
split_group_0:  188 files  Time: 01:31.287, Memory: 318.00 MB
split_group_1:  357 files  Time: 01:26.260, Memory: 372.00 MB
split_group_2:  238 files  Time: 01:57.780, Memory: 395.00 MB
split_group_3:  136 files  Time: 01:30.326, Memory: 311.00 MB
split_group_4:  170 files  Time: 01:30.019, Memory: 267.00 MB
split_group_5:  221 files  Time: 01:23.908, Memory: 413.54 MB
split_group_6: 1918 files  Time: 45:46.490, Memory: 3.15 GB
split_group_7:    1 file   Time: 02:23.044, Memory: 467.00 MB
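Tables like the ones above can be assembled from saved per-group console logs. A sketch, assuming logs named `split_group_N.log` (the naming is made up; the `Time: ..., Memory: ...` summary-line format is taken from the excerpts):

```shell
# Fake per-group logs using the summary-line format from the excerpts above
# (the split_group_N.log naming is an assumption, not the real CI layout).
printf 'Time: 02:57.765, Memory: 546.00 MB\n' > split_group_0.log
printf 'Time: 09:07.209, Memory: 634.00 MB\n' > split_group_6.log

# Pull the first summary line out of each log and label it with the group.
table=$(for log in split_group_*.log; do
    printf '%s: %s\n' "${log%.log}" \
        "$(grep -m1 -oE 'Time: [0-9:.]+, Memory: [0-9.]+ [GM]B' "$log")"
done)
echo "$table"
```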

split_group_6 often contains the structure tests, but running additional Wikibase tests on top results in these long run times.
The structure tests are I/O-heavy, as they read many files to check (PHP files to generate the autoloader; CSS, LESS, and JS files to test minification), so maybe there are wait times on slow disk drives or similar.

At the least, the splitting should avoid these very long run times, even if the overall run time grows when CI is under load. But that is not happening.
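For illustration, a greedy longest-processing-time (LPT) split would balance groups if per-file durations were known, e.g. from a previous run. A sketch (the durations file and its contents are made up, and this is not how Quibble's actual splitter works):

```shell
# Made-up per-file durations (seconds): one huge structure-test file
# plus several small ones.
cat > durations.txt <<'EOF'
2746 StructureTest.php
90 FooTest.php
80 BarTest.php
70 BazTest.php
EOF

# Greedy LPT: take files longest-first, always assign to the lightest group.
out=$(sort -rn durations.txt | awk -v n=2 '
{
    best = 1                                  # group with the lightest load
    for (g = 2; g <= n; g++) if (load[g] < load[best]) best = g
    load[best] += $1
    print "group " best ": " $2
}
END { for (g = 1; g <= n; g++) printf "group %d total: %d\n", g, load[g] }')
echo "$out"
```

Even this ends up with one dominant group (2746 s vs. 240 s), which matches what the tables above show for split_group_6: balancing the splitter cannot help when a single chunk of work dwarfs everything else.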

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83/14809/console finished in 59m 07s; it had 21 changes chained after it in gate-and-submit, so it’s a good thing it didn’t take one minute longer or a lot of them would’ve needed a rebuild 😬