Page MenuHomePhabricator

Selenium timeouts can cause the job to remain stuck until the build times out
Open, HighPublic

Description

Sometimes, when a selenium test hits the 60s timeout, the entire job kind of freezes and hangs until it reaches the job-level timeout of 1h and gets killed. This has happened several times over the last couple of weeks. Today, it caused the entire gate-and-submit pipeline, which at the time had 53 patches, to remain stuck, consuming resources and waiting for nothing, for roughly 40 minutes. Then a few hours later it happened again and blocked the pipeline for 30 minutes. While general CI slowness is a problem, we need to figure out why selenium timeouts are not terminating the build immediately.

Previous examples from T388416:

23:23:36 [0-0] RUNNING in chrome - /tests/selenium/specs/editEventRegistration.js
23:25:04 [0-0] Error in "Edit Event Registration.can allow organizer to add an additional organizer"
23:25:04 Error: Timeout of 60000ms exceeded. The execution in the test "Edit Event Registration can allow organizer to add an additional organizer" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CampaignEvents/tests/selenium/specs/editEventRegistration.js)
23:25:04     at createTimeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/errors.js:498:15)
23:25:04     at Runnable._timeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:429:10)
23:25:04     at Timeout.<anonymous> (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:244:24)
23:25:04     at listOnTimeout (node:internal/timers:581:17)
23:25:04     at process.processTimers (node:internal/timers:519:7)
00:22:13 Build timed out (after 60 minutes). Marking the build as failed.
14:57:38 Execution of 3 workers started at 2025-03-18T14:57:38.189Z
14:57:38 
14:57:39 [0-2] RUNNING in chrome - /tests/selenium/specs/watchstar.js
14:57:39 [0-0] RUNNING in chrome - /tests/selenium/specs/mainmenu_loggedin.js
14:57:39 [0-1] RUNNING in chrome - /tests/selenium/specs/references.js
14:57:52 [0-2] PASSED in chrome - /tests/selenium/specs/watchstar.js
14:57:54 [0-0] Nearby item will only appear in main menu if $wgMFNearby is configured
14:57:54 [0-0] PASSED in chrome - /tests/selenium/specs/mainmenu_loggedin.js
15:30:11 Build timed out (after 60 minutes). Marking the build as failed.
15:30:11 Build was aborted

-2025-03-20 13:40:37Z: r1129578 (MinervaNeue), https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/4439/console:

13:38:46 Execution of 2 workers started at 2025-03-20T13:38:46.494Z
13:38:46 
13:38:46 Setting up modified /workspace/src/LocalSettings.php
13:38:46 Restarting php8.1-fpm
13:38:48 [0-1] RUNNING in chrome - /tests/selenium/specs/ipcontributions.js
13:38:49 [0-0] RUNNING in chrome - /tests/selenium/specs/contributions.js
13:39:23 [0-1] PASSED in chrome - /tests/selenium/specs/ipcontributions.js
13:40:37 [0-0] Error in "IPInfo on Special:Contributions.should show geo data for temp user with edits if agreement was already accepted"
13:40:37 Error: Timeout of 60000ms exceeded. The execution in the test "IPInfo on Special:Contributions should show geo data for temp user with edits if agreement was already accepted" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/IPInfo/tests/selenium/specs/contributions.js)
13:40:37     at createTimeoutError (/workspace/src/extensions/IPInfo/node_modules/mocha/lib/errors.js:498:15)
13:40:37     at Runnable._timeoutError (/workspace/src/extensions/IPInfo/node_modules/mocha/lib/runnable.js:429:10)
13:40:37     at Timeout.<anonymous> (/workspace/src/extensions/IPInfo/node_modules/mocha/lib/runnable.js:244:24)
13:40:37     at listOnTimeout (node:internal/timers:581:17)
13:40:37     at process.processTimers (node:internal/timers:519:7)

-2025-03-20 15:49:34Z: r1127912 (AbuseFilter), https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/4492/console

15:49:34 [0-0] Error in "CheckUser.With CheckUser user group.Verify checkuser can make checks:.Should be able to run 'Get IPs' check"
15:49:34 Error: Timeout of 60000ms exceeded. The execution in the test "Verify checkuser can make checks: Should be able to run 'Get IPs' check" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CheckUser/tests/selenium/specs/checkuser.js)
15:49:34     at createTimeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/errors.js:498:15)
15:49:34     at Runnable._timeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:429:10)
15:49:34     at Timeout.<anonymous> (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:244:24)
15:49:34     at listOnTimeout (node:internal/timers:581:17)
15:49:34     at process.processTimers (node:internal/timers:519:7)
15:50:34 [0-0] Error in "CheckUser.With CheckUser user group.Verify checkuser can make checks:.Should be able to run 'Get actions' check"
15:50:34 Error: Timeout of 60000ms exceeded. The execution in the test "Verify checkuser can make checks: Should be able to run 'Get actions' check" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CheckUser/tests/selenium/specs/checkuser.js)
15:50:34     at createTimeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/errors.js:498:15)
15:50:34     at Runnable._timeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:429:10)
15:50:34     at Timeout.<anonymous> (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:244:24)
15:50:34     at listOnTimeout (node:internal/timers:581:17)
15:50:34     at process.processTimers (node:internal/timers:519:7)
15:51:34 [0-0] Error in "CheckUser.With CheckUser user group.Verify checkuser can make checks:.Should be able to run 'Get users' check"
15:51:34 Error: Timeout of 60000ms exceeded. The execution in the test "Verify checkuser can make checks: Should be able to run 'Get users' check" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CheckUser/tests/selenium/specs/checkuser.js)
15:51:34     at createTimeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/errors.js:498:15)
15:51:34     at Runnable._timeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:429:10)
15:51:34     at Timeout.<anonymous> (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:244:24)
15:51:34     at listOnTimeout (node:internal/timers:581:17)
15:51:34     at process.processTimers (node:internal/timers:519:7)
16:40:33 Build timed out (after 60 minutes). Marking the build as failed.
16:40:33 Build was aborted

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

This issue seen some minutes ago looks new, maybe another issue / task, but with the same impact as the exception does not fail the build.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1130692
https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/5421/console

20:00:22 Setting up modified /workspace/src/LocalSettings.php
20:00:22 Restarting php8.1-fpm
20:00:23 2025-03-24 19:00:23,715 INFO exited: php-fpm (exit status 0; expected)
20:00:27 [0-0] RUNNING in chrome - /tests/selenium/specs/checkuser.js
20:00:27 [0-3] RUNNING in chrome - /tests/selenium/specs/investigate.js
20:00:27 [0-1] RUNNING in chrome - /tests/selenium/specs/checkuserlog.js
20:00:27 [0-2] RUNNING in chrome - /tests/selenium/specs/clienthints.js
20:00:27 [0-4] RUNNING in chrome - /tests/selenium/specs/tempAccountsOnboarding.js
20:00:40 [0-0] Error in "CheckUser.With CheckUser user group."before all" hook in "With CheckUser user group""
20:00:40 Navigation failed because browser has disconnected!
20:00:40 Error: Navigation failed because browser has disconnected!
20:00:40     at /workspace/src/extensions/CheckUser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/LifecycleWatcher.js:51:147
20:00:40     at /workspace/src/extensions/CheckUser/node_modules/puppeteer-core/lib/cjs/vendor/mitt/src/index.js:51:62
20:00:40     at Array.map (<anonymous>)
20:00:40     at Object.emit (/workspace/src/extensions/CheckUser/node_modules/puppeteer-core/lib/cjs/vendor/mitt/src/index.js:51:43)
20:00:40     at CDPSession.emit (/workspace/src/extensions/CheckUser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/EventEmitter.js:72:22)
20:00:40     at CDPSession._onClosed (/workspace/src/extensions/CheckUser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:271:14)
20:00:40     at Connection._onClose (/workspace/src/extensions/CheckUser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:144:21)
20:00:40     at WebSocket.<anonymous> (/workspace/src/extensions/CheckUser/node_modules/puppeteer-core/lib/cjs/puppeteer/node/NodeWebSocketTransport.js:17:30)
20:00:40     at WebSocket.onClose (/workspace/src/extensions/CheckUser/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:210:18)
20:00:40     at WebSocket.emit (node:events:518:28)
20:00:42 [0-3] PASSED in chrome - /tests/selenium/specs/investigate.js
20:00:44 [0-1] PASSED in chrome - /tests/selenium/specs/checkuserlog.js
20:00:44 [0-4] PASSED in chrome - /tests/selenium/specs/tempAccountsOnboarding.js
20:01:01 [0-2] PASSED in chrome - /tests/selenium/specs/clienthints.js
20:53:38 Build timed out (after 60 minutes). Marking the build as failed.
20:53:38 Build was aborted

It just happened again to the first patch in gate-and-submit: https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/5595/console

This is getting quite annoying.

Noting for posterity that it later happened AGAIN in https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/5611/console:

15:26:00 [0-1] PASSED in chrome - /tests/selenium/specs/importingFilters.js
15:27:12 [0-0] Error in "Filter editing.Trying to save a filter with bad data.cannot save an empty filter"
15:27:12 Error: Timeout of 60000ms exceeded. The execution in the test "Trying to save a filter with bad data cannot save an empty filter" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/AbuseFilter/tests/selenium/specs/editingFilters.js)
15:27:12     at createTimeoutError (/workspace/src/extensions/AbuseFilter/node_modules/mocha/lib/errors.js:498:15)
15:27:12     at Runnable._timeoutError (/workspace/src/extensions/AbuseFilter/node_modules/mocha/lib/runnable.js:429:10)
15:27:12     at Timeout.<anonymous> (/workspace/src/extensions/AbuseFilter/node_modules/mocha/lib/runnable.js:244:24)
15:27:12     at listOnTimeout (node:internal/timers:581:17)
15:27:12     at process.processTimers (node:internal/timers:519:7)
15:28:12 [0-0] Error in "Filter editing.Trying to save a filter with bad data.cannot save a filter with rules but no name"
15:28:12 Error: Timeout of 60000ms exceeded. The execution in the test "Trying to save a filter with bad data cannot save a filter with rules but no name" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/AbuseFilter/tests/selenium/specs/editingFilters.js)
15:28:12     at createTimeoutError (/workspace/src/extensions/AbuseFilter/node_modules/mocha/lib/errors.js:498:15)
15:28:12     at Runnable._timeoutError (/workspace/src/extensions/AbuseFilter/node_modules/mocha/lib/runnable.js:429:10)
15:28:12     at Timeout.<anonymous> (/workspace/src/extensions/AbuseFilter/node_modules/mocha/lib/runnable.js:244:24)
15:28:12     at listOnTimeout (node:internal/timers:581:17)
15:28:12     at process.processTimers (node:internal/timers:519:7)
16:23:24 Build timed out (after 60 minutes). Marking the build as failed.
16:23:24 Build was aborted

This one I caught in time, and @dancy helped with some data from the agent: in P74415 is the full list of processes. Then (from #wikimedia-releng):

2025-03-25 15:46:24	<Daimona>	What is 6721 doing?
2025-03-25 15:49:12	<Daimona>	Hmmmm let me test one thing locally
2025-03-25 15:49:28	<dancy>	Looks like it's blocked waiting for activity on fd 10 or 14, both of which are eventpoll type.
2025-03-25 15:50:17	<dancy>	The end of /tmp/lighthouse.HHgKUrw/chrome-err.log has:
2025-03-25 15:50:17	<dancy>	[7067:7123:0325/152545.497018:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
2025-03-25 15:50:17	<dancy>	[0325/152613.263289:ERROR:file_io_posix.cc(145)] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
2025-03-25 15:50:17	<dancy>	[0325/152613.263393:ERROR:file_io_posix.cc(145)] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
2025-03-25 15:50:17	<dancy>	[0325/152613.266640:ERROR:crash_handler.cc(101)] Detected GWP-ASan crash for allocation at 0x39f800c94000 (malloc) of type heap-use-after-free
2025-03-25 15:52:47	<Daimona>	Those do seem to match the time of the test failure
2025-03-25 15:53:18	<Daimona>	Actually, exactly 60 seconds before we fail due to a 60s timeout

The UAF might be worth looking into, and it might explain why the test failed in the first place. However, I'm not sure if it explains why the node process then hangs. I've been trying all day to get it to hang with r1129873 applied, but it keeps passing. I also tried testing some simple things locally but I'm not sure what to blame. Guess I can try reproducing locally in quibble, as a next step...

Finally reproduced it on purpose for r1129873 in https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/5707/console (kept forever):

20:04:23 [0-6] 2025-03-25T20:04:23.304Z INFO devtools: COMMAND findElement("css selector", "#wpLoginAttempt")
20:04:23 [0-6] 2025-03-25T20:04:23.306Z INFO devtools: RESULT { 'element-6066-11e4-a52e-4f735466cecf': 'ELEMENT-31' }
20:04:23 [0-6] 2025-03-25T20:04:23.313Z INFO devtools: COMMAND elementClick("ELEMENT-31")
20:04:23 [0-6] 2025-03-25T20:04:23.346Z INFO devtools: RESULT null
20:05:21 [0-6] Error in "Page.should be deletable"
20:05:21 Error: Timeout of 60000ms exceeded. The execution in the test "Page should be deletable" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/tests/selenium/specs/page.js)
20:05:21     at createTimeoutError (/workspace/src/node_modules/mocha/lib/errors.js:498:15)
20:05:21     at Runnable._timeoutError (/workspace/src/node_modules/mocha/lib/runnable.js:429:10)
20:05:21     at Timeout.<anonymous> (/workspace/src/node_modules/mocha/lib/runnable.js:244:24)
20:05:21     at listOnTimeout (node:internal/timers:581:17)
20:05:21     at process.processTimers (node:internal/timers:519:7)
20:09:23 [0-6] Start afterTest
20:10:38 Build timed out (after 60 minutes). Marking the build as failed.
20:10:38 Build was aborted

Once again we ot a 60s timeout, but nothing saying why it happened. The interesting part here is that the timeout occurred at 20:05:21, but at 20:09:23 (5 minutes later), it started executing the afterTest hook. The first thing it should do in the hook is save a screenshot, which would normally appear in the logs as follows (taken from a successful test):

20:04:20 [0-6] Start afterTest
20:04:20 [0-6] 2025-03-25T20:04:20.868Z INFO devtools: COMMAND takeScreenshot()

However, the build timed out an additional 75 seconds after the start of the hook, before any screenshot was taken. It feels as if it were actually doing something (as opposed to just being stuck), but VERY slowly. Otherwise I don't know why it would still run the afterTest hook.

Also worth noting that the logs do not indicate any uncaught exception or rejection for this run, so those cannot possibly be the cause of the timeouts. Also the fact that this only seems to happen in the 8.1 job, never in the 7.4. I don't think PHP or MW have anything to do with the timeouts, but the OS might: some other package in the 81 job could be buggy. Even just chrome, since the 74 job has chrome 90, and the 81 job chrome 120.

I'm not sure where to look next. If someone has any ideas on what might cause this or what other debug information would be useful, that'd be great. Likewise, it would be great if the entire gate-and-submit queue stopped being forced to restart multiple times a day.

Also the fact that this only seems to happen in the 8.1 job, never in the 7.4. I don't think PHP or MW have anything to do with the timeouts, but the OS might: some other package in the 81 job could be buggy. Even just chrome, since the 74 job has chrome 90, and the 81 job chrome 120.

The chromium 120.0.6099.224-1~deb11u1 in the docker-registry.wikimedia.org/releng/quibble-bullseye-php81:latest container seems to be the latest version for bullseye (https://packages.debian.org/bullseye/chromium), but 120 is ooooold for sure and could have easily had some bugs that effect selenium automation. Looking upstream for chromium and chromium-driver issues might be a good idea.

T362705: Migrate Quibble images from bullseye to bookworm is a thing too, but James seems to be holding with the buster->bullseye migration still because of PHP 7.4 not yet being completely abandoned.

Digging in the Zuul debug log, I did find php74 builds failing:

$ grep -o '/ci/job/.*quibble.*-selenium.*FAILURE in .*s$' debug.log*|grep php74|cut -d: -f1,3|grep -Pv 'in (\d\d?s|\dm|[12]\dm)'
debug.log.2025-02-28: FAILURE in 1h 03m 15s
debug.log.2025-02-28: FAILURE in 1h 02m 52s
debug.log.2025-03-14: FAILURE in 1h 00m 38s
debug.log.2025-03-19: FAILURE in 1h 01m 40s
debug.log.2025-03-20: FAILURE in 1h 01m 11s
debug.log.2025-03-20: FAILURE in 1h 00m 53s
debug.log.2025-03-20: FAILURE in 1h 00m 54s
debug.log.2025-03-20: FAILURE in 1h 00m 50s
debug.log.2025-03-20: FAILURE in 1h 00m 53s
debug.log.2025-03-24: FAILURE in 1h 00m 59s
debug.log.2025-03-24: FAILURE in 1h 01m 01s
debug.log.2025-03-24: FAILURE in 33m 36s
debug.log.2025-03-25: FAILURE in 1h 00m 50s
debug.log.2025-03-25: FAILURE in 40m 06s

And here is the list of all failures (earliest debug log is 2025-02-24):

1grep -o '^-.*/ci/job/.*quibble.*-selenium.*FAILURE in 1h.*' debug.log*|cut -d: -f1,4
2debug.log.2025-02-27: FAILURE in 1h 00m 26s
3debug.log.2025-02-28: FAILURE in 1h 03m 15s
4debug.log.2025-02-28: FAILURE in 1h 02m 52s
5debug.log.2025-03-03: FAILURE in 1h 00m 23s
6debug.log.2025-03-04: FAILURE in 1h 00m 33s
7debug.log.2025-03-04: FAILURE in 1h 01m 16s
8debug.log.2025-03-04: FAILURE in 1h 01m 05s
9debug.log.2025-03-05: FAILURE in 1h 00m 51s
10debug.log.2025-03-05: FAILURE in 1h 02m 22s
11debug.log.2025-03-06: FAILURE in 1h 00m 11s
12debug.log.2025-03-06: FAILURE in 1h 00m 26s
13debug.log.2025-03-06: FAILURE in 1h 01m 12s
14debug.log.2025-03-06: FAILURE in 1h 01m 53s
15debug.log.2025-03-06: FAILURE in 1h 01m 09s
16debug.log.2025-03-06: FAILURE in 1h 01m 53s
17debug.log.2025-03-07: FAILURE in 1h 00m 34s
18debug.log.2025-03-07: FAILURE in 1h 00m 39s
19debug.log.2025-03-07: FAILURE in 1h 00m 57s
20debug.log.2025-03-09: FAILURE in 1h 00m 12s
21debug.log.2025-03-10: FAILURE in 1h 00m 47s
22debug.log.2025-03-10: FAILURE in 1h 00m 45s
23debug.log.2025-03-10: FAILURE in 1h 01m 57s
24debug.log.2025-03-10: FAILURE in 1h 00m 58s
25debug.log.2025-03-11: FAILURE in 1h 01m 58s
26debug.log.2025-03-11: FAILURE in 1h 00m 39s
27debug.log.2025-03-11: FAILURE in 1h 00m 30s
28debug.log.2025-03-11: FAILURE in 1h 02m 03s
29debug.log.2025-03-11: FAILURE in 1h 01m 05s
30debug.log.2025-03-12: FAILURE in 1h 01m 31s
31debug.log.2025-03-13: FAILURE in 1h 00m 39s
32debug.log.2025-03-13: FAILURE in 1h 01m 49s
33debug.log.2025-03-13: FAILURE in 1h 00m 18s
34debug.log.2025-03-14: FAILURE in 1h 02m 33s
35debug.log.2025-03-14: FAILURE in 1h 01m 22s
36debug.log.2025-03-14: FAILURE in 1h 01m 02s
37debug.log.2025-03-14: FAILURE in 1h 01m 38s
38debug.log.2025-03-14: FAILURE in 1h 00m 54s
39debug.log.2025-03-14: FAILURE in 1h 00m 37s
40debug.log.2025-03-14: FAILURE in 1h 00m 38s
41debug.log.2025-03-15: FAILURE in 1h 00m 35s
42debug.log.2025-03-15: FAILURE in 1h 00m 27s
43debug.log.2025-03-17: FAILURE in 1h 00m 25s
44debug.log.2025-03-17: FAILURE in 1h 00m 53s
45debug.log.2025-03-18: FAILURE in 1h 00m 27s
46debug.log.2025-03-18: FAILURE in 1h 00m 29s
47debug.log.2025-03-18: FAILURE in 1h 02m 31s
48debug.log.2025-03-18: FAILURE in 1h 00m 57s
49debug.log.2025-03-18: FAILURE in 1h 01m 07s
50debug.log.2025-03-18: FAILURE in 1h 02m 56s
51debug.log.2025-03-18: FAILURE in 1h 00m 41s
52debug.log.2025-03-18: FAILURE in 1h 00m 55s
53debug.log.2025-03-18: FAILURE in 1h 00m 55s
54debug.log.2025-03-18: FAILURE in 1h 02m 48s
55debug.log.2025-03-18: FAILURE in 1h 01m 05s
56debug.log.2025-03-19: FAILURE in 1h 01m 58s
57debug.log.2025-03-19: FAILURE in 1h 01m 47s
58debug.log.2025-03-19: FAILURE in 1h 02m 37s
59debug.log.2025-03-19: FAILURE in 1h 01m 40s
60debug.log.2025-03-19: FAILURE in 1h 02m 31s
61debug.log.2025-03-19: FAILURE in 1h 01m 52s
62debug.log.2025-03-19: FAILURE in 1h 01m 16s
63debug.log.2025-03-19: FAILURE in 1h 00m 37s
64debug.log.2025-03-20: FAILURE in 1h 01m 30s
65debug.log.2025-03-20: FAILURE in 1h 04m 17s
66debug.log.2025-03-20: FAILURE in 1h 02m 38s
67debug.log.2025-03-20: FAILURE in 1h 02m 44s
68debug.log.2025-03-20: FAILURE in 1h 01m 01s
69debug.log.2025-03-20: FAILURE in 1h 00m 24s
70debug.log.2025-03-20: FAILURE in 1h 02m 05s
71debug.log.2025-03-20: FAILURE in 1h 01m 11s
72debug.log.2025-03-20: FAILURE in 1h 01m 28s
73debug.log.2025-03-20: FAILURE in 1h 00m 53s
74debug.log.2025-03-20: FAILURE in 1h 00m 10s
75debug.log.2025-03-20: FAILURE in 1h 00m 54s
76debug.log.2025-03-20: FAILURE in 1h 00m 51s
77debug.log.2025-03-20: FAILURE in 1h 02m 24s
78debug.log.2025-03-20: FAILURE in 1h 00m 50s
79debug.log.2025-03-20: FAILURE in 1h 01m 16s
80debug.log.2025-03-20: FAILURE in 1h 00m 53s
81debug.log.2025-03-20: FAILURE in 1h 00m 11s
82debug.log.2025-03-21: FAILURE in 1h 00m 17s
83debug.log.2025-03-21: FAILURE in 1h 01m 56s
84debug.log.2025-03-22: FAILURE in 1h 00m 30s
85debug.log.2025-03-22: FAILURE in 1h 01m 16s
86debug.log.2025-03-22: FAILURE in 1h 00m 35s
87debug.log.2025-03-22: FAILURE in 1h 01m 38s
88debug.log.2025-03-22: FAILURE in 1h 00m 27s
89debug.log.2025-03-24: FAILURE in 1h 00m 21s
90debug.log.2025-03-24: FAILURE in 1h 01m 37s
91debug.log.2025-03-24: FAILURE in 1h 00m 59s
92debug.log.2025-03-24: FAILURE in 1h 00m 42s
93debug.log.2025-03-24: FAILURE in 1h 01m 01s
94debug.log.2025-03-24: FAILURE in 1h 00m 18s
95debug.log.2025-03-24: FAILURE in 1h 00m 55s
96debug.log.2025-03-24: FAILURE in 1h 00m 29s
97debug.log.2025-03-25: FAILURE in 1h 00m 07s
98debug.log.2025-03-25: FAILURE in 1h 01m 37s
99debug.log.2025-03-25: FAILURE in 1h 00m 08s
100debug.log.2025-03-25: FAILURE in 1h 00m 33s
101debug.log.2025-03-25: FAILURE in 1h 01m 37s
102debug.log.2025-03-25: FAILURE in 1h 01m 34s
103debug.log.2025-03-25: FAILURE in 1h 00m 50s
104debug.log.2025-03-25: FAILURE in 1h 01m 31s

Looking upstream for chromium and chromium-driver issues might be a good idea.

FYI: When we upgrade webdriver.io to version 8 the browser functionality is updated in wdio so they automatically download the latest browser and driver. We haven't discussed how we would solve that though in CI (making sure its cached/exists in the container) and we probably still need to be able to pin to specific versions, so a bad browser version will not break our CI until they release an update.

For now I wonder is it possible to downgrade instead to Chromium 90? Thinking that we see that it potentially work there, we could downgrade now and then take the job and fixing everything how we want it with the new webdriverio?

Also worth noting that the logs do not indicate any uncaught exception or rejection for this run, so those cannot possibly be the cause of the timeouts

I wonder if there's another case that we could have missed? What signal is Jenkins sending when it kills the job? I added so we catch SIGINT and SIGTERM and log active handles (like what NodeJS code is actually running). Maybe we can iterate that and see if we can make that path better, it would be great if we could log exactly what's going when the job is killed.

What signal is Jenkins sending when it kills the job?

Jenkins sends a TERM to the process group. That would reach docker run which proxies the signal to the daemon which then relay it to the container. That is then passed to python/Quibble which I guess relay the TERM to any child process it might have spawned.

My guess is wdio-mediawiki can react on TERM being received and emit a thread dump? I have no idea how that can be done in NodeJS though. Maybe https://nodejs.org/en/learn/diagnostics/memory/using-heap-snapshot/ is relevant and lets one inspect the trace.

2025-03-25 15:50:17	<dancy>	The end of /tmp/lighthouse.HHgKUrw/chrome-err.log has:
2025-03-25 15:50:17	<dancy>	[7067:7123:0325/152545.497018:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
2025-03-25 15:50:17	<dancy>	[0325/152613.263289:ERROR:file_io_posix.cc(145)] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
2025-03-25 15:50:17	<dancy>	[0325/152613.263393:ERROR:file_io_posix.cc(145)] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
2025-03-25 15:50:17	<dancy>	[0325/152613.266640:ERROR:crash_handler.cc(101)] Detected GWP-ASan crash for allocation at 0x39f800c94000 (malloc) of type heap-use-after-free

Ok, this is Chromium that is crashing. Shouldn't webdriver.io handle that? I tried to understand each line:

ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffe

Maybe passing --disable-gpu to Chromium could help us with that, and don't see that we do that today?

[0325/152613.263289:ERROR:file_io_posix.cc(145)] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)

There's many references to that in issues when Chromium is crashing. I wonder why it needs access to the check the CPU frequency? It seems to be introduced in version 98 according to this user.

Lets first try disabling the GPU and see if it works.

Change #1131260 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[mediawiki/core@master] selenium: Disable GPU usage when running in Docker.

https://gerrit.wikimedia.org/r/1131260

Looking upstream for chromium and chromium-driver issues might be a good idea.

That might be interesting. It's a bit hard without knowing what kind of issues I'm looking for specifically (i.e., what component is responsible, and what exactly is happening), but I thought I'd try. I started with a general, basically unfiltered, web search to find inspiration. So I randomly opened one of the first results: https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/1732. This is a package we aren't using, but I thought maybe the issue reported there is in the underlying driver. The symptoms seem to be roughly the same: extreme slowness with chrome(driver) 120/121. People suggest some workarounds, and one of the last comments in particular says that passing --disable-gpu worked for them. Interesting, see below.

Digging in the Zuul debug log, I did find php74 builds failing:

I'm guessing the console logs for those have been garbage collected, right? It would be interesting to see if it was the same kind of timeout, or a different cause.

Also worth noting that the logs do not indicate any uncaught exception or rejection for this run, so those cannot possibly be the cause of the timeouts

I wonder if there's another case that we could have missed?

One thing I was thinking about: maybe the 1h timeout is always triggered by an error. But whereas the most recent instances (and some of our previous ones) have an actual, visible error, some might have been caused by uncaught errors. Since we now fixed the source of uncaught exceptions in Minerva, we're only seeing timeouts triggered by visible errors. Maybe!

My guess is wdio-mediawiki can react on TERM being received and emit a thread dump? I have no idea how that can be done in NodeJS though. Maybe https://nodejs.org/en/learn/diagnostics/memory/using-heap-snapshot/ is relevant and lets one inspect the trace.

I basically had the same question. I don't know how to do that in node, and how to save it as a build artifact.

Ok, this is Chromium that is crashing. Shouldn't webdriver.io handle that?

Maybe something gets corrupt and trips wdio? I'm also thinking this might have been fixed in a newer version.

ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffe

Maybe passing --disable-gpu to Chromium could help us with that, and don't see that we do that today?

Aaaaaaaand here we go, that's the same suggestion I found in that GH issue! Definitely worth a shot, so I'll merge your patch next.

T362705: Migrate Quibble images from bullseye to bookworm is a thing too, but James seems to be holding with the buster->bullseye migration still because of PHP 7.4 not yet being completely abandoned.

No, I think it's reasonable to make the move, but @hashar generally thinks that we should run CI images as close as possible to the production images (which are currently bullseye-based).

Change #1131342 had a related patch set uploaded (by Zfilipin; author: Zfilipin):

[mediawiki/core@master] WIP selenium: Find flaky tests

https://gerrit.wikimedia.org/r/1131342

Change #1131260 merged by jenkins-bot:

[mediawiki/core@master] selenium: Disable GPU usage when running in Docker.

https://gerrit.wikimedia.org/r/1131260

As a next step, I imagine we should make a new release of wdio-mediawiki. Is there anything else that should be included in the release but has not been merged yet?

Ping @zeljkofilipin do you have anything that should be merged or can it be released?

Change #1131342 abandoned by Zfilipin:

[mediawiki/core@master] WIP selenium: Find flaky tests

https://gerrit.wikimedia.org/r/1131342

@Daimona @Peter feel free to release new version of wdio-mediawiki. Let me know if you need help.

I'm not in the npm group I think, so please help!

Change #1131716 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/core@master] wdio-mediawiki: Release 2.7.1

https://gerrit.wikimedia.org/r/1131716

Sadly, I don't think disabling te GPU helped. Earlier today we had https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/6117/console, for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1131080/2:

05:52:07 [0-3] RUNNING in chrome - /tests/selenium/specs/myEvents.js
05:53:14 [0-3] Error in "MyEvents.can allow organizer to delete registration of first event in My Events"
05:53:14 Error: Timeout of 60000ms exceeded. The execution in the test "MyEvents can allow organizer to delete registration of first event in My Events" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CampaignEvents/tests/selenium/specs/myEvents.js)
05:53:14     at createTimeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/errors.js:498:15)
05:53:14     at Runnable._timeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:429:10)
05:53:14     at Timeout.<anonymous> (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:244:24)
05:53:14     at listOnTimeout (node:internal/timers:581:17)
05:53:14     at process.processTimers (node:internal/timers:519:7)
06:48:01 Build timed out (after 60 minutes). Marking the build as failed.
06:48:01 Build was aborted

The commit includes the addition of the --disable-gpu flag. This is the only occurrence for today that I found in https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/buildTimeTrend

Sadly, I don't think disabling te GPU helped. Earlier today we had https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/6117/console, for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1131080/2:

05:52:07 [0-3] RUNNING in chrome - /tests/selenium/specs/myEvents.js
05:53:14 [0-3] Error in "MyEvents.can allow organizer to delete registration of first event in My Events"
05:53:14 Error: Timeout of 60000ms exceeded. The execution in the test "MyEvents can allow organizer to delete registration of first event in My Events" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CampaignEvents/tests/selenium/specs/myEvents.js)
05:53:14     at createTimeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/errors.js:498:15)
05:53:14     at Runnable._timeoutError (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:429:10)
05:53:14     at Timeout.<anonymous> (/workspace/src/extensions/CampaignEvents/node_modules/mocha/lib/runnable.js:244:24)
05:53:14     at listOnTimeout (node:internal/timers:581:17)
05:53:14     at process.processTimers (node:internal/timers:519:7)
06:48:01 Build timed out (after 60 minutes). Marking the build as failed.
06:48:01 Build was aborted

The commit includes the addition of the --disable-gpu flag. This is the only occurrence for today that I found in https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/buildTimeTrend

This uses the version of wdio-mediawiki defined in CampaignEvents though, no? I see a local npm install step prior to that test run, and CampaignEvents isn't yet updated to use wdio-mediawiki 2.7.1.

This uses the version of wdio-mediawiki defined in CampaignEvents though, no? I see a local npm install step prior to that test run, and CampaignEvents isn't yet updated to use wdio-mediawiki 2.7.1.

Oh yeah, my previous comment was a test to see if y'all were paying attention, obviously. Definitely not me being stoopid :P The only new instance is https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/6484/console, which is in VisualEditor. So yeah, releasing 2.7.1 with the GPU fix is currently our best bet.

FYI: now it is not only CI, but also LibUp that is broken, because it tries to upgrade wdio-mediawiki to 2.7.1, but it has not been released. Is there something I can do to help get it out?

FYI: now it is not only CI, but also LibUp that is broken, because it tries to upgrade wdio-mediawiki to 2.7.1, but it has not been released. Is there something I can do to help get it out?

Sorry, that's my bad—I misread the conversation here and ended up assuming 2.7.1 already got released.

I can revert the libup config change, but if possible, it'd be nice to land the update on NPM instead.

FYI: now it is not only CI, but also LibUp that is broken, because it tries to upgrade wdio-mediawiki to 2.7.1, but it has not been released. Is there something I can do to help get it out?

Sorry, that's my bad—I misread the conversation here and ended up assuming 2.7.1 already got released.

No, it's fine! I also think it'd be nice to get the update out -- in fact, I was initially comfortable with the LibUp change because I assumed we'd get a release shortly afterwards.

From a quick check to the build trends, it seems that there have been at least 3 instances today of the 1h timeout: one in gate-and-submit (6789, ext/MobileApp) and two in test (6772 and 6961, both in core). So yeah, this bug is definitely still happening. I'll try adding reviewers to the patch, based on the list of who can npm-publish it.

Change #1131716 merged by jenkins-bot:

[mediawiki/core@master] wdio-mediawiki: Release 2.7.1

https://gerrit.wikimedia.org/r/1131716

My guess is wdio-mediawiki can react on TERM being received and emit a thread dump? I have no idea how that can be done in NodeJS though. Maybe https://nodejs.org/en/learn/diagnostics/memory/using-heap-snapshot/ is relevant and lets one inspect the trace.

I basically had the same question. I don't know how to do that in node, and how to save it as a build artifact.

I had that filed T390125 but had it associated with the parent task T388416. I have made it a subtask of this one to give it some visibility. Unfortunately I don't know much about NodeJS internals myself but we can collect ideas there.

I have just noticed this is unbreak now. Is that still correct?

zeljkofilipin lowered the priority of this task from Unbreak Now! to High.Apr 11 2025, 1:37 PM

Looks like this is no longer unbreak now. I'll lower priority to high.

I have just noticed this is unbreak now. Is that still correct?

I was at an offsite last week and was waiting for the wdio-mediawiki upgrade rollout. I just checked the build times (last week). We have 13 builds that timed out after 1 hour. Checked a bunch of them and it's always the same pattern in selenium, not always preceded by an explicit test failure. A random example, https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/8979/console, is from VE at 2025-04-13 18:30:50Z. This is using wdio-mediawiki 2.7.1, so our fixes are insufficient.

I'm out of ideas though, and I've already spent a bunch of time on this task. I'm not sure if people have better ideas, or if we should aim for the v9 upgrade, or what else. So, is this still UBN? Well, the issue is still happening, and the numbers seem to suggest that we're still wasting CI time. Is it UBN? I guess it depends on who you ask.

So, is this still UBN? Well, the issue is still happening, and the numbers seem to suggest that we're still wasting CI time. Is it UBN? I guess it depends on who you ask.

Unbreak Now! – Something is broken and needs to be fixed immediately, setting anything else aside. This should meet the requirements for issues that hold the train.

I honestly don't think this was ever actually UBN! by the most public definition. Different teams can use the priority differently, but in a cross-team, cross-department sense this was more likely always High at best per https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette and "Report status and priority fields summarise and reflect reality and do not cause it."

That doesn't mean that this was not and is not still workflow impacting issue of high importance to monitor and attempt to remediate. It really just means that UBN! is meant to mean "Interrupt anyone and everyone necessary to get this fixed as close to immediately as possible", and bringing that level of urgency to browser automation test execution would most likely result in "turn off those tests to get things moving" followed by setting a Medium-High priority to follow up with the investigation and attempted remedies that are what has actually been happening here.

It really just means that UBN! is meant to mean "Interrupt anyone and everyone necessary to get this fixed as close to immediately as possible"

Thanks, that was exactly what I was thinking. Should I stop everything and just work on this (UBN) or is it sufficient to "just" make this a priority task.

I honestly don't think this was ever actually UBN! by the most public definition. Different teams can use the priority differently, but in a cross-team, cross-department sense this was more likely always High at best per https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette and "Report status and priority fields summarise and reflect reality and do not cause it."

I guess I was mostly reflecting my reality at the time. When I made this task an UBN, the timeouts happened 3 times in the span of a single afternoon, all in the gate-and-submit queue. The queue already had like 50 patches in it, and because a failure at the start of the queue causes everything else to be re-enqueued and tested from scratch, it eventually resulted in nothing being merged for a few hours, impacting my ability to work on stuff. So, I immediately went on to see what I could've done. In terms of effects, this wasn't much different than a shared build failure, which we usually treat as UBN. The difference, I guess, is that while build failure generally have a quick workaround of disabling the offending test, this wasn't an option here.

That doesn't mean that this was not and is not still workflow impacting issue of high importance to monitor and attempt to remediate.

Makes sense. It seems realistic to treat this as high (but not UBN) priority now, assuming we're no longer constantly delaying the merge of dozens of patches by multiple hours; which I think is a safe assumption, seeing that nobody complained about it in the last week or so.

bringing that level of urgency to browser automation test execution would most likely result in "turn off those tests to get things moving"

Oh yeah, I guess that would have been an option. It sounds a bit scary though, especially because we don't seem to have many ideas on how to fix this problem.

I just saw something interesting in https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/9521/console: it timed out after installing npm dependencies for Wikibase, before running selenium:

19:07:48 > tainted-ref@0.1.0 prepare
19:07:48 > node build/wikimedia-ui-base.js
19:07:48 
19:07:48 
19:07:48 added 2722 packages, and audited 2723 packages in 19s
19:07:48 
19:07:48 160 packages are looking for funding
19:07:48   run `npm fund` for details
19:07:49 
19:07:49 149 vulnerabilities (4 low, 79 moderate, 52 high, 14 critical)
19:07:49 
19:07:49 To address issues that do not require attention, run:
19:07:49   npm audit fix
19:07:49 
19:07:49 To address all issues possible (including breaking changes), run:
19:07:49   npm audit fix --force
19:07:49 
19:07:49 Some issues need review, and may require choosing
19:07:49 a different dependency.
19:07:49 
19:07:49 Run `npm audit` for details.
19:08:01 
19:08:01 added 976 packages, and audited 977 packages in 31s
19:08:01 
19:08:01 229 packages are looking for funding
19:08:01   run `npm fund` for details
19:08:01 
19:08:01 16 vulnerabilities (9 moderate, 7 high)
19:08:01 
19:08:01 To address all issues, run:
19:08:01   npm audit fix
19:08:01 
19:08:01 Run `npm audit` for details.
19:53:37 Build timed out (after 60 minutes). Marking the build as failed.

I'm not going to keep that build because it doesn't look interesting, but this might indicate that it's not a selenium problem, after all? And it's not the only instance I've seen recently that isn't preceded by errors, either.

Change #1130326 abandoned by Hashar:

[integration/config@master] jjb: [quibble*] lower timeout from 60 to 30 minutes

https://gerrit.wikimedia.org/r/1130326

https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/10869/console

17:41:20 [0-4] Error in "Temporary Accounts Onboarding Dialog.Verify onboarding dialog displays correctly and IPInfo preference works"
17:41:20 Error: Timeout of 60000ms exceeded. The execution in the test "Temporary Accounts Onboarding Dialog Verify onboarding dialog displays correctly and IPInfo preference works" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CheckUser/tests/selenium/specs/tempAccountsOnboarding.js)
17:41:20     at createTimeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/errors.js:498:15)
17:41:20     at Runnable._timeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:429:10)
17:41:20     at Timeout.<anonymous> (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:244:24)
17:41:20     at listOnTimeout (node:internal/timers:581:17)
17:41:20     at process.processTimers (node:internal/timers:519:7)
17:55:05 Build was aborted

(I have rebased the corresponding patch set to remove it from the queue.)

https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/11339/console

21:12:41 [0-0] RUNNING in chrome - /repo/tests/selenium/specs/blocked.js
21:12:55 [0-0] PASSED in chrome - /repo/tests/selenium/specs/blocked.js
21:12:55 [0-1] RUNNING in chrome - /repo/tests/selenium/specs/item.js
21:14:04 [0-1] Error in "item.old revisions do not have an edit link"
21:14:04 Error: Timeout of 60000ms exceeded. The execution in the test "item old revisions do not have an edit link" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/Wikibase/repo/tests/selenium/specs/item.js)
21:14:04     at createTimeoutError (/workspace/src/extensions/Wikibase/node_modules/mocha/lib/errors.js:498:15)
21:14:04     at Runnable._timeoutError (/workspace/src/extensions/Wikibase/node_modules/mocha/lib/runnable.js:429:10)
21:14:04     at Timeout.<anonymous> (/workspace/src/extensions/Wikibase/node_modules/mocha/lib/runnable.js:244:24)
21:14:04     at listOnTimeout (node:internal/timers:581:17)
21:14:04     at process.processTimers (node:internal/timers:519:7)
21:15:04 [0-1] Error in "item.has its label not rendered when linked on a Wikipage"
21:15:04 Error: Timeout of 60000ms exceeded. The execution in the test "item has its label not rendered when linked on a Wikipage" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/Wikibase/repo/tests/selenium/specs/item.js)
21:15:04     at createTimeoutError (/workspace/src/extensions/Wikibase/node_modules/mocha/lib/errors.js:498:15)
21:15:04     at Runnable._timeoutError (/workspace/src/extensions/Wikibase/node_modules/mocha/lib/runnable.js:429:10)
21:15:04     at Timeout.<anonymous> (/workspace/src/extensions/Wikibase/node_modules/mocha/lib/runnable.js:244:24)
21:15:04     at listOnTimeout (node:internal/timers:581:17)
21:15:04     at process.processTimers (node:internal/timers:519:7)
21:57:31 Build timed out (after 60 minutes). Marking the build as failed.

https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/11350/consoleFull

22:01:46 Setting up modified /workspace/src/LocalSettings.php
22:01:46 Restarting php8.1-fpm
22:01:47 [0-0] RUNNING in chrome - /tests/selenium/specs/newpagesfeed.js
22:01:47 [0-1] RUNNING in chrome - /tests/selenium/specs/toolbar.js
22:01:51 [0-1] RunJobs through requests to the main page (run 1).
22:01:51 [0-1] Unhandled Rejection at: Promise {
22:01:51 [0-1]   <rejected> Protocol error (Page.navigate): Target closed.
22:01:51 [0-1]   ProtocolError: Protocol error (Page.navigate): Target closed.
22:01:51 [0-1]       at new Promise (<anonymous>)
22:01:51 [0-1]       at DevToolsDriver.navigateTo (/workspace/src/extensions/PageTriage/node_modules/devtools/build/commands/navigateTo.js:18:16)
22:01:51 [0-1]       at Browser.wrappedCommand (/workspace/src/extensions/PageTriage/node_modules/devtools/build/devtoolsdriver.js:128:55)
22:01:51 [0-1]       at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
22:01:51 [0-1]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:01:51 [0-1]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:01:51 [0-1]       at async Page.openTitle (/workspace/src/extensions/PageTriage/node_modules/wdio-mediawiki/Page.js:22:3)
22:01:51 [0-1]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29) {
22:01:51 [0-1]     originalMessage: undefined
22:01:51 [0-1]   }
22:01:51 [0-1] } reason: Protocol error (Page.navigate): Target closed.
22:01:51 [0-1] ProtocolError: Protocol error (Page.navigate): Target closed.
22:01:51 [0-1]     at new Promise (<anonymous>)
22:01:51 [0-1]     at DevToolsDriver.navigateTo (/workspace/src/extensions/PageTriage/node_modules/devtools/build/commands/navigateTo.js:18:16)
22:01:51 [0-1]     at Browser.wrappedCommand (/workspace/src/extensions/PageTriage/node_modules/devtools/build/devtoolsdriver.js:128:55)
22:01:51 [0-1]     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
22:01:51 [0-1]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:01:51 [0-1]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:01:51 [0-1]     at async Page.openTitle (/workspace/src/extensions/PageTriage/node_modules/wdio-mediawiki/Page.js:22:3)
22:01:51 [0-1]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29) {
22:01:51 [0-1]   originalMessage: undefined
22:01:51 [0-1] }
22:01:51 [0-1] Unhandled Rejection at: Promise {
22:01:51 [0-1]   <rejected> Protocol error (Page.navigate): Target closed.
22:01:51 [0-1]   ProtocolError: Protocol error (Page.navigate): Target closed.
22:01:51 [0-1]       at new Promise (<anonymous>)
22:01:51 [0-1]       at DevToolsDriver.navigateTo (/workspace/src/extensions/PageTriage/node_modules/devtools/build/commands/navigateTo.js:18:16)
22:01:51 [0-1]       at Browser.wrappedCommand (/workspace/src/extensions/PageTriage/node_modules/devtools/build/devtoolsdriver.js:128:55)
22:01:51 [0-1]       at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
22:01:51 [0-1]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:01:51 [0-1]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:01:51 [0-1]       at async Page.openTitle (/workspace/src/extensions/PageTriage/node_modules/wdio-mediawiki/Page.js:22:3)
22:01:51 [0-1]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29) {
22:01:51 [0-1]     originalMessage: undefined
22:01:51 [0-1]   }
22:01:51 [0-1] } reason: Protocol error (Page.navigate): Target closed.
22:01:51 [0-1] ProtocolError: Protocol error (Page.navigate): Target closed.
22:01:51 [0-1]     at new Promise (<anonymous>)
22:01:51 [0-1]     at DevToolsDriver.navigateTo (/workspace/src/extensions/PageTriage/node_modules/devtools/build/commands/navigateTo.js:18:16)
22:01:51 [0-1]     at Browser.wrappedCommand (/workspace/src/extensions/PageTriage/node_modules/devtools/build/devtoolsdriver.js:128:55)
22:01:51 [0-1]     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
22:01:51 [0-1]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:01:51 [0-1]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:01:51 [0-1]     at async Page.openTitle (/workspace/src/extensions/PageTriage/node_modules/wdio-mediawiki/Page.js:22:3)
22:01:51 [0-1]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29) {
22:01:51 [0-1]   originalMessage: undefined
22:01:51 [0-1] }
22:02:00 [0-0] RunJobs through requests to the main page (run 1).
22:02:00 [0-0] Unhandled Rejection at: Promise {
22:02:00 [0-0]   <rejected> Protocol error (Page.navigate): Target closed.
22:02:00 [0-0]   ProtocolError: Protocol error (Page.navigate): Target closed.
22:02:00 [0-0]       at new Promise (<anonymous>)
22:02:00 [0-0]       at DevToolsDriver.navigateTo (/workspace/src/extensions/PageTriage/node_modules/devtools/build/commands/navigateTo.js:18:16)
22:02:00 [0-0]       at Browser.wrappedCommand (/workspace/src/extensions/PageTriage/node_modules/devtools/build/devtoolsdriver.js:128:55)
22:02:00 [0-0]       at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
22:02:00 [0-0]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:02:00 [0-0]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:02:00 [0-0]       at async Page.openTitle (/workspace/src/extensions/PageTriage/node_modules/wdio-mediawiki/Page.js:22:3)
22:02:00 [0-0]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29) {
22:02:00 [0-0]     originalMessage: undefined
22:02:00 [0-0]   }
22:02:00 [0-0] } reason: Protocol error (Page.navigate): Target closed.
22:02:00 [0-0] ProtocolError: Protocol error (Page.navigate): Target closed.
22:02:00 [0-0]     at new Promise (<anonymous>)
22:02:00 [0-0]     at DevToolsDriver.navigateTo (/workspace/src/extensions/PageTriage/node_modules/devtools/build/commands/navigateTo.js:18:16)
22:02:00 [0-0]     at Browser.wrappedCommand (/workspace/src/extensions/PageTriage/node_modules/devtools/build/devtoolsdriver.js:128:55)
22:02:00 [0-0]     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
22:02:00 [0-0]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:02:00 [0-0]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:02:00 [0-0]     at async Page.openTitle (/workspace/src/extensions/PageTriage/node_modules/wdio-mediawiki/Page.js:22:3)
22:02:00 [0-0]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29) {
22:02:00 [0-0]   originalMessage: undefined
22:02:00 [0-0] }
22:02:00 [0-0] Unhandled Rejection at: Promise {
22:02:00 [0-0]   <rejected> Protocol error (Page.navigate): Target closed.
22:02:00 [0-0]   ProtocolError: Protocol error (Page.navigate): Target closed.
22:02:00 [0-0]       at new Promise (<anonymous>)
22:02:00 [0-0]       at DevToolsDriver.navigateTo (/workspace/src/extensions/PageTriage/node_modules/devtools/build/commands/navigateTo.js:18:16)
22:02:00 [0-0]       at Browser.wrappedCommand (/workspace/src/extensions/PageTriage/node_modules/devtools/build/devtoolsdriver.js:128:55)
22:02:00 [0-0]       at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
22:02:00 [0-0]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:02:00 [0-0]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:02:00 [0-0]       at async Page.openTitle (/workspace/src/extensions/PageTriage/node_modules/wdio-mediawiki/Page.js:22:3)
22:02:00 [0-0]       at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29) {
22:02:00 [0-0]     originalMessage: undefined
22:02:00 [0-0]   }
22:02:00 [0-0] } reason: Protocol error (Page.navigate): Target closed.
22:02:00 [0-0] ProtocolError: Protocol error (Page.navigate): Target closed.
22:02:00 [0-0]     at new Promise (<anonymous>)
22:02:00 [0-0]     at DevToolsDriver.navigateTo (/workspace/src/extensions/PageTriage/node_modules/devtools/build/commands/navigateTo.js:18:16)
22:02:00 [0-0]     at Browser.wrappedCommand (/workspace/src/extensions/PageTriage/node_modules/devtools/build/devtoolsdriver.js:128:55)
22:02:00 [0-0]     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
22:02:00 [0-0]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:02:00 [0-0]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29)
22:02:00 [0-0]     at async Page.openTitle (/workspace/src/extensions/PageTriage/node_modules/wdio-mediawiki/Page.js:22:3)
22:02:00 [0-0]     at async Browser.wrapCommandFn (/workspace/src/extensions/PageTriage/node_modules/@wdio/utils/build/shim.js:136:29) {
22:02:00 [0-0]   originalMessage: undefined
22:02:00 [0-0] }
22:02:03 [0-0] PASSED in chrome - /tests/selenium/specs/newpagesfeed.js
22:02:56 [0-1] Error in "PageTriage Toolbar.should allow user to add a maintainence tag to a page"
22:02:56 Error: Timeout of 60000ms exceeded. The execution in the test "PageTriage Toolbar should allow user to add a maintainence tag to a page" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/PageTriage/tests/selenium/specs/toolbar.js)
22:02:56     at createTimeoutError (/workspace/src/extensions/PageTriage/node_modules/mocha/lib/errors.js:498:15)
22:02:56     at Runnable._timeoutError (/workspace/src/extensions/PageTriage/node_modules/mocha/lib/runnable.js:429:10)
22:02:56     at Timeout.<anonymous> (/workspace/src/extensions/PageTriage/node_modules/mocha/lib/runnable.js:244:24)
22:02:56     at listOnTimeout (node:internal/timers:581:17)
22:02:56     at process.processTimers (node:internal/timers:519:7)
22:03:56 [0-1] Error in "PageTriage Toolbar.should allow user to add a maintainence tag to a page and send a note to user"
22:03:56 Error: Timeout of 60000ms exceeded. The execution in the test "PageTriage Toolbar should allow user to add a maintainence tag to a page and send a note to user" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/PageTriage/tests/selenium/specs/toolbar.js)
22:03:56     at createTimeoutError (/workspace/src/extensions/PageTriage/node_modules/mocha/lib/errors.js:498:15)
22:03:56     at Runnable._timeoutError (/workspace/src/extensions/PageTriage/node_modules/mocha/lib/runnable.js:429:10)
22:03:56     at Timeout.<anonymous> (/workspace/src/extensions/PageTriage/node_modules/mocha/lib/runnable.js:244:24)
22:03:56     at listOnTimeout (node:internal/timers:581:17)
22:03:56     at process.processTimers (node:internal/timers:519:7)
22:10:30 Build was aborted

https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/11326/console

20:34:10 [0-2] Error in "Client Hints.stores client hints data on successful logins"
20:34:10 Error: Timeout of 60000ms exceeded. The execution in the test "Client Hints stores client hints data on successful logins" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CheckUser/tests/selenium/specs/clienthints.js)
20:34:10     at createTimeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/errors.js:498:15)
20:34:10     at Runnable._timeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:429:10)
20:34:10     at Timeout.<anonymous> (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:244:24)
20:34:10     at listOnTimeout (node:internal/timers:581:17)
20:34:10     at process.processTimers (node:internal/timers:519:7)
20:35:10 [0-2] Error in "Client Hints.stores client hints data on account creation"
20:35:10 Error: Timeout of 60000ms exceeded. The execution in the test "Client Hints stores client hints data on account creation" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CheckUser/tests/selenium/specs/clienthints.js)
20:35:10     at createTimeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/errors.js:498:15)
20:35:10     at Runnable._timeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:429:10)
20:35:10     at Timeout.<anonymous> (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:244:24)
20:35:10     at listOnTimeout (node:internal/timers:581:17)
20:35:10     at process.processTimers (node:internal/timers:519:7)
20:36:10 [0-2] Error in "Client Hints.stores client hints data on temporary account auto-creation"
20:36:10 Error: Timeout of 60000ms exceeded. The execution in the test "Client Hints stores client hints data on temporary account auto-creation" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/CheckUser/tests/selenium/specs/clienthints.js)
20:36:10     at createTimeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/errors.js:498:15)
20:36:10     at Runnable._timeoutError (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:429:10)
20:36:10     at Timeout.<anonymous> (/workspace/src/extensions/CheckUser/node_modules/mocha/lib/runnable.js:244:24)
20:36:10     at listOnTimeout (node:internal/timers:581:17)
20:36:10     at process.processTimers (node:internal/timers:519:7)
21:27:26 Build timed out (after 60 minutes). Marking the build as failed.
21:27:26 Build was aborted

Unassigned from myself to reflect reality. I have read all comments and from what I understood we still don't know what's causing the problem. Test-Platform has decided to focus on T324766: Upgrade WebdriverIO to v8 in core for the next sprint. That might be a good mid to long term solution to the problem. Let's get closer to the upstream major version for now, and let's catch up as soon as possible.

Test-Platform has decided to focus on T324766: Upgrade WebdriverIO to v8 in core for the next sprint. That might be a good mid to long term solution to the problem. Let's get closer to the upstream major version for now, and let's catch up as soon as possible.

Thank you, this makes sense to me. Based on T389536#10752111, this issue might not even be selenium-related, but I imagine upgrading wdio would help us prove or disprove that theory. It's also worth noting that this was the only instance I've seen where the timeout does not happen while running browser tests.

Also, if people are curious as to how often this is happening: you can sort https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/buildTimeTrend by duration. Anything >= 1h is most probably (but not necessarily) an instance of this bug. We seem to have 21 occurrences for the last 7 days, for various repos (quite commonly in core) and both the test and gate-and-submit pipelines.

It also just happened to a patch in gate-and-submit that I +2ed:

https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/11852/console

12:05:46 Execution of 2 workers started at 2025-04-30T12:05:46.321Z
12:05:46 
12:05:46 [0-1] RUNNING in chrome - /tests/selenium/specs/toolbar.js
12:05:46 [0-0] RUNNING in chrome - /tests/selenium/specs/content_editable.js
12:06:11 [0-0] PASSED in chrome - /tests/selenium/specs/content_editable.js
12:49:41 Build timed out (after 60 minutes). Marking the build as failed.

This was on integration-agent-docker-1040, which interestingly did not show any resource usage while being stuck (the IO spike is after the job ended):

image.png (1×3 px, 360 KB)

I tried asking if anyone was around to get a process list or check chrome logs. But it turns out, the process of me nagging people in IRC, hoping to find a hero who's willing to help in the 20 minutes before times out, is not very effective or scalable.

I tried asking if anyone was around to get a process list or check chrome logs. But it turns out, the process of me nagging people in IRC, hoping to find a hero who's willing to help in the 20 minutes before times out, is not very effective or scalable.

Noting it here so it isn't lost: from IRC conversation with @hashar the other day, we could maybe print some debug information before the job gets killed, like the process list, and the chrome error log. No idea how hard this would be.

(And FTR, the issue just happened again and reset the gate-and-submit queue with 10 patches)

Unsure if related, but in https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/12434/console (for r1140485 in ext/Cite) I saw the following (using browser timezone):

18:45:02 Execution of 4 workers started at 2025-05-02T16:45:02.418Z
18:45:02 
18:45:03 [0-0] RUNNING in chrome - /tests/selenium/specs/editEventRegistration.js
18:51:04 [0-0] 2025-05-02T16:51:04.397Z ERROR @wdio/local-runner: Failed launching test session: Error: Protocol error (Runtime.evaluate): Session closed. Most likely the page has been closed.
18:51:04 [0-0]     at CDPSession.send (/workspace/src/extensions/CampaignEvents/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:218:35)
18:51:04 [0-0]     at ExecutionContext._evaluateInternal (/workspace/src/extensions/CampaignEvents/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:166:18)
18:51:04 [0-0]     at ExecutionContext.evaluate (/workspace/src/extensions/CampaignEvents/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:110:27)
18:51:04 [0-0]     at DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:230:36)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0]     at async DevToolsDriver.checkPendingNavigations (/workspace/src/extensions/CampaignEvents/node_modules/devtools/build/devtoolsdriver.js:251:9)
18:51:04 [0-0] RETRYING in chrome - /tests/selenium/specs/editEventRegistration.js

And then it passed when retrying, and all the other tests ran normally. Cross-referencing the load data (same timezone):

{F59625571}

There is a CPU spike around the time of the timeout, and memory usage goes down. No idea if it has anything to do with T281122.


(And FTR, the issue just happened again and reset the gate-and-submit queue with 10 patches)

This was https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/12404/console for r1140670 (ext/ContentTranslation). Using browser timezone:

17:36:54 [0-0] RUNNING in chrome - /tests/selenium/specs/helpbanner.js
17:37:54 [0-0] Error in "ImportPreview page.shows dismissible help banner"
17:37:54 Error: Timeout of 60000ms exceeded. The execution in the test "ImportPreview page shows dismissible help banner" took too long. Try to reduce the run time or increase your timeout for test specs (https://webdriver.io/docs/timeouts). (/workspace/src/extensions/FileImporter/tests/selenium/specs/helpbanner.js)
17:37:54     at createTimeoutError (/workspace/src/extensions/FileImporter/node_modules/mocha/lib/errors.js:498:15)
17:37:54     at Runnable._timeoutError (/workspace/src/extensions/FileImporter/node_modules/mocha/lib/runnable.js:429:10)
17:37:54     at Timeout.<anonymous> (/workspace/src/extensions/FileImporter/node_modules/mocha/lib/runnable.js:244:24)
17:37:54     at listOnTimeout (node:internal/timers:581:17)
17:37:54     at process.processTimers (node:internal/timers:519:7)
18:26:52 Build timed out (after 60 minutes). Marking the build as failed.

Cross-referencing:

image.png (1×3 px, 650 KB)

Change #1178950 had a related patch set uploaded (by Dduvall; author: Dduvall):

[mediawiki/core@master] DO NOT MERGE: attempting to repro T389536 in CI

https://gerrit.wikimedia.org/r/1178950

Change #1178950 abandoned by Dduvall:

[mediawiki/core@master] DO NOT MERGE: attempting to repro T389536 in CI

https://gerrit.wikimedia.org/r/1178950

https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/34124/console

00:16:53 [0-0] Error in "CheckUser.With CheckUser user group."before all" hook in "With CheckUser user group""
00:16:53 Navigation failed because browser has disconnected!
00:16:53 Error: Navigation failed because browser has disconnected!
00:16:53     at Array.map (<anonymous>)
00:16:53     at WebSocket.onClose (/workspace/src/extensions/CheckUser/node_modules/devtools/node_modules/ws/lib/event-target.js:210:18)
00:16:53     at WebSocket.emit (node:events:524:28)
00:16:55 [0-3] PASSED in chrome - /tests/selenium/specs/investigate.js
00:16:56 [0-1] PASSED in chrome - /tests/selenium/specs/checkuserlog.js
00:16:57 [0-4] PASSED in chrome - /tests/selenium/specs/tempAccountsOnboarding.js
01:09:05 Build timed out (after 60 minutes). Marking the build as failed.
01:09:05 Build was aborted

This seems also to affect cypress, so possible an issue with the browser

https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-selenium-php81/10622/console

20:51:19 INFO:quibble.commands:Running webdriver test in /workspace/src/extensions/Wikibase
[...]
20:52:56 Note: Overriding Cypress cache directory to: ./cypress/.cache
20:52:56 
20:52:56       Previous installs of Cypress may not be found.
20:52:56 
20:52:57 Installing Cypress (version: 14.5.3)
20:52:57 
20:52:58 [STARTED] Task without title.
20:53:00 [SUCCESS] Task without title.
20:53:00 [STARTED] Task without title.
21:09:16 Build was aborted

[Aborted by rebasing the patch set, to remove job from gate-and-submit queue, so the abort is not due to timeout, but very likly to happen]

Normal run
https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-selenium-php81/10418/console

00:29:58 INFO:quibble.commands:Running webdriver test in /workspace/src/extensions/Wikibase
[...]
00:30:51 Note: Overriding Cypress cache directory to: ./cypress/.cache
00:30:51 
00:30:51       Previous installs of Cypress may not be found.
00:30:51 
00:30:51 Installing Cypress (version: 14.5.3)
00:30:51 
00:30:51 [STARTED] Task without title.
00:30:53 [SUCCESS] Task without title.
00:30:53 [STARTED] Task without title.
00:30:59 [SUCCESS] Task without title.
00:30:59 [STARTED] Task without title.
00:30:59 [SUCCESS] Task without title.

While general CI slowness is a problem, we need to figure out why selenium timeouts are not terminating the build immediately.

I don't want to resurrect a massive new effort here (I've spent some time trying to read through the copious notes), but have two questions as I look to focus some effort:

  1. Does this still happen in the latest versions of WDIO-mediawiki or webdriverIO?

It looks like the previous Jenkins link above has stopped working (or I don't have access)

  1. Did we ever look into the very specific reason why a timeout isn't triggering a build failure? Or were we never able to get enough information on how Selenium treats timeouts.

While general CI slowness is a problem, we need to figure out why selenium timeouts are not terminating the build immediately.

I don't want to resurrect a massive new effort here (I've spent some time trying to read through the copious notes), but have two questions as I look to focus some effort:

  1. Does this still happen in the latest versions of WDIO-mediawiki or webdriverIO?

It was still happening in December, although I don't have more recent data. Typically, you can go to https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-selenium-php83/buildTimeTrend, sort by time, and check the 1-hour ones (they're not all necessarily due to this task). I don't see any of those for the last 7 days, but maybe we also didn't have many patches during this period; both as data points, and as extra load for CI (one of the theories being that the 1h timeout is more likely to happen when the CI load is higher).

It looks like the previous Jenkins link above has stopped working (or I don't have access)

Jenkins logs get garbage-collected after ~2 weeks, so the above have expired. I used to have some logs that I had marked as to be kept "forever" (e.g. in T389536#10676487), but I have apparently since deleted those as well. Probably because this issue kept happening, and it's usually easy to find a fresh example.

  1. Did we ever look into the very specific reason why a timeout isn't triggering a build failure? Or were we never able to get enough information on how Selenium treats timeouts.

A built timeout does trigger a build failure (failure by timeout), but that is at the 1-hour mark. This is obviously problematic because the job is typically much quicker than that, and sometimes the freeze happens like 5-10 minutes in. Back then @hashar had looked into reducing the CI timeout to something like 30 minutes, but that didn't work because there were jobs slower than that; I don't know if that is still the case today, especially now that T389998 is done.

A selenium timeout would also trigger a normal build failure.

The problem is that whatever causes selenium/wdio/the browser to freeze is NOT triggering a selenium timeout. It just, well, freezes (unclear what it's doing in the meantime), until the 1-hour build timeout kicks in.

I believe last year we tried to understand why that was the cause, trying different things such as upgrading wdio versions and changing browser options, but the root cause has not been determined AFAIK.

As for the question of whether it is still happening, a week and more data points later: yes, we have 3 examples from the last few days.

This also implies that selenium or browser tests are not the cause, given the timeout while running npm install.

I have not marked the builds above to be kept, folks can do it if they'd like to investigate the failures.

Thanks - this helps. Do you have a query or something you can share to help me see these easier?

@AMarkossyan-WMF this might be worth bringing into a Test Platform sprint to see if there's additional logging we could add. Like Daimona mentioned above, it's looking like it's not the browser tests themselves but something else within CI that is causing a timeout. Take a look at the Cypress output above (https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-selenium-php83/6041/consoleFull)

There's a bunch of tasks without titles and on one of them it times out the job. I'm going to see if I can dig in and find what's driving that CI and figure out what that step actually does.

Thanks - this helps. Do you have a query or something you can share to help me see these easier?

It's the link I shared above: https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-selenium-php83/buildTimeTrend Sort by time descending and check out the 1h ones; usually, most if not all of them are instances of this bug.