
Firefox CI tests keep failing in VE with Firefox 68
Closed, ResolvedPublic

Description

TLDR: Docker defaults to 64MB of shared memory, which is insanely low.

For example the first run here: https://gerrit.wikimedia.org/r/#/c/VisualEditor/VisualEditor/+/558531/ (https://integration.wikimedia.org/ci/job/visualeditor-node10-browser-docker/1196/console)

I've seen this at least 3 times in the past week.

14:32:16 Running "karma:firefox" (karma) task
14:32:16 17 12 2019 14:32:16.097:WARN [watcher]: All files matched by "/src/node_modules/qunit/qunit/qunit.js" were excluded or matched by prior matchers.
14:32:16 17 12 2019 14:32:16.179:INFO [karma-server]: Karma v3.1.3 server started at http://0.0.0.0:9876/
14:32:16 17 12 2019 14:32:16.179:INFO [launcher]: Launching browsers FirefoxHeadless with concurrency unlimited
14:32:16 17 12 2019 14:32:16.182:INFO [launcher]: Starting browser FirefoxHeadless
14:33:16 17 12 2019 14:33:16.186:WARN [launcher]: FirefoxHeadless have not captured in 60000 ms, killing.
14:33:16 17 12 2019 14:33:16.267:INFO [launcher]: Trying to start FirefoxHeadless again (1/2).
14:34:02 17 12 2019 14:34:02.255:INFO [Firefox 68.0.0 (Linux 0.0.0)]: Connected on socket C2YBo2DbLQsFFg3PAAAB with id 94570126
14:34:37 17 12 2019 14:34:37.260:WARN [Firefox 68.0.0 (Linux 0.0.0)]: Disconnected (0 times)reconnect failed before timeout of 5000ms (ping timeout)
14:34:37 Firefox 68.0.0 (Linux 0.0.0) ERROR
14:34:37   Disconnectedreconnect failed before timeout of 5000ms (ping timeout)
14:34:37 Firefox 68.0.0 (Linux 0.0.0): Executed 0 of 0 DISCONNECTED (35.005 secs / 0 secs)
14:34:37 17 12 2019 14:34:37.263:INFO [karma-server]: Restarting Firefox 68.0.0 (Linux 0.0.0) (1 of 2 attempts)
14:35:37 17 12 2019 14:35:37.490:WARN [launcher]: FirefoxHeadless have not captured in 60000 ms, killing.
14:35:37 17 12 2019 14:35:37.620:INFO [launcher]: Trying to start FirefoxHeadless again (2/2).
14:36:37 17 12 2019 14:36:37.626:WARN [launcher]: FirefoxHeadless have not captured in 60000 ms, killing.
14:36:37 17 12 2019 14:36:37.769:ERROR [launcher]: FirefoxHeadless failed 2 times (timeout). Giving up.
14:36:37 Warning: Task "karma:firefox" failed. Use --force to continue.

Happened again when the image got updated by mistake (T259925). It seems to be related to Firefox 68.

Image                                                               Firefox package
docker-registry.wikimedia.org/releng/node10-test-browser:0.6.0-s1   60.8.0esr-1~deb9u1
docker-registry.wikimedia.org/releng/node10-test-browser:0.6.2      68.11.0esr-1~deb9u1
docker-registry.wikimedia.org/releng/node10-test-browser:0.6.3-s3   78.10.0esr
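
The 64MB default mentioned in the TL;DR can be confirmed by looking at /dev/shm inside a throwaway container (a minimal check against a generic Debian image rather than the CI images above; the output shown is what a daemon without a default-shm-size override is expected to report):

$ docker run --rm debian:stretch df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
shm              64M     0   64M   0% /dev/shm

The per-container limit can be raised with docker run --shm-size, or daemon-wide via default-shm-size in /etc/docker/daemon.json.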

References:

https://github.com/karma-runner/karma-firefox-launcher/issues/104
https://bugzilla.mozilla.org/show_bug.cgi?id=1464690
Event Timeline

Esanders added a subscriber: Krinkle.

This is failing quite regularly now.

Change 559545 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] jjb: Temporarily roll back node10-test-browser uses to 0.6.0-s1

https://gerrit.wikimedia.org/r/559545

Change 559545 merged by jenkins-bot:
[integration/config@master] jjb: Temporarily roll back node10-test-browser uses to 0.6.0-s1

https://gerrit.wikimedia.org/r/559545

OK, I've bumped the jobs back down to the old image (which has Firefox 60 not 68), and all seems to now pass. We'll need to work out what's broken here and fix it before rolling them forward again.

thcipriani lowered the priority of this task from High to Medium. Jan 7 2020, 2:47 PM
thcipriani added a subscriber: thcipriani.

> OK, I've bumped the jobs back down to the old image (which has Firefox 60 not 68), and all seems to now pass. We'll need to work out what's broken here and fix it before rolling them forward again.

Lowering priority since there is a workaround in place -- feel free to override if I've misunderstood the current situation.

Jdforrester-WMF lowered the priority of this task from Medium to Low. Feb 24 2020, 5:38 PM
JTannerWMF moved this task from To Triage to Triaged on the VisualEditor board.
JTannerWMF added a subscriber: JTannerWMF.

Looks like the Release Engineering team is thinking about this task. If that is incorrect and there is an action for Editing, please let me know.

Change 619288 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: rollback node10-test-browser to 0.6.0-s1 [2]

https://gerrit.wikimedia.org/r/619288

Change 619288 merged by jenkins-bot:
[integration/config@master] jjb: rollback node10-test-browser to 0.6.0-s1 [2]

https://gerrit.wikimedia.org/r/619288

hashar renamed this task from "Firefox CI tests keep failing in VE" to "Firefox CI tests keep failing in VE with Firefox 68". Aug 10 2020, 12:35 PM
hashar updated the task description.

We would need a way to reproduce the issue. I tried, but it passes just fine on my machine:

$ docker run --rm -it -v "$(pwd):/src" --entrypoint=/src/node_modules/.bin/grunt docker-registry.wikimedia.org/releng/node10-test-browser:0.6.2 karma:firefox
Running "karma:firefox" (karma) task
10 08 2020 13:10:46.828:WARN [filelist]: All files matched by "/src/node_modules/qunit/qunit/qunit.js" were excluded or matched by prior matchers.
10 08 2020 13:10:46.978:INFO [karma-server]: Karma v5.0.9 server started at http://0.0.0.0:9876/
10 08 2020 13:10:46.979:INFO [launcher]: Launching browsers FirefoxHeadless with concurrency unlimited
10 08 2020 13:10:46.982:INFO [launcher]: Starting browser FirefoxHeadless
10 08 2020 13:10:48.902:INFO [Firefox 68.0 (Linux x86_64)]: Connected on socket 4-nCmjjFxaAvcTwBAAAA with id 28414039
................................................................................
................................................................................
................................................................................
......................................................................
Firefox 68.0 (Linux x86_64): Executed 310 of 310 SUCCESS (13.886 secs / 13.776 secs)

Done.
$
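
One speculative explanation for the difference with CI: if the daemon on this machine has a larger default-shm-size configured than the WMCS instances, the failure might be reproducible locally by pinning the container to Docker's stock 64m (the same command as above with an explicit --shm-size added; untested):

$ docker run --rm -it --shm-size 64m -v "$(pwd):/src" \
    --entrypoint=/src/node_modules/.bin/grunt \
    docker-registry.wikimedia.org/releng/node10-test-browser:0.6.2 karma:firefox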

This seems to be affecting jquery-client as well (a commit of mine was failing with a timeout, but now passes after a recheck).

https://gerrit.wikimedia.org/r/c/jquery-client/+/619134
https://integration.wikimedia.org/ci/job/generic-node10-browser-docker/1480/console (Build kept indefinitely)

… docker-registry.wikimedia.org/releng/node10-test-browser:0.6.3
…
08 08 2020 18:51:13.560:WARN [launcher]: FirefoxHeadless have not captured in 60000 ms, killing.
08 08 2020 18:51:13.712:INFO [launcher]: Trying to start FirefoxHeadless again (1/2).
08 08 2020 18:52:13.716:WARN [launcher]: FirefoxHeadless have not captured in 60000 ms, killing.
08 08 2020 18:52:13.823:INFO [launcher]: Trying to start FirefoxHeadless again (2/2).
08 08 2020 18:53:05.578:INFO [Firefox 68.0 (Linux x86_64)]: Connected on socket 2xwjbntcMwh2Z2-8AAAA with id 79031389
08 08 2020 18:53:37.556:WARN [Firefox 68.0 (Linux x86_64)]: Disconnected (0 times)reconnect failed before timeout of 2000ms (ping timeout)
Firefox 68.0 (Linux x86_64) ERROR

And after: https://integration.wikimedia.org/ci/job/generic-node10-browser-docker/1487/console

… docker-registry.wikimedia.org/releng/node10-test-browser:0.6.0-s1

Firefox 60.0 (Linux x86_64): Executed 5 of 5 SUCCESS (0.051 secs / 0.025 secs)

This is a tiny test and might make for an easier repro case locally. In any event, CI was (usually) passing on jquery-client with the newer Docker image as well for a while, so it's most likely a race condition of some kind.

Upstream: https://github.com/karma-runner/karma-firefox-launcher/issues/104

Upstream upstream: https://bugzilla.mozilla.org/show_bug.cgi?id=1464690

Still unfixed. :-(

We talked about this in the releng triage meeting. It looks like one of the two upstream bugs is fixed. This bug seems to show a workaround: https://github.com/karma-runner/karma-firefox-launcher/issues/104#issuecomment-748140630 (in Firefox 84[!]).

Seems like there are some cross team dependencies here that we should address to get this unstuck.

If bugzilla 1464690 is actually the issue, it has been marked fixed with labels:

firefox-esr78: wontfix
firefox-84: fixed

Our image has Firefox 68.12.0esr-1~deb9u1.

There was a comment about raising the amount of shared memory with --memory "16g" --shm-size "1g", instead of relying on whatever the default is for the Docker version we are running on the WMCS CI instances. So maybe it is just a matter of adding that parameter to docker run, or of checking the docker run default value / upgrading Docker.
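
Applied to the local repro command from the earlier comment, that would amount to something like this (a sketch only; the --shm-size part is what matters for /dev/shm, and whether the --memory limit is needed here at all is an open question):

$ docker run --rm -it --memory 16g --shm-size 1g -v "$(pwd):/src" \
    --entrypoint=/src/node_modules/.bin/grunt \
    docker-registry.wikimedia.org/releng/node10-test-browser:0.6.2 karma:firefox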

Change 673188 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[integration/config@master] jjb: Increase shared memory to 1g for node10-brwoser jobs

https://gerrit.wikimedia.org/r/673188

Not pretending to know what I'm doing, but would this patch fix the issue?

hashar claimed this task.

Firefox 68.0 (Linux x86_64): Executed 312 of 312 SUCCESS (15.866 secs / 15.737 secs)

Success! Thank you @Ladsgroup

Change 673188 merged by jenkins-bot:
[integration/config@master] jjb: increase shared memory to 1g for node10-browser jobs

https://gerrit.wikimedia.org/r/673188

The oojs/core and unicodejs repos have started failing their master branch CI. It seems to align with the JJB config change that moved CI from node10-test-browser:0.6.0-s1 (Firefox 60 esr) to 0.6.3 (Firefox 68 esr).

The reason these (and most JS lib?) repos are failing while VE passes is, I suspect, that 1g of memory is not enough in some cases.

For OOjs, UnicodeJS, etc. these seem to fail consistently.

https://gerrit.wikimedia.org/r/c/unicodejs/+/681070
https://gerrit.wikimedia.org/r/c/unicodejs/+/682020
https://gerrit.wikimedia.org/r/c/oojs/core/+/682018
etc.

EDIT 1: I first noticed the issue on unicodejs, which runs Chromium and Firefox side by side, so I thought maybe concurrency was the issue. However, oojs/core is like VE and runs them one after the other, and yet oojs/core is also consistently failing.

EDIT 2: @hashar Does this docker option control the amount of allowed memory, or does it reserve it specifically for this purpose? If the former, then it would be trivial to increase it to a larger share of the memory we generally set aside for Jenkins worker executors, e.g. closer to what we do for Quibble jobs, which are presumably given more than 1g as well. Giving it 2g (or 4g) might be fine in that case, given we effectively already reserve that at the worker level.

But if Docker explicitly reserves it, separately from what the container itself uses for regular memory allocations, then we might need to look more closely at the numbers.
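
On the reserve-vs-limit question: as far as I understand (an assumption worth double-checking, not verified against our setup), --shm-size only sets the maximum size of the tmpfs that Docker mounts at /dev/shm, and tmpfs pages are allocated lazily as files are written, so raising the flag should not reserve host memory up front. The Used column staying at zero right after start-up hints at this:

$ docker run --rm --shm-size 2g debian:buster df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
shm             2.0G     0  2.0G   0% /dev/shm

Whatever does get written to /dev/shm is presumably still charged against the container's memory limit, which would explain why the upstream comment pairs --shm-size with --memory.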

Change 682023 had a related patch set uploaded (by Krinkle; author: Krinkle):

[integration/config@master] Increase docker shm-size of node10-test-browser jobs from 1g to 2g

https://gerrit.wikimedia.org/r/682023

I updated the job ad-hoc and tried with 2g as well as 4g and it made no difference. I did get it to start once with 4g, but it still timed out. I'll file a separate task and assume for now that it is unrelated.

Change 682023 abandoned by Krinkle:

[integration/config@master] Increase docker shm-size of node10-test-browser jobs from 1g to 2g

Reason:

https://gerrit.wikimedia.org/r/682023

… and it's affecting most VisualEditor commits as well. I'll merge these tasks and re-open this. The issue was clearly not fixed, or perhaps only appeared to be fixed on one lucky run.

https://integration.wikimedia.org/ci/job/visualeditor-node10-browser-docker/2096/console
https://gerrit.wikimedia.org/r/c/VisualEditor/VisualEditor/+/683657
etc.

Change 683685 had a related patch set uploaded (by Krinkle; author: Jforrester):

[VisualEditor/VisualEditor@master] build: Re-disable Firefox testing

https://gerrit.wikimedia.org/r/683685

Change 683685 merged by jenkins-bot:

[VisualEditor/VisualEditor@master] build: Re-disable Firefox testing

https://gerrit.wikimedia.org/r/683685

Mentioned in SAL (#wikimedia-releng) [2021-04-30T02:37:29Z] <James_F> Docker: Publishing node10 images based on buster T278203 T240955

Change 683936 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (831a75cdb)

https://gerrit.wikimedia.org/r/683936

Change 683944 had a related patch set uploaded (by Jforrester; author: Jforrester):

[VisualEditor/VisualEditor@master] build: Re-enable Firefox testing

https://gerrit.wikimedia.org/r/683944

Change 683944 merged by jenkins-bot:

[VisualEditor/VisualEditor@master] build: Re-enable Firefox testing

https://gerrit.wikimedia.org/r/683944

Change 683936 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (831a75cdb)

https://gerrit.wikimedia.org/r/683936

Change 692457 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (0c024f431)

https://gerrit.wikimedia.org/r/692457

https://bugzilla.mozilla.org/show_bug.cgi?id=1464690 is fixed:

> I'm going to mark this fixed as Firefox 84 ships with bug 1440203 which should work around broken Docker configurations. If you're still seeing this, or a similar issue, please file a new bug as the underlying cause must be different.

But Buster comes with version 78, so we will have to keep the shm parameter around.

The VisualEditor build has Firefox again, so I guess this is resolved entirely.

Change 692457 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (0c024f431)

https://gerrit.wikimedia.org/r/692457