
Jenkins job mediawiki-extensions-qunit Karma timeout on odd-numbered build slaves
Closed, Resolved (Public)

Description

About 50% of builds of the mediawiki/core test job mediawiki-extensions-qunit are failing due to a Karma-related timeout.

Interestingly enough, all of the slaves that are consistently failing are odd-numbered:

  • integration-slave-trusty-1011
  • integration-slave-trusty-1013
  • integration-slave-trusty-1015

And those usually succeeding are even-numbered:

  • integration-slave-trusty-1012
  • integration-slave-trusty-1014
  • integration-slave-trusty-1016

I only looked at the most recent 15 or so builds, so it's conceivable that this is just a coincidence, or it could mean that Jenkins has become sentient and is rebelling against us. Also, mediawiki-core-qunit tests are fine on all of the build slaves.
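One way to check the odd/even pattern against more than a handful of builds is to tally outcomes by host parity. The following is a minimal sketch, not something used in this task: the function names and the `(host, passed)` input shape are assumptions, and the build results would have to be collected from Jenkins separately.

```python
import re
from collections import Counter


def host_parity(host):
    """Return 'odd' or 'even' based on the trailing number of a host name."""
    match = re.search(r"(\d+)$", host)
    if match is None:
        raise ValueError(f"no trailing number in host name: {host!r}")
    return "odd" if int(match.group(1)) % 2 else "even"


def tally_by_parity(results):
    """Count (parity, outcome) pairs from an iterable of (host, passed) tuples."""
    counts = Counter()
    for host, passed in results:
        outcome = "pass" if passed else "fail"
        counts[(host_parity(host), outcome)] += 1
    return counts
```

If the counts for `("even", "fail")` stay near zero over a larger sample, the pattern is probably real; if they do not, it was likely a coincidence of a small sample.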

02:56:06 Running "karma:main" (karma) task
02:56:06 26 12 2015 02:56:06.827:INFO [karma]: Karma v0.13.10 server started at http://localhost:9876/
02:56:06 26 12 2015 02:56:06.870:INFO [launcher]: Starting browser Chrome
02:56:12 26 12 2015 02:56:12.424:INFO [Chromium 47.0.2526 (Ubuntu 0.0.0)]: Connected on socket x8rKNUuoFpv7cATqAAAA with id 64277732
02:57:12 26 12 2015 02:57:12.443:WARN [Chromium 47.0.2526 (Ubuntu 0.0.0)]: Disconnected (1 times), because no message in 60000 ms.
02:57:12 Chromium 47.0.2526 (Ubuntu 0.0.0): Executed 0 of 0 DISCONNECTED (1 min 0.018 secs / 0 secs)
02:57:12 Warning: Task "karma:main" failed. Use --force to continue.
02:57:12 
02:57:12 Aborted due to warnings.
02:57:12 Build step 'Execute shell' marked build as failure
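To spot this failure mode across many console logs, the disconnect warning can be extracted with a regular expression. This is a rough sketch that assumes the warning always has the shape shown in the log above; `find_disconnects` is a hypothetical helper name, not part of Karma or Jenkins.

```python
import re

# Matches Karma warnings of the form seen in the console log above, e.g.
# "WARN [Chromium ...]: Disconnected (1 times), because no message in 60000 ms."
DISCONNECT_RE = re.compile(
    r"WARN \[(?P<browser>[^\]]+)\]: Disconnected \((?P<count>\d+) times\), "
    r"because no message in (?P<timeout_ms>\d+) ms\."
)


def find_disconnects(log_text):
    """Return one dict per browser disconnect warning found in the log text."""
    return [
        {
            "browser": m["browser"],
            "count": int(m["count"]),
            "timeout_ms": int(m["timeout_ms"]),
        }
        for m in DISCONNECT_RE.finditer(log_text)
    ]
```

A non-empty result together with "Executed 0 of 0" suggests the browser connected but the test page never reported back within the timeout.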

Event Timeline

Unicornisaurous raised the priority of this task to High.
Unicornisaurous updated the task description.
Unicornisaurous added a subscriber: Unicornisaurous.
Restricted Application added a subscriber: Aklapper. Dec 26 2015, 5:00 AM
hashar added a subscriber: hashar. Dec 26 2015, 10:05 AM

I am merely acknowledging this issue. It is unlikely to get fixed anytime soon since I am traveling right now. On a first diagnosis I can't find anything specific, though I had restarted the puppetmaster.

The tmpfs dir is empty and there is enough disk space. I restarted Xvfb on trusty-1011 and rebooted trusty-1015.

Will look again more seriously tonight (CET).

Change 261037 had a related patch set uploaded (by Unicornisaurous):
Update karma

https://gerrit.wikimedia.org/r/261037

Looks like Jenkins was playing tricks on me, as an even-numbered build machine has also failed the same way. I suspect increasing the timeout will help, as in the patch linked here. (The patch isn't actually mine; gerritbot just thinks it is because I added this bug to its commit message.)

Increasing the timeout may work around the problem until the underlying cause is fixed. Some tests may still fail, but the number of failing tests will hopefully go down.

hashar added a comment. Edited Dec 26 2015, 8:50 PM

From one of the MediaWiki debug logs of the instance running under Apache (which QUnit hits), such as https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/24605/artifact/log/mw-debug-www.log/*view*/, I got:

17.6132  22.8M  HTTP: GET: https://meta.wikimedia.org/w/api.php?action=jsonschema&revid=12114785&formatversion=2
[http] Error fetching URL: Failed to connect to webproxy.eqiad.wmnet port 8080: Connection timed out
[EventLogging] Request to https://meta.wikimedia.org/w/api.php?action=jsonschema&revid=12114785&formatversion=2 failed.

So it is EventLogging hitting the production site meta.wikimedia.org via webproxy.eqiad.wmnet:8080, which fails for some reason. Trying on integration-slave-trusty-1011:

curl --verbose --proxy webproxy.eqiad.wmnet 'https://meta.wikimedia.org/w/api.php?action=jsonschema&revid=12114785&formatversion=2'
* Hostname was NOT found in DNS cache
*   Trying 208.80.154.10...
*   Trying 2620:0:861:1:208:80:154:10...
* Immediate connect fail for 2620:0:861:1:208:80:154:10: Network is unreachable
<stall>

And carbon is no longer reachable from the instance over IPv4 either, for some reason:

curl -4 --verbose --proxy webproxy.eqiad.wmnet 'https://meta.wikimedia.org/w/api.php?action=jsonschema&revid=12114785&formatversion=2'
* Hostname was NOT found in DNS cache
*   Trying 208.80.154.10...

Thanks for tracing the problem, or at least where it is coming from. Could that mean that the DNS was accidentally removed?

So it seems the proxy host webproxy.eqiad.wmnet:8080 is unresponsive. Icinga has a Squid check (TCP OK - 0.001 second response time on port 8080) which has been green for 93 days.

I can ping the host just fine, at least. Maybe a firewall rule? Will file another bug for Operations.
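A successful ping only shows ICMP reachability, so a direct TCP connect against the proxy port helps separate "host down" from "port filtered". The sketch below is an illustration under assumptions: `tcp_port_open` is a hypothetical helper, and the 5-second default timeout is arbitrary. Passing the port explicitly also sidesteps curl's documented behaviour of assuming port 1080 when the proxy string carries no port, which may be relevant to the curl attempts above.

```python
import socket


def tcp_port_open(host, port, timeout_s=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks and DNS failures.
        return False
```

Run from an affected instance, `tcp_port_open("webproxy.eqiad.wmnet", 8080)` returning False while ping succeeds would point towards a firewall rule or a service-level problem rather than a dead host.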

Change 261096 had a related patch set uploaded (by Hashar):
mwconf: no more set $wgHTTPProxy

https://gerrit.wikimedia.org/r/261096

Change 261096 merged by jenkins-bot:
mwconf: no more set $wgHTTPProxy

https://gerrit.wikimedia.org/r/261096

hashar closed this task as Resolved. Dec 26 2015, 9:52 PM
hashar claimed this task.

Should be fixed now.

So it was potentially a race condition, with EventLogging doing a bunch of queries that each time out on their own and eventually add up to the 60-second global timeout. That is worth another investigation of its own.

Sometimes the queries were bailing out fast enough to stay under the 60-second timeout and the job would continue. Sometimes they would not, and some web query would time out in Karma.
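The arithmetic behind this explanation can be sketched as follows. This is an illustration only: it assumes the failing requests block one after another, it takes the 60-second figure from the Karma log in the description, and the per-request wait times have to be supplied by the caller since the task does not record them.

```python
def exceeds_budget(request_waits_s, budget_s=60.0):
    """Return True if sequential request waits add up to at least budget_s seconds.

    request_waits_s: how long each failing request blocked before giving up.
    budget_s: the overall no-activity window (60 s in the Karma log above).
    """
    return sum(request_waits_s) >= budget_s
```

Under that assumption, whether a build passes depends only on how quickly each proxy request gives up, which would explain why the same job fails intermittently rather than always.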

All solved now.

Change 261476 had a related patch set uploaded (by Hashar):
contint: remove maven webproxy

https://gerrit.wikimedia.org/r/261476

Change 261476 merged by Faidon Liambotis:
contint: remove maven webproxy

https://gerrit.wikimedia.org/r/261476