Page MenuHomePhabricator

QUnit job unable to collect logs, test still has active http requests writing to them
Closed, ResolvedPublic

Description

Somteimes, MobileFrontens qunit job fails due to the following error:
20:11:02 ERROR: Failed to archive artifacts: log/*
20:11:02 java.io.IOException: java.util.concurrent.ExecutionException: java.io.IOException: request to write '2315' bytes exceeds size in header of '2262' bytes for entry 'log/mw-debug-www.log'

Examples:
https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-qunit-mobile/8100/consoleFull
https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-qunit-mobile/8078/consoleFull
https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-qunit-mobile/8076/consoleFull

Jenkins is voting -2 and prevent from merging changes :( It seems to be not a problem of MobileFrontends codebase, or?

Event Timeline

Florian raised the priority of this task from to Needs Triage.
Florian updated the task description. (Show Details)
Florian changed Security from none to None.
Florian added subscribers: Florian, Jdlrobson.
hashar triaged this task as Unbreak Now! priority.

MediaWiki is configured to write debug log under /log/ which is writable by both www-data (for qunit jobs) and jenkins-slave (for PHPUnit in cli).

For one of the run that fails I can see two files:

 www-data www-data          mw-debug-www.log
 jenkins-slave jenkins-slave mw-debug-www.log.gz


Seems we have a race condition between an Apache thread that writes to the mw-debug-www.log while the Jenkins artifact publishers attempts to tar it.  That is a known tar error.  That goes as follow:

* test does an apache request
* qunit succeed
* Jenkins gzip the files under /log/
* the long apache request writes to mw-debug-www.log which create it
* Jenkins artifact publisher list the files under /log/ and tar them
* the long apache request writes to mw-debug-www.log while tar is busy processing the file
* tar complains that the file has been changed


The fix would be to have our qunit special page to wait for all Apache requests to have completed.

Will rephrase bug and unassign myself since I have not much JavaScript to figure it out :(
hashar renamed this task from Some MobileFrontend qunit builds failed to archive artifacts (voting -2) to MediaWiki QUnit test does not wait for all requests to complete, causing a race condition in Jenkins.Dec 15 2014, 9:14 PM
hashar removed hashar as the assignee of this task.
hashar edited projects, added MediaWiki-General; removed Jenkins.
hashar added a subscriber: Krinkle.

Sounds like an issue with the code. Unit tests should mock anything they don't depend on, and wait for anything they do depend on. If there are animations, asynchronous workers, and ajax requests still running when the last QUnit test finishes, the test is flawed.

The QUnit testrunner actually throws an error if there are pending animations and ajax requests at the end of a test, so it sounds like maybe this ajax request is fired off asynchronously after the test is finished?

Have we seen this in other projects, or only MobileFrontend? That would help narrow down where the culprit is.

This is frequently hitting us today. Cookie for whoever can fix it! :)

This is frequently hitting us today. Cookie for whoever can fix it! :)

As said in T78590#850957, please investigate in MobileFrontend by running the tests locally and looking for outgoing http requests (e.g. in Chrome Dev Tools) on Special:JavaScriptTest. I suspect the MF functions called by your tests are firing off AJAX requests without providing a callback or promise (or it does, but the test ignores it). In that case it's the test that should wait for it. There's nothing I can do from the Jenkins side to change that.

Krinkle renamed this task from MediaWiki QUnit test does not wait for all requests to complete, causing a race condition in Jenkins to QUnit job unable to collect logs, test still has active http requests writing to them.Dec 18 2014, 11:54 PM

@Krinkle this is not the same bug.
What does a log file running out of space have to do with testing locally?
See https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-qunit-mobile/8449/console for another example

(you'll see the tests pass but it is Archiving artifacts that fails)

(Running locally I'm seeing no ajax requests. I just verified that again.)

@Krinkle this is not the same bug.
What does a log file running out of space have to do with testing locally?
See https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-qunit-mobile/8449/console for another example

Nobody said anything about running out of space. As pointed out by @hashar earlier, the problem is that Jenkins has been given the signal by your test suite that the test is done. At that point it starts collecting log files into a tar ball (which individual developers can download in case of failures to see exactly what MediaWiki was doing).

tar doesn't tolerate the files being archived changing at the same time. And since MediaWiki is still receiving http requests, because theres still activity after the test was "done", it still writes to the log, and the tar fails.

@Krinkle I'm not seeing anything obvious so if you can find the time to take a look at our tests and see if there is something obvious, direction here would be appreciated

The thing that is strange is that this has only recently started happening, hence why I suspected (maybe incorrectly) that something changed in Jenkins.

I tried to figure out, what @Krinkle said.

What i did: I added a QUnit.done() callback on top of the test_Page.js (it should doesn't matter, in which test file it's added), which add a console.log with some test details and a timeline marker for Google chrome (console.timeStamp( 'QunitDone' );).

I run the test locally and looked at the timeline of chrome. There are a lot of things and somewhere near the end i had the QUnitDone event fired by a timer. So far, so good.

Now, i think that can be a/the problem: There are two timer events, which are fired after the QUnitDone event. Both comes from Panel.js on line 57 which is a timeout to remove the visible class of the panel and emit the hide event.

The only modules, which extends Panel in some way i can remember is:

  • Drawer: for which we have no tests, so i can't think, that this happens these events
  • AbuseFilterPanel: same as Drawer, (unhappily) no tests for it
  • WikiGrokDialog(B): For which we have tests (test_WikiGrokDialog.js / test_WikiGrokDialogB.js), but doesn't seem to be the problem, because Panel.hide is stubbed correctly.
  • references drawer: extends Drawer, which extends Panel and there are tests for it

So i tested it again after i removed the complete test and run the same thing as described above. Without these test there isn't any fired events after the QUnitDone marker.

So, if i interpret it correctly (and didn't missed anything), we should extend references test case to stub self.hide() and match the QUnit requirements. Can someone confirm or refute it? :)

I suspect it might not be an ajax request but a JavaScript timeout somewhere that might be causing this issue.

Jdlrobson claimed this task.

We haven't seen this for a while. This might be related to T84922 or luck. Closing and will reopen if it resurfaces.