Page MenuHomePhabricator

MediaWiki jobs intermittently fail due to "Maximum call stack size exceeded" from timestampAutoUpdate in QUnit test
Closed, ResolvedPublicPRODUCTION ERROR

Description

Seen several times on Friday and again today.

Example from https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/NavigationTiming/+/461014/ at https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/5877/consoleFull#console-section-2

00:09:46.806   {
     "message": "
Uncaught RangeError: Maximum call stack size exceeded\nat ...)
    at proxy (eval at createProxy (...), <anonymous>:1:35)
        at timestampAutoUpdate (...)
            at ...)
            at proxy (eval at createProxy (...), <anonymous>:1:35)
        at timestampAutoUpdate (...)
            at ...)
            at proxy (eval at createProxy (...), <anonymous>:1:35)"


00:09:46.939 Warning: Task "karma:main" failed. Use --force to continue.
..
00:09:47.461 subprocess.CalledProcessError: Command '['./node_modules/.bin/grunt', 'qunit']' returned non-zero exit status 3

Event Timeline

Krinkle created this task.Sep 18 2018, 1:58 AM
Restricted Application added a project: Growth-Team. · View Herald TranscriptSep 18 2018, 1:58 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Krinkle updated the task description. (Show Details)Sep 18 2018, 1:58 AM
Krinkle updated the task description. (Show Details)
Krinkle triaged this task as Unbreak Now! priority.Oct 2 2018, 5:08 PM

And again on various repos today.

This comes from https://gerrit.wikimedia.org/g/mediawiki/extensions/Flow/+/d1c646aee7f149861132a06e5f4212cc7abdcb43/modules/engine/misc/flow-handlebars.js. Please fix, or avoid loading this code during the QUnit tests.

This is overdue, marking UBN. If not fixed by tomorrow, I'll temporarily remove Flow from the shared test job in Jenkins, as we've done for others in the past.

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptOct 2 2018, 5:08 PM

Change 464636 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/Flow@master] flow-handlebars: Disable timestampAutoUpdate() when QUnit is active

https://gerrit.wikimedia.org/r/464636

SBisson added a subscriber: SBisson.Oct 5 2018, 2:51 PM

@Krinkle are you able to reproduce this failure locally? If so, how?

From https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/5877/consoleFull#console-section-2 it looks like the last passing test is Stale response caching - T117587 so I'm guessing Stale response caching - backcompat might be the one failing but it's passing for me locally. I don't know for sure if they run in the definition order and I might be reading the build output wrong.

I'm fine with the patch, in the "terrible hacks" category, and I don't mind merging it but I agree that this file shouldn't be loaded in the first place so there might be a clean solution that's relatively low-hanging. I'll look into it if I can just reproduce the failure.

Change 464636 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] flow-handlebars: Disable timestampAutoUpdate() when QUnit is active

https://gerrit.wikimedia.org/r/464636

Krinkle added a comment.EditedOct 6 2018, 12:08 AM

@Krinkle are you able to reproduce this failure locally? If so, how?

I have not yet tried to reproduce it locally.

From https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/5877/consoleFull#console-section-2 it looks like the last passing test is Stale response caching - T117587 so I'm guessing Stale response caching - backcompat might be the one failing but it's passing for me locally.

Yeah, the order of the tests probably won't help. The build is not failing due to a failed assertion from QUnit. It is failing due to JavaScript code that runs between tests. My understanding is currently as follows:

  • The test runner loads qunit.js, the registered test files, and their dependencies. One of the dependencies contains flow-handlebars.js, which calls $() to schedule a function, that will schedule a timer.
  • After a test file's dependencies have finished their initial execution, the test files will execute. Test files mainly call QUnit.test(), which registers the test name and closure under the last-called QUnit.module.
  • Once all dependencies and test files have executed, the browser lets the dom-ready event fire. The first callback for that event is QUnit starting test execution. This will process uninterrupted until an asynchronous test is encountered, at which QUnit pauses until either the test calls its "done" callback, or the test's internal timeout callback is called.
  • While whatever async test is busy, the other callbacks for the browsers dom-ready event finally run. Once of those is probably flow's timestampAutoUpdate. The function runs and schedules another timer.
  • QUnit will resume at some point until the next async test starts. Then it will yield like before and eventually continue like before. At some point during these tests, the browser will find that timestampAutoUpdate's timeout of 100ms or 5000ms has been reached, and queues it next on the event loop. Then, at the first yield from QUnit tests after that, it will execute.
  • The timestampAutoUpdate function executes again. This time, it recurses many times until the JavaScript engine aborts it, throws an uncaught exception, and Karma closes the browser and marks the build as failed.

I suspect the reason for the recursion of timestampAutoUpdate may relate to Sinon. Among Sinon's features is to let a test create "fake clocks" that can intercept calls to setTimeout within that test and allow the test to control the progression of time and callback invocation.

Sinon cleanly starts and ends these sandboxes before and after each test, leaving the clock untampered with in unrelated source code executing between tests and outside tests. If the flow-handlebar code were started and ended by a test suite, then it would be fine, either with or without fake clocks. But, because it starts independently and interleaves between test code, it ends up trapped in one of the many tests that use fake clocks.

Krinkle lowered the priority of this task from Unbreak Now! to Needs Triage.Oct 6 2018, 12:08 AM
Krinkle closed this task as Resolved.Jan 22 2019, 3:18 AM
Krinkle claimed this task.
Krinkle removed a project: Patch-For-Review.
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM

This error is almost always means you have a problem with recursion in JavaScript code, as there isn't any other way in JavaScript to consume lots of stack. Sometimes calling a recursive function over and over again, causes the browser to send you Maximum call stack size exceeded error message as the memory that can be allocated for your use is not unlimited. It's possible to cause infinite recursion in a fully promisified code, too. That can happen if the promises in a chain don't actually perform any asynchronous execution , in which case control never really returns to the event loop, even though the code otherwise appears to be asynchronous. That's when it's useful to wrap your recursive function call into a -

  • setTimeout
  • setImmediate or
  • process.nextTick

Also, you can localize the issue by setting a breakpoint on RangeError type of exception , and then adjust the code appropriately. Moreover, you can managed to find the point that was causing the error by check the error details in the Chrome dev toolbar console , this will give you the functions in the call stack, and guide you towards the recursion that's causing the error.