Page MenuHomePhabricator

"Minerva AB test" blocking merge in unrelated repos
Closed, ResolvedPublic


Failure message:

15:47:49   Minerva AB-test
15:47:49     ✖ Bucketing test
15:47:49       HeadlessChrome 73.0.3683 (Linux 0.0.0)
15:47:49     test control group is about 25% (30.8%)
15:47:49     Expected: true
15:47:49     Actual: false

This seems to be doing something stochastic, which is not cool in a test and especially not in a gated repo. From tests/qunit/skins.minerva.scripts/AB.test.js,

    ( userBuckets.unsampled / maxUsers > 0.4 ) &&
    ( userBuckets.unsampled / maxUsers < 0.6 ),
    true, 'test unsampled group is about 50% (' + userBuckets.unsampled / 10 + '%)' );

    ( userBuckets.control / maxUsers > 0.2 ) &&
    ( userBuckets.control / maxUsers < 0.3 ),
    true, 'test control group is about 25% (' + userBuckets.control / 10 + '%)' );

    ( userBuckets.treatment / maxUsers > 0.2 ) &&
    ( userBuckets.treatment / maxUsers < 0.3 ),
    true, 'test new treatment group is about 25% (' + userBuckets.treatment / 10 + '%)' );

The right approach is to manipulate the random number generator, mock the internals of the class, whatever it takes, but failing some percentage of the time is a problem for everyone.


Related Gerrit Patches:
mediawiki/skins/MinervaNeue : masterEmbrace headless node-qunit tests

Event Timeline

awight created this task.Jun 5 2019, 2:32 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 5 2019, 2:32 PM

Looks like we can just pass a handful of static session IDs into the function and make sure they each map to the expected bucket.

@awight, thanks for reporting. Is this failing frequently?

awight added a comment.Jun 5 2019, 3:56 PM

@awight, thanks for reporting. Is this failing frequently?

No, it's not particularly urgent, but according to my napkin math it would have failed something like 0.1% - 1% of the time. This adds up quickly if many gated extensions are doing similar things, so I think it's worth fixing in the medium term. With constant session IDs, you could get stronger guarantees about code correctness, with no chance for spurious failure.

Krinkle triaged this task as High priority.Jun 10 2019, 3:21 PM
Krinkle added a subscriber: Krinkle.

@awight, thanks for reporting. Is this failing frequently?

Flaky tests should be disabled. Breaking 1% of merges, cherry picks, and deployments is actively unproductive, especially with those actions already taking 15-20 minutes, and a 1% failure rate very rapidly increases the time it takes to do these actions given that it often depends on several commits passing together in the parallel Zuul gating mechanism.

If it doesn't fail frequently, that might inform the priority as to when it should be fixed, but it should be disabled first regardless unless a fix is imminent (e.g. less than a day or so?)

Is this still an issue?

ovasileva lowered the priority of this task from High to Normal.Jun 25 2019, 4:18 PM
ovasileva moved this task from Incoming to Needs Prioritization on the Readers-Web-Backlog board.

Change 520651 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] WIP: Embrace packageFiles and headless node-qunit tests

Jdlrobson closed this task as Resolved.Jul 8 2019, 4:01 PM
Jdlrobson claimed this task.
Jdlrobson raised the priority of this task from Normal to High.
Jdlrobson added a subscriber: Jdlrobson.

Change 523755 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] Embrace packageFiles

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:06 PM