
Non-deterministic unit test "streamInSample() - session sampling resets"
Closed, Resolved · Public

Description

There is this unit test:

tests/ext.eventLogging/stream.test.js
QUnit.test( 'streamInSample() - session sampling resets', function ( assert ) {
	var conf, tot = 0, i;

	conf = {
		sample: {
			rate: 0.5,
			unit: 'session'
		}
	};

	for ( i = 0; i < 20; i++ ) {
		tot += mw.eventLog.streamInSample( conf );
		mw.eventLog.id.resetSessionId();
	}

	assert.notEqual( tot, 20 );
	assert.notEqual( tot, 0 );
} );

If I understand correctly, this generates a random true/false value 20 times and checks that it got at least one true and one false. The test fails only when all 20 values agree (all true or all false), which happens with probability 2 × 0.5^20 = 2^-19, i.e. about 0.0002%. Seems low, but it just failed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/711505 (https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php73-noselenium-docker/48968/console), so I guess I'm pretty unlucky (or my math is wrong).

It was added in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/659060 about a year ago. There are also a few other tests that look similar, but I didn't try to figure out if they also have this issue.

Failure for reference:

23:40:11 FAILED TESTS:
23:40:11   ext.eventLogging/stream
23:40:11     ✖ streamInSample() - session sampling resets
23:40:11       Chrome Headless 90.0.4430.212 (Linux x86_64)
23:40:11     Expected: 20
23:40:11     Actual: 20
23:40:11         at Object.<anonymous> (http://localhost:9876/load.php?modules=oojs-ui-core.icons%2Cstyles%7Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Cpapaparse%2Crangefix%2Csinonjs%2Cspark-md5%2CtreeDiffer%2Cunicodejs%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-accessibility%2Cicons-alerts%2Cicons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cicons-editing-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-layout%2Cicons-media%2Cicons-moderation%2Cicons-movement%2Cicons-user%2Cicons-wikimedia%2Cindicators%7Cskins.minerva.messageBox.styles%7Csocket.io%7Ctest.CentralAuth%2CCite%2CDiscussionTools%2CEcho%2CEventLogging%2CGuidedTour%2CMediaWiki%2CMinervaNeue%2CTemplateData%2CVisualEditor%2CWikiEditor%2CWikimediaEvents&version=1q8ym:2672:216)
23:40:11         at runTest (node_modules/qunit/qunit/qunit.js:2496:35)
23:40:11         at Test.run (node_modules/qunit/qunit/qunit.js:2479:9)
23:40:11         at node_modules/qunit/qunit/qunit.js:2770:16
23:40:11         at processTaskQueue (node_modules/qunit/qunit/qunit.js:2051:26)
23:40:11         at node_modules/qunit/qunit/qunit.js:2055:13

(Also, the failure message states the opposite of what it should – the expected value is anything but 20. This looks like a QUnit issue though.)

Event Timeline

phuedx triaged this task as Medium priority. Nov 15 2023, 3:16 PM

Dang, I wish there were a way to set a seed in JS.

Okay, so there are a few components being tested here:

  • A session ID can be generated (by this algorithm) and accessed
  • That session ID can be turned into a number between 0 and 1 by normalizeId() (which is just parseInt( id.slice( 0, 8 ), 16 ) / UINT32_MAX and has been confirmed to have the desired properties)
  • We can do a basic inequality check against the rate (see the sketch just after this list)
  • If we repeat this process multiple times, hopefully not all of the results are in (or all out of) sample
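For concreteness, here's a minimal sketch of the first three pieces, reconstructed from the details in this task. This is not the actual extension source; the UINT32_MAX constant and the < comparison against the rate are taken from the descriptions above.

// Not the actual EventLogging source – a reconstruction from the details in this task.
var UINT32_MAX = 4294967295; // 2^32 - 1

function normalizeId( id ) {
	// First 8 hex characters of the session ID, mapped onto [0, 1]
	return parseInt( id.slice( 0, 8 ), 16 ) / UINT32_MAX;
}

function inSample( id, rate ) {
	// The algorithm uses a plain < comparison against the configured rate
	return normalizeId( id ) < rate;
}

// e.g. inSample( '0123456789abcdef0123', 0.5 ) === true (the ID normalizes to ~0.0044)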

I'm wondering if this unit test has much value and should be removed. (Gasp!)

The alternative would be to use statistics since this is a stochastic process. We could implement a confidence interval-based solution (happy to elaborate on that if needed) but it would be VERY computationally expensive – essentially requiring 20K session resets per run.
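To make that concrete, here is my rough assumption of what a single-run, simplified version of the confidence-interval check could look like. This is only an illustration, not a worked-out design, and it would presumably have to be manual or skipped in CI:

QUnit.test.skip( 'streamInSample() - observed rate matches configured rate', function ( assert ) {
	var conf = { sample: { rate: 0.5, unit: 'session' } },
		n = 20000,
		hits = 0,
		i, p, margin;

	for ( i = 0; i < n; i++ ) {
		if ( mw.eventLog.streamInSample( conf ) ) {
			hits++;
		}
		mw.eventLog.id.resetSessionId();
	}

	// Normal-approximation confidence interval; z ≈ 3.89 gives a ~99.99% interval,
	// so a correct implementation should fail this only extremely rarely.
	p = hits / n;
	margin = 3.89 * Math.sqrt( p * ( 1 - p ) / n );
	assert.true( Math.abs( p - conf.sample.rate ) <= margin );
} );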

From @phuedx on Slack:

> I wonder how valuable a test it is. Perhaps we could break it out into a couple of unit tests with stubs for the session ID generator and an (expensive?) test that tests it end-to-end that can be run manually?

Yeah, it's not a valuable test as-is. I was also thinking of providing some session IDs that we know to be (see the sketch just after this list):

  • in-sample at 50%
  • out-of-sample at 50%
  • in-sample at 1% and double checking that the same ID is also in-sample at 5%, 10%, 50% (I have previously verified that the algorithm behaves in a way that this property is satisfied)
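A hypothetical version of those stub-based tests. The stubbed method name and the this.sandbox stubbing mechanism are assumptions on my part, not the actual EventLogging/MediaWiki API; the point is just to feed streamInSample() session IDs whose normalized values we already know:

QUnit.test( 'streamInSample() - known session IDs', function ( assert ) {
	var conf = { sample: { rate: 0.5, unit: 'session' } },
		// parseInt( '00000001', 16 ) / UINT32_MAX ≈ 2.3e-10 → in-sample at 1%, 5%, 10%, 50%
		inSampleId = '00000001aaaaaaaaaaaa',
		// parseInt( 'c0000000', 16 ) / UINT32_MAX ≈ 0.75 → out-of-sample at 50%
		outOfSampleId = 'c0000000aaaaaaaaaaaa',
		// 'generateSessionId' is a placeholder for whatever the id module actually calls
		stub = this.sandbox.stub( mw.eventLog.id, 'generateSessionId' );

	stub.returns( inSampleId );
	mw.eventLog.id.resetSessionId();
	assert.ok( mw.eventLog.streamInSample( conf ), 'known low ID is in-sample at 50%' );

	stub.returns( outOfSampleId );
	mw.eventLog.id.resetSessionId();
	assert.notOk( mw.eventLog.streamInSample( conf ), 'known high ID is out-of-sample at 50%' );
} );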

Because yeah, an end-to-end test tests:

  • ID generation algorithm produces an ID that has the properties we desire / behaves in the way that we intend
  • ID processing algorithm works with the generated ID in the way we desire/intend

And the properties/behavior we want to test are something that we need a lot of simulated sessions per run, and a lot of simulated runs, to verify. Do we have that option? To have tests that aren't run by CI, only manually?

> Do we have that option? To have tests that aren't run by CI, only manually?

We do! We can write a QUnit test in its own file and mark it as skipped. If we want to run it, then we remove the mark, e.g.

QUnit.test.skip( 'mw.user.generateUserSessionID(): ...', ( assert ) => {
} );

// And if we want to run it...

QUnit.test( 'mw.user.generateUserSessionID(): ...', ( assert ) => {
} );

Alternatively, we can write a test in Node to be run via the CLI.
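For example, a manual CLI-only check could be a standalone script using Node's built-in test runner (node:test, Node 18+). The ID generation and normalizeId() below are local stand-ins written for this sketch, not imports from the extension:

const test = require( 'node:test' );
const assert = require( 'node:assert' );
const crypto = require( 'crypto' );

const UINT32_MAX = 4294967295;
const normalizeId = ( id ) => parseInt( id.slice( 0, 8 ), 16 ) / UINT32_MAX;
const randomHexId = () => crypto.randomBytes( 10 ).toString( 'hex' ); // 20 hex chars

test( 'roughly half of random session IDs are in-sample at rate 0.5', () => {
	const n = 100000;
	let hits = 0;
	for ( let i = 0; i < n; i++ ) {
		if ( normalizeId( randomHexId() ) < 0.5 ) {
			hits++;
		}
	}
	// Loose bound: with n = 100000 the observed proportion should sit well within ±0.01
	assert.ok( Math.abs( hits / n - 0.5 ) < 0.01 );
} );

It would be run by hand with node (or node --test) and never wired into CI.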

Awesome! Okay, when I originally played around to get a sense of how expensive a procedure it might need to be, I generated random numbers, obtained the proportions "in-sample", calculated confidence intervals, and checked how many contained the true proportion. I just thought of a much less expensive procedure that might actually be feasible in a CI setup! (Although it might still be expensive enough that we'd want to keep it manual.)

Properties/behavior we care about:

  1. If a session is in-sample at 1% rate, it should be in-sample at 5%, 10%, 25%, etc.
  2. If rate is p, approx p of all sessions would be in-sample. This won't be exact but the more sessions there are (n → ∞), the sample proportion → p
  • Generate N session IDs (e.g. N=1000)
  • Convert them to decimals via normalizeId, yielding N numbers between 0 and 1
    • Because our algorithm uses <, property #1 will always hold.
    • Now we just need to worry about property #2
  • We don't actually need to test if ~10% of sessions are determined to be in sample when sample rate is 0.1, etc. – INSTEAD we just need to verify that the distribution of those N numbers (obtained from "normalizing" the randomly generated session IDs) is approximately Uniform(0,1).

We can use https://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/inference/KolmogorovSmirnovTest.html in the Java implementation

We can create our own in the Swift implementation and use SciPy's implementation (or the aforementioned Java or the JS versions) for reference.
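For the JS side, a self-contained sketch of that uniformity check might look something like the following. The inline KS statistic, the local normalizeId() copy, and the use of mw.user.generateRandomSessionId() are assumptions for illustration, not the extension's actual wiring:

// One-sample Kolmogorov–Smirnov statistic of `values` against Uniform(0, 1):
// the largest gap between the empirical CDF and the identity CDF.
function ksStatisticUniform( values ) {
	var sorted = values.slice().sort( function ( a, b ) {
			return a - b;
		} ),
		n = sorted.length,
		d = 0,
		i;

	for ( i = 0; i < n; i++ ) {
		d = Math.max( d, ( i + 1 ) / n - sorted[ i ], sorted[ i ] - i / n );
	}
	return d;
}

// Marked as skipped so it only runs manually, per the discussion above.
QUnit.test.skip( 'normalized session IDs look Uniform(0,1)', function ( assert ) {
	var N = 1000,
		values = [],
		i;

	// Local copy of normalizeId() as described earlier in this task:
	// parseInt of the first 8 hex chars, divided by UINT32_MAX.
	function normalizeId( id ) {
		return parseInt( id.slice( 0, 8 ), 16 ) / 4294967295;
	}

	for ( i = 0; i < N; i++ ) {
		values.push( normalizeId( mw.user.generateRandomSessionId() ) );
	}

	// Asymptotic 5% critical value for the KS test: ~1.358 / sqrt(N)
	assert.true( ksStatisticUniform( values ) < 1.358 / Math.sqrt( N ) );
} );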

> Properties/behavior we care about:
>
>   1. If a session is in-sample at 1% rate, it should be in-sample at 5%, 10%, 25%, etc.

This isn't tested in either the JS or the Java Metrics Platform Client. I'll create a task.

>   2. If rate is p, approx p of all sessions would be in-sample. This won't be exact but the more sessions there are (n → ∞), the sample proportion → p
>   • Generate N session IDs (e.g. N=1000)
>   • Convert them to decimals via normalizeId, yielding N numbers between 0 and 1
>     • Because our algorithm uses <, property #1 will always hold.
>     • Now we just need to worry about property #2
>   • We don't actually need to test if ~10% of sessions are determined to be in sample when sample rate is 0.1, etc. – INSTEAD we just need to verify that the distribution of those N numbers (obtained from "normalizing" the randomly generated session IDs) is approximately Uniform(0,1).
>
> We can use https://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/inference/KolmogorovSmirnovTest.html in the Java implementation
>
> We can create our own in the Swift implementation and use SciPy's implementation (or the aforementioned Java or the JS versions) for reference.

I'll create a task for this (maybe an epic with a subtask per implementation). For the JS implementation, we should also be wary of the execution environment, i.e. we should have a test for Node and a test that can be run in a browser.

Once these tasks are Done™, then we can remove this test.

phuedx claimed this task.

AIUI this is Done™ so I'm being bold and closing this task as Resolved.

The JS-specific subtasks of T355052: [EPIC] Increase coverage of sampling tests are Done™ and supersede the original test. The original test was deleted in rEEVL801d4b553c81: Integrate mediawiki/libs/metrics-platform.