
Non-deterministic unit test "streamInSample() - session sampling resets"
Closed, Resolved · Public

Description

There is this unit test:

tests/ext.eventLogging/stream.test.js
QUnit.test( 'streamInSample() - session sampling resets', function ( assert ) {
	var conf, tot = 0, i;

	conf = {
		sample: {
			rate: 0.5,
			unit: 'session'
		}
	};

	for ( i = 0; i < 20; i++ ) {
		tot += mw.eventLog.streamInSample( conf );
		mw.eventLog.id.resetSessionId();
	}

	assert.notEqual( tot, 20 );
	assert.notEqual( tot, 0 );
} );

If I understand correctly, this generates a random true/false value 20 times and checks that it got at least one true and one false. The test fails only when all 20 values agree (all true or all false), which happens with probability 2 × 0.5^20 = 2^-19, i.e. about 0.0002%. Seems low, but it just failed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/711505 (https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php73-noselenium-docker/48968/console), so I guess I'm pretty unlucky (or my math is wrong).

It was added in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/659060 about a year ago. There are also a few other tests that look similar, but I didn't try to figure out if they also have this issue.

Failure for reference:

23:40:11 FAILED TESTS:
23:40:11   ext.eventLogging/stream
23:40:11     ✖ streamInSample() - session sampling resets
23:40:11       Chrome Headless 90.0.4430.212 (Linux x86_64)
23:40:11     Expected: 20
23:40:11     Actual: 20
23:40:11         at Object.<anonymous> (http://localhost:9876/load.php?modules=oojs-ui-core.icons%2Cstyles%7Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Cpapaparse%2Crangefix%2Csinonjs%2Cspark-md5%2CtreeDiffer%2Cunicodejs%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-accessibility%2Cicons-alerts%2Cicons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cicons-editing-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-layout%2Cicons-media%2Cicons-moderation%2Cicons-movement%2Cicons-user%2Cicons-wikimedia%2Cindicators%7Cskins.minerva.messageBox.styles%7Csocket.io%7Ctest.CentralAuth%2CCite%2CDiscussionTools%2CEcho%2CEventLogging%2CGuidedTour%2CMediaWiki%2CMinervaNeue%2CTemplateData%2CVisualEditor%2CWikiEditor%2CWikimediaEvents&version=1q8ym:2672:216)
23:40:11         at runTest (node_modules/qunit/qunit/qunit.js:2496:35)
23:40:11         at Test.run (node_modules/qunit/qunit/qunit.js:2479:9)
23:40:11         at node_modules/qunit/qunit/qunit.js:2770:16
23:40:11         at processTaskQueue (node_modules/qunit/qunit/qunit.js:2051:26)
23:40:11         at node_modules/qunit/qunit/qunit.js:2055:13

(Also, the failure message states the opposite of what it should – the expected value is anything but 20. This looks like a QUnit issue though.)

Event Timeline

phuedx triaged this task as Medium priority. Nov 15 2023, 3:16 PM

Dang, I wish there were a way to set a seed in JS.

Okay, so there are a few components being tested here:

  • A session ID can be generated (by this algorithm) and accessed
  • That session ID can be turned into a number between 0 and 1 by normalizeId() (which is just parseInt( id.slice( 0, 8 ), 16 ) / UINT32_MAX and has been confirmed to have the desired properties)
  • We can do a basic inequality check against the rate (see the sketch just after this list)
  • If we repeat this process multiple times, hopefully not all of the results are in (or all out of) sample
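For concreteness, here's a minimal sketch of the first three pieces, reconstructed from the details in this task. This is not the actual extension source; the UINT32_MAX constant and the < comparison against the rate are taken from the descriptions above.

// Not the actual EventLogging source – a reconstruction from the details in this task.
var UINT32_MAX = 4294967295; // 2^32 - 1

function normalizeId( id ) {
	// First 8 hex characters of the session ID, mapped onto [0, 1]
	return parseInt( id.slice( 0, 8 ), 16 ) / UINT32_MAX;
}

function inSample( id, rate ) {
	// The algorithm uses a plain < comparison against the configured rate
	return normalizeId( id ) < rate;
}

// e.g. inSample( '0123456789abcdef0123', 0.5 ) === true (the ID normalizes to ~0.0044)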

I'm wondering if this unit test has much value and should be removed. (Gasp!)

The alternative would be to use statistics since this is a stochastic process. We could implement a confidence interval-based solution (happy to elaborate on that if needed) but it would be VERY computationally expensive – essentially requiring 20K session resets per run.
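To make that concrete, here is my rough assumption of what a single-run, simplified version of the confidence-interval check could look like. This is only an illustration, not a worked-out design, and it would presumably have to be manual or skipped in CI:

QUnit.test.skip( 'streamInSample() - observed rate matches configured rate', function ( assert ) {
	var conf = { sample: { rate: 0.5, unit: 'session' } },
		n = 20000,
		hits = 0,
		i, p, margin;

	for ( i = 0; i < n; i++ ) {
		if ( mw.eventLog.streamInSample( conf ) ) {
			hits++;
		}
		mw.eventLog.id.resetSessionId();
	}

	// Normal-approximation confidence interval; z ≈ 3.89 gives a ~99.99% interval,
	// so a correct implementation should fail this only extremely rarely.
	p = hits / n;
	margin = 3.89 * Math.sqrt( p * ( 1 - p ) / n );
	assert.true( Math.abs( p - conf.sample.rate ) <= margin );
} );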

From @phuedx on Slack:

> I wonder how valuable a test it is. Perhaps we could break it out into a couple of unit tests with stubs for the session ID generator and an (expensive?) test that tests it end-to-end that can be run manually?

Yeah, it's not a valuable test as-is. I was also thinking of providing some session IDs that we know to be (see the sketch just after this list):

  • in-sample at 50%
  • out-of-sample at 50%
  • in-sample at 1% and double checking that the same ID is also in-sample at 5%, 10%, 50% (I have previously verified that the algorithm behaves in a way that this property is satisfied)
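A hypothetical version of those stub-based tests. The stubbed method name and the this.sandbox stubbing mechanism are assumptions on my part, not the actual EventLogging/MediaWiki API; the point is just to feed streamInSample() session IDs whose normalized values we already know:

QUnit.test( 'streamInSample() - known session IDs', function ( assert ) {
	var conf = { sample: { rate: 0.5, unit: 'session' } },
		// parseInt( '00000001', 16 ) / UINT32_MAX ≈ 2.3e-10 → in-sample at 1%, 5%, 10%, 50%
		inSampleId = '00000001aaaaaaaaaaaa',
		// parseInt( 'c0000000', 16 ) / UINT32_MAX ≈ 0.75 → out-of-sample at 50%
		outOfSampleId = 'c0000000aaaaaaaaaaaa',
		// 'generateSessionId' is a placeholder for whatever the id module actually calls
		stub = this.sandbox.stub( mw.eventLog.id, 'generateSessionId' );

	stub.returns( inSampleId );
	mw.eventLog.id.resetSessionId();
	assert.ok( mw.eventLog.streamInSample( conf ), 'known low ID is in-sample at 50%' );

	stub.returns( outOfSampleId );
	mw.eventLog.id.resetSessionId();
	assert.notOk( mw.eventLog.streamInSample( conf ), 'known high ID is out-of-sample at 50%' );
} );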

Because yeah, an end-to-end test tests:

  • ID generation algorithm produces an ID that has the properties we desire / behaves in the way that we intend
  • ID processing algorithm works with the generated ID in the way we desire/intend

And the properties/behavior we want to test are something that we need a lot of simulated sessions per run, and a lot of simulated runs, to verify. Do we have that option? To have tests that aren't run by CI, only manually?

> Do we have that option? To have tests that aren't run by CI, only manually?

We do! We can write a QUnit test in its own file and mark it as skipped. If we want to run it, then we remove the mark, e.g.

QUnit.test.skip( 'mw.user.generateUserSessionID(): ...', ( assert ) => {
} );

// And if we want to run it...

QUnit.test( 'mw.user.generateUserSessionID(): ...', ( assert ) => {
} );

Alternatively, we can write a test in Node to be run via the CLI.
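For example, a manual CLI-only check could be a standalone script using Node's built-in test runner (node:test, Node 18+). The ID generation and normalizeId() below are local stand-ins written for this sketch, not imports from the extension:

const test = require( 'node:test' );
const assert = require( 'node:assert' );
const crypto = require( 'crypto' );

const UINT32_MAX = 4294967295;
const normalizeId = ( id ) => parseInt( id.slice( 0, 8 ), 16 ) / UINT32_MAX;
const randomHexId = () => crypto.randomBytes( 10 ).toString( 'hex' ); // 20 hex chars

test( 'roughly half of random session IDs are in-sample at rate 0.5', () => {
	const n = 100000;
	let hits = 0;
	for ( let i = 0; i < n; i++ ) {
		if ( normalizeId( randomHexId() ) < 0.5 ) {
			hits++;
		}
	}
	// Loose bound: with n = 100000 the observed proportion should sit well within ±0.01
	assert.ok( Math.abs( hits / n - 0.5 ) < 0.01 );
} );

It would be run by hand with node (or node --test) and never wired into CI.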

Awesome! Okay, when I originally played around to get a sense of how expensive a procedure it might need to be, I generated random numbers, obtained the proportions "in-sample", calculated confidence intervals, and checked how many contained the true proportion. I just thought of a much less expensive procedure that might actually be feasible in a CI setup! (Although it might still be expensive enough that we'd want to keep it manual.)

Properties/behavior we care about:

  1. If a session is in-sample at 1% rate, it should be in-sample at 5%, 10%, 25%, etc.
  2. If rate is p, approx p of all sessions would be in-sample. This won't be exact but the more sessions there are (n → ∞), the sample proportion → p
  • Generate N session IDs (e.g. N=1000)
  • Convert them to decimals via normalizeId, yielding N numbers between 0 and 1
    • Because our algorithm uses <, property #1 will always hold.
    • Now we just need to worry about property #2
  • We don't actually need to test if ~10% of sessions are determined to be in sample when sample rate is 0.1, etc. – INSTEAD we just need to verify that the distribution of those N numbers (obtained from "normalizing" the randomly generated session IDs) is approximately Uniform(0,1).

We can use https://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/inference/KolmogorovSmirnovTest.html in the Java implementation

We can create our own in the Swift implementation and use SciPy's implementation (or the aforementioned Java or the JS versions) for reference.
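For the JS side, a self-contained sketch of that uniformity check might look something like the following. The inline KS statistic, the local normalizeId() copy, and the use of mw.user.generateRandomSessionId() are assumptions for illustration, not the extension's actual wiring:

// One-sample Kolmogorov–Smirnov statistic of `values` against Uniform(0, 1):
// the largest gap between the empirical CDF and the identity CDF.
function ksStatisticUniform( values ) {
	var sorted = values.slice().sort( function ( a, b ) {
			return a - b;
		} ),
		n = sorted.length,
		d = 0,
		i;

	for ( i = 0; i < n; i++ ) {
		d = Math.max( d, ( i + 1 ) / n - sorted[ i ], sorted[ i ] - i / n );
	}
	return d;
}

// Marked as skipped so it only runs manually, per the discussion above.
QUnit.test.skip( 'normalized session IDs look Uniform(0,1)', function ( assert ) {
	var N = 1000,
		values = [],
		i;

	// Local copy of normalizeId() as described earlier in this task:
	// parseInt of the first 8 hex chars, divided by UINT32_MAX.
	function normalizeId( id ) {
		return parseInt( id.slice( 0, 8 ), 16 ) / 4294967295;
	}

	for ( i = 0; i < N; i++ ) {
		values.push( normalizeId( mw.user.generateRandomSessionId() ) );
	}

	// Asymptotic 5% critical value for the KS test: ~1.358 / sqrt(N)
	assert.true( ksStatisticUniform( values ) < 1.358 / Math.sqrt( N ) );
} );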

> Properties/behavior we care about:
>
>   1. If a session is in-sample at 1% rate, it should be in-sample at 5%, 10%, 25%, etc.

This isn't tested in either the JS or the Java Metrics Platform Client. I'll create a task.

>   2. If rate is p, approx p of all sessions would be in-sample. This won't be exact but the more sessions there are (n → ∞), the sample proportion → p
>   • Generate N session IDs (e.g. N=1000)
>   • Convert them to decimals via normalizeId, yielding N numbers between 0 and 1
>     • Because our algorithm uses <, property #1 will always hold.
>     • Now we just need to worry about property #2
>   • We don't actually need to test if ~10% of sessions are determined to be in sample when sample rate is 0.1, etc. – INSTEAD we just need to verify that the distribution of those N numbers (obtained from "normalizing" the randomly generated session IDs) is approximately Uniform(0,1).
>
> We can use https://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/inference/KolmogorovSmirnovTest.html in the Java implementation
>
> We can create our own in the Swift implementation and use SciPy's implementation (or the aforementioned Java or the JS versions) for reference.

I'll create a task for this (maybe an epic with a subtask per implementation). For the JS implementation, we should also be wary of the execution environment, i.e. we should have a test for Node and a test that can be run in a browser.

Once these tasks are Done™, then we can remove this test.

phuedx claimed this task.

AIUI this is Done™ so I'm being bold and closing this task as Resolved.

The JS-specific subtasks of T355052: [EPIC] Increase coverage of sampling tests are Done™ and supersede the original test. The original test was deleted in rEEVL801d4b553c81: Integrate mediawiki/libs/metrics-platform.