
Identify pages to be bucketed in page schema linked data A/B test
Closed, Resolved · Public · 0 Story Points

Description

Per T208755, pages will be bucketed in a 50/50 A/B test. Assume a 1% sampling rate and identify some pages that would be sampled and in the "treatment" bucket across wikis. 500+ pages each for at least two wikis per group should be identified. Note that as the sample rate is increased, these pages will remain in their "treatment" bucket. In human form, the query might look like:

In the page table, for pages in the main namespace with page_random in [.5, .505), print page_id, page_random, and page_title, order by popularity, limit 10000.

If you think of other fields useful for testing or debugging, feel free to include them.

Since you're in the database, do the same to generate a list of pages in the "control" bucket assuming a 100% sampling rate.
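
The [.5, .505) range used in the query above, and the matching range for "control" at 100% sampling, can be derived from the bucket order and the sampling ratio. The following is a minimal PHP sketch, assuming that buckets divide [0, 1) evenly in config order and that sampled pages occupy the low end of each bucket's slice. sampledRange() is a hypothetical helper written for illustration and is not part of the extension.

```php
<?php
/**
 * Hypothetical helper: returns the half-open page_random interval
 * [lower, upper) that is both sampled and assigned to $bucket.
 * Assumes an even split of [0, 1) across buckets in config order.
 */
function sampledRange( float $samplingRatio, array $buckets, string $bucket ): array {
	$index = array_search( $bucket, $buckets, true );
	if ( $index === false ) {
		throw new InvalidArgumentException( "Unknown bucket: $bucket" );
	}
	$width = 1 / count( $buckets );
	$lower = $index * $width;
	return [ $lower, $lower + $samplingRatio * $width ];
}

$buckets = [ 'control', 'treatment' ];
// "treatment" at 1% sampling.
vprintf( "treatment @ 1%%: [%.3f, %.3f)\n", sampledRange( 0.01, $buckets, 'treatment' ) );
// "control" at 100% sampling.
vprintf( "control @ 100%%: [%.3f, %.3f)\n", sampledRange( 1.0, $buckets, 'control' ) );
```

Swapping the order of the buckets in the array swaps the ranges, which is why the bucket config order affects the query.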

The results are available in the Quarry links below (and can be downloaded as CSV from there if needed).

Note that the order of the bucket config matters and affects the query as noted in T208763. The page_random range for sampled new treatments was verified for the following config:

$wgWBClientSettings[ 'pageSchemaNamespaces' ] = [ 0 ];
$wgWBClientSettings[ 'pageSchemaSplitTestSamplingRatio' ] = 0.01;
$wgWBClientSettings[ 'pageSchemaSplitTestBuckets' ] = [ 'control', 'treatment' ];

With the following test:

	public function testScenarioAb1() {
		// "control" / "treatment" A/B test with 1% sampling.
		$sampling = 0.01;
		$buckets = [ /*A*/ 'control', /*B*/ 'treatment' ];
		$subject = new PageSplitTester( $sampling, $buckets );

		// Supply page_random at different values. [0, .005) and [.5, .505) are sampled,
		// [.005, .5) and [.505, 1) are unsampled.
		$this->assertEquals( true, $subject->isSampled( 0.000 ) ); // Sampled
		$this->assertEquals( true, $subject->isSampled( 0.001 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.002 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.003 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.004 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.005 ) ); // Unsampled
		$this->assertEquals( false, $subject->isSampled( 0.008 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.009 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.010 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.011 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.012 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.013 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.015 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.018 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.019 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.100 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.200 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.490 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.500 ) ); // Sampled
		$this->assertEquals( true, $subject->isSampled( 0.501 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.502 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.503 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.504 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.505 ) ); // Unsampled
		$this->assertEquals( false, $subject->isSampled( 0.508 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.509 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.510 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.800 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.999 ) ); // ''

		// Supply page_random at different values. [0, .5) are "control", [.5, 1) are "treatment".
		$this->assertEquals( 'control', $subject->getBucket( 0.000 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.001 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.002 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.003 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.004 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.005 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.008 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.009 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.010 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.011 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.012 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.013 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.015 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.018 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.019 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.100 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.200 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.490 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.500 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.501 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.502 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.503 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.504 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.505 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.508 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.509 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.510 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.800 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.999 ) );

		// Thus, pages sampled at 1% in "treatment" may be found for page_random in [.5, .505).
	}

	public function testScenarioAb5() {
		// "control" / "treatment" A/B test with 5% sampling.
		$sampling = 0.05;
		$buckets = [ /*A*/ 'control', /*B*/ 'treatment' ];
		$subject = new PageSplitTester( $sampling, $buckets );

		// Supply page_random at different values. [0, .025) and [.5, .525) are sampled,
		// [.025, .5) and [.525, 1) are unsampled.
		$this->assertEquals( true, $subject->isSampled( 0.000 ) ); // Sampled
		$this->assertEquals( true, $subject->isSampled( 0.001 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.002 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.003 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.004 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.005 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.008 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.009 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.010 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.011 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.012 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.013 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.015 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.018 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.019 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.024 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.025 ) ); // Unsampled
		$this->assertEquals( false, $subject->isSampled( 0.100 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.200 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.490 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.500 ) ); // Sampled
		$this->assertEquals( true, $subject->isSampled( 0.501 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.502 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.503 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.504 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.505 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.508 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.509 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.510 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.524 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.525 ) ); // Unsampled
		$this->assertEquals( false, $subject->isSampled( 0.800 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.999 ) ); // ''

		// Supply page_random at different values. [0, .5) are "control", [.5, 1) are "treatment".
		$this->assertEquals( 'control', $subject->getBucket( 0.000 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.001 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.002 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.003 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.004 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.005 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.008 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.009 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.010 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.011 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.012 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.013 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.015 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.018 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.019 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.100 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.200 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.490 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.500 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.501 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.502 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.503 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.504 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.505 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.508 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.509 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.510 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.800 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.999 ) );

		// Thus, pages sampled at 5% in "treatment" may be found for page_random in [.5, .525).
	}

	public function testScenarioAb25() {
		// "control" / "treatment" A/B test with 25% sampling.
		$sampling = 0.25;
		$buckets = [ /*A*/ 'control', /*B*/ 'treatment' ];
		$subject = new PageSplitTester( $sampling, $buckets );

		// Supply page_random at different values. [0, .125) and [.5, .625) are sampled,
		// [.125, .5) and [.625, 1) are unsampled.
		$this->assertEquals( true, $subject->isSampled( 0.000 ) ); // Sampled
		$this->assertEquals( true, $subject->isSampled( 0.001 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.002 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.003 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.004 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.005 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.008 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.009 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.010 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.011 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.012 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.013 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.015 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.018 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.019 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.024 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.025 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.100 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.124 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.125 ) ); // Unsampled
		$this->assertEquals( false, $subject->isSampled( 0.200 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.490 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.500 ) ); // Sampled
		$this->assertEquals( true, $subject->isSampled( 0.501 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.502 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.503 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.504 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.505 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.508 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.509 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.510 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.524 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.525 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.624 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.625 ) ); // Unsampled
		$this->assertEquals( false, $subject->isSampled( 0.800 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.999 ) ); // ''

		// Supply page_random at different values. [0, .5) are "control", [.5, 1) are "treatment".
		$this->assertEquals( 'control', $subject->getBucket( 0.000 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.001 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.002 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.003 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.004 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.005 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.008 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.009 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.010 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.011 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.012 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.013 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.015 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.018 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.019 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.100 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.200 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.490 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.500 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.501 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.502 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.503 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.504 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.505 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.508 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.509 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.510 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.800 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.999 ) );

		// Thus, pages sampled at 25% in "treatment" may be found for page_random in [.5, .625).
	}

	public function testScenarioAb50() {
		// "control" / "treatment" A/B test with 50% sampling.
		$sampling = 0.5;
		$buckets = [ /*A*/ 'control', /*B*/ 'treatment' ];
		$subject = new PageSplitTester( $sampling, $buckets );

		// Supply page_random at different values. [0, .25) and [.5, .75) are sampled,
		// [.25, .5) and [.75, 1) are unsampled.
		$this->assertEquals( true, $subject->isSampled( 0.000 ) ); // Sampled
		$this->assertEquals( true, $subject->isSampled( 0.001 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.002 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.003 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.004 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.005 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.008 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.009 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.010 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.011 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.012 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.013 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.015 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.018 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.019 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.024 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.025 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.100 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.124 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.125 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.200 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.24 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.25 ) ); // Unsampled
		$this->assertEquals( false, $subject->isSampled( 0.490 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.500 ) ); // Sampled
		$this->assertEquals( true, $subject->isSampled( 0.501 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.502 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.503 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.504 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.505 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.508 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.509 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.510 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.524 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.525 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.624 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.625 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.74 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.75 ) ); // Unsampled
		$this->assertEquals( false, $subject->isSampled( 0.800 ) ); // ''
		$this->assertEquals( false, $subject->isSampled( 0.999 ) ); // ''

		// Supply page_random at different values. [0, .5) are "control", [.5, 1) are "treatment".
		$this->assertEquals( 'control', $subject->getBucket( 0.000 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.001 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.002 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.003 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.004 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.005 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.008 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.009 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.010 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.011 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.012 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.013 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.015 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.018 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.019 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.100 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.200 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.490 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.500 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.501 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.502 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.503 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.504 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.505 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.508 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.509 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.510 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.800 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.999 ) );

		// Thus, pages sampled at 50% in "treatment" may be found for page_random in [.5, .75).
	}

	public function testScenarioAb100() {
		// "control" / "treatment" A/B test with 100% sampling.
		$sampling = 1;
		$buckets = [ /*A*/ 'control', /*B*/ 'treatment' ];
		$subject = new PageSplitTester( $sampling, $buckets );

		// Supply page_random at different values. [0, 1) are sampled.
		$this->assertEquals( true, $subject->isSampled( 0.000 ) ); // Sampled
		$this->assertEquals( true, $subject->isSampled( 0.001 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.002 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.003 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.004 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.005 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.008 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.009 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.010 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.011 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.012 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.013 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.015 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.018 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.019 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.024 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.025 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.100 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.124 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.125 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.200 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.24 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.25 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.490 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.500 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.501 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.502 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.503 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.504 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.505 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.508 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.509 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.510 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.524 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.525 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.624 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.625 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.74 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.75 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.800 ) ); // ''
		$this->assertEquals( true, $subject->isSampled( 0.999 ) ); // ''

		// Supply page_random at different values. [0, .5) are "control", [.5, 1) are "treatment".
		$this->assertEquals( 'control', $subject->getBucket( 0.000 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.001 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.002 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.003 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.004 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.005 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.008 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.009 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.010 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.011 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.012 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.013 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.015 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.018 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.019 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.100 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.200 ) );
		$this->assertEquals( 'control', $subject->getBucket( 0.490 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.500 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.501 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.502 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.503 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.504 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.505 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.508 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.509 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.510 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.800 ) );
		$this->assertEquals( 'treatment', $subject->getBucket( 0.999 ) );

		// Thus, pages sampled at 100% in "treatment" may be found for page_random in [.5, 1).
	}

Please review the assertions of this test prior to executing the query to ensure that correct values are used for page_random.
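
To make the assertions easier to review, here is a small stand-in that reproduces the behaviour they describe. It is only a sketch written to match the test: the class name SketchPageSplitTester is hypothetical, and the real PageSplitTester may be implemented differently. It assumes page_random is scaled by the number of buckets, the integer part selects the bucket, and the fractional part is compared with the sampling ratio.

```php
<?php
/**
 * Illustrative stand-in for PageSplitTester, written only to match the
 * assertions in the test; not the extension's actual code.
 */
class SketchPageSplitTester {
	private $samplingRatio;
	private $buckets;

	public function __construct( float $samplingRatio, array $buckets ) {
		$this->samplingRatio = $samplingRatio;
		$this->buckets = $buckets;
	}

	/** Scale page_random from [0, 1) to [0, number of buckets). */
	private function scale( float $pageRandom ): float {
		return $pageRandom * count( $this->buckets );
	}

	/** The integer part of the scaled value selects the bucket. */
	public function getBucket( float $pageRandom ): string {
		return $this->buckets[(int)$this->scale( $pageRandom )];
	}

	/** The fractional part of the scaled value decides sampling. */
	public function isSampled( float $pageRandom ): bool {
		$scaled = $this->scale( $pageRandom );
		return $scaled - floor( $scaled ) < $this->samplingRatio;
	}
}

$subject = new SketchPageSplitTester( 0.01, [ 'control', 'treatment' ] );
var_dump( $subject->getBucket( 0.503 ) ); // "treatment"
var_dump( $subject->isSampled( 0.503 ) ); // true
var_dump( $subject->isSampled( 0.8 ) );   // false
```

In this sketch, values that sit exactly on a boundary (0.505 at 1% sampling, for example) depend on floating-point rounding, so it is best checked with values away from the boundaries.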


Since you're in there already, might as well report any unexpected values for page_random too:

In the page table, for pages in the main namespace where page_random is false, null, or otherwise not in [0, 1), print the page_id, page_random, and page_title, order by popularity, limit 10000
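
The "unexpected value" condition can also be stated as a predicate, which is handy when checking rows that have already been exported (from a CSV download, say). This is a sketch only: isUnexpectedPageRandom() and the sample rows are hypothetical and do not come from any wiki.

```php
<?php
/**
 * Hypothetical check: a page_random value is unexpected if it is false,
 * null, non-numeric, NaN, or outside the half-open interval [0, 1).
 */
function isUnexpectedPageRandom( $value ): bool {
	if ( $value === false || $value === null || !is_numeric( $value ) ) {
		return true;
	}
	$number = (float)$value;
	return is_nan( $number ) || $number < 0.0 || $number >= 1.0;
}

// Made-up rows for illustration; database drivers often return numbers as strings.
$rows = [
	[ 'page_id' => 1, 'page_random' => '0.25' ],
	[ 'page_id' => 2, 'page_random' => null ],
	[ 'page_id' => 3, 'page_random' => '1.5' ],
];
$bad = array_filter( $rows, function ( array $row ) {
	return isUnexpectedPageRandom( $row['page_random'] );
} );
print_r( array_column( $bad, 'page_id' ) ); // page IDs 2 and 3
```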

If there are unexpected values in the database, this may affect the A/B test as well as logstash errors reported.

No known bad values detected on wiki spot check, but the queries ran suspiciously fast:

Developer notes

  • What does a developer need to be able to have access to these database tables? (please add link to wiki page)
  • Please explicitly list the outcomes as a checklist in "sign off steps"/"outputted documents"

Results

Please use these results when you need to find a page that's been both bucketed in the new treatment _and_ sampled. As the sample rate increases from 0% to 1%, 1% to 5%, 5% to 25%, 25% to 50%, and 50% to 100%, you should see these pages transition from the old version (no SEO linked data) to the new treatment (SEO linked data is shown).
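
To predict at which rollout step a given page should switch to the new treatment, the position of page_random within its bucket can be compared with each sampling ratio in turn. The sketch below assumes the same layout as the test in the description (even split, sampled pages at the low end of each bucket's slice); firstSampledAt() is a hypothetical helper.

```php
<?php
/**
 * Hypothetical helper: returns the first sampling ratio in $steps at which
 * a page with the given page_random is sampled, or null if none applies.
 * Because the sampled region only grows, the page stays sampled afterwards.
 */
function firstSampledAt( float $pageRandom, int $bucketCount, array $steps ): ?float {
	$scaled = $pageRandom * $bucketCount;
	// Position within the bucket's slice, in [0, 1).
	$position = $scaled - floor( $scaled );
	foreach ( $steps as $ratio ) {
		if ( $position < $ratio ) {
			return $ratio;
		}
	}
	return null;
}

$steps = [ 0.01, 0.05, 0.25, 0.5, 1.0 ];
var_dump( firstSampledAt( 0.503, 2, $steps ) ); // 0.01
var_dump( firstSampledAt( 0.56, 2, $steps ) );  // 0.25
```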

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". · Nov 5 2018, 9:02 PM
Restricted Application added a subscriber: Aklapper.
Jdlrobson updated the task description. · Nov 5 2018, 9:35 PM

This feels like a timeboxed task (spike?)
I've added some developer notes - do you know the answers?

For starters, here is a query for 500 example pages on enwiki: https://quarry.wmflabs.org/query/30989
(Those are not yet selected/ordered by popularity - that requires a join with (e.g.) the pageview_hourly table, which may be the trickier part; trying that out should be part of these preparations too.)

@Niedzielski, @Tbayer, @mpopov (sorry for tagging everyone). Since this task is only for analysis purposes, do we want to order these by popularity/does popularity matter in this context?

ovasileva raised the priority of this task from Normal to High. · Nov 5 2018, 10:31 PM

do we want to order these by popularity/does popularity matter in this context?

@ovasileva, I only suggested the popularity ordering because I thought it would be easy to perform with the query and we'd need it eventually. It will be a necessary consideration when the results of the A/B test are considered but not for QAing and any debugging. The list @Tbayer generated is perfect for QAing (and any debugging of) enwiki so I think we've got a great start for this task. I see that I have access to fork the query so there shouldn't be any issue identifying pages needed for whatever wikis we want.

(Those are not yet selected/ordered by popularity - that requires a join with (e.g.) the pageview_hourly table, which may be the trickier part; trying that out should be part of these preparations too.)

@Tbayer, given that you've already generated a nice list of test pages and we really only need the popularity for the analysis, I propose we either punt on this effort or split it into a new task that doesn't block the launch.

might as well report any unexpected values for page_random too

I didn't encounter NULL values or any number outside the [0, 1) interval so far.
However, I'm starting to worry about how random it really is. I spot-checked its distribution in the main namespace of three wikis (ruwiki [1], ptwiki [2], enwiki [3]): on each, the first three 1% buckets (i.e. [0, 0.01), [0.01, 0.02) and [0.02, 0.03)) each had a larger frequency than almost all of the remaining 97 buckets.

Could we review the underlying code that generates (generated) the page_random values?

[1] https://quarry.wmflabs.org/query/30994
[2] https://quarry.wmflabs.org/query/30990

[3]
SELECT bucket, COUNT(*) AS frequency FROM (
  SELECT FLOOR(page_random * 100) / 100 AS bucket
  FROM  wmf_raw.mediawiki_page 
  WHERE wiki_db = 'enwiki'
  AND snapshot = '2018-09'
  AND page_namespace = 0
) AS buckets
GROUP BY bucket
ORDER BY bucket LIMIT 10000;

bucket	frequency
0.0	156947
0.01	151517
0.02	150887
0.03	150177
0.04	149290
0.05	148273
0.06	147363
0.07	146529
0.08	146070
0.09	145995
0.1	145743
0.11	145684
0.12	145197
0.13	144760
0.14	144703
0.15	144160
0.16	144083
0.17	145004
0.18	143585
0.19	144050
0.2	143386
0.21	143811
0.22	143232
0.23	143367
0.24	142930
0.25	142882
0.26	142725
0.27	142190
0.28	142185
0.29	142154
0.3	142369
0.31	142139
0.32	142125
0.33	141646
0.34	142024
0.35	141896
0.36	141553
0.37	141536
0.38	141707
0.39	141978
0.4	140912
0.41	140932
0.42	141410
0.43	140862
0.44	140706
0.45	141459
0.46	141617
0.47	140371
0.48	140833
0.49	140851
0.5	140435
0.51	140388
0.52	140208
0.53	139851
0.54	140505
0.55	139620
0.56	139881
0.57	139960
0.58	140019
0.59	139932
0.6	139655
0.61	139713
0.62	140030
0.63	139718
0.64	139711
0.65	139081
0.66	139838
0.67	139496
0.68	139205
0.69	139758
0.7	139739
0.71	139696
0.72	139097
0.73	139546
0.74	138534
0.75	139140
0.76	138925
0.77	138826
0.78	139153
0.79	139243
0.8	138465
0.81	138886
0.82	138810
0.83	138849
0.84	139113
0.85	138239
0.86	139016
0.87	138706
0.88	138207
0.89	138164
0.9	138773
0.91	138179
0.92	138090
0.93	137633
0.94	138662
0.95	137974
0.96	138549
0.97	137218
0.98	138304
0.99	138314
Time taken: 377.376 seconds, Fetched: 100 row(s)

I apologize, but I missed this critical bug in my earlier code search, which unfortunately uses RAND(), and I'm guessing that's where the discrepancy is coming from. Here's the implementation for wfRandom(), which uses two calls to mt_rand() but is only used to generate page_random for newly inserted pages.

How do you think we should handle this? Can we consider pages prior to T5946 differently? Should we seek a new (or additional) source of randomness such as page ID instead?

I apologize, but I missed this critical bug in my earlier code search, which unfortunately uses RAND(), and I'm guessing that's where the discrepancy is coming from. Here's the implementation for wfRandom(), which uses two calls to mt_rand() but is only used to generate page_random for newly inserted pages.

Great find! So is it reasonable to assume that the current implementation of wfRandom() was used to generate page_random for all pages created after T5946 was fixed? (It looks like that fix and the RAND() backfill were completed in December 2005.)

In any case, the distribution of page_random does indeed look much more plausible when the above query for enwiki is limited to pages with ID > 5,000,000 (which dates from spring 2006, i.e. well after the fix).[1] Conversely, for pages with ID < 3,000,000 (i.e. created up to fall 2005), it looks quite un-random.[2]
(As an aside, these are of course just spot checks rather than systematic randomness tests, but they seem good enough for now as due diligence for issues like this.)
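
If a more systematic check is wanted later, one common option (an assumption on my part, not something used in this task) is a chi-squared goodness-of-fit test of the per-bucket counts against a uniform distribution. The sketch below computes only the statistic and the degrees of freedom; the function name and the toy counts are illustrative and are not the query results above.

```python
def chi_square_uniform(counts):
    """Return (statistic, degrees_of_freedom) for a chi-squared
    goodness-of-fit test of `counts` against a uniform distribution."""
    if len(counts) < 2:
        raise ValueError("need at least two buckets")
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    expected = total / len(counts)  # same expected count in every bucket
    statistic = sum((observed - expected) ** 2 / expected for observed in counts)
    return statistic, len(counts) - 1


if __name__ == "__main__":
    # Illustrative toy counts only (NOT taken from the queries in this task).
    flat = [1000, 1010, 990, 1005, 995]
    skewed = [1400, 1100, 950, 800, 750]
    for name, counts in (("flat", flat), ("skewed", skewed)):
        stat, dof = chi_square_uniform(counts)
        print(f"{name}: chi2={stat:.2f}, dof={dof}")
```

For a truly uniform source the statistic has an expected value equal to the degrees of freedom, so a value far above that (as with a steadily declining histogram) points to non-uniformity. Feeding in the 100 frequency values from a query would give 99 degrees of freedom.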

How do you think we should handle this? Can we consider pages prior to T5946 differently?

A crude but effective solution might be to just leave them out of the sample for now, i.e. only study the effect of the sameAs change on pages that were created after T5946 was fixed, say from January 2006 on. Obviously this would leave out a great many important, high-traffic pages, and we would not be able to estimate the size of the overall effect.

Should we seek a new (or additional) source of randomness such as page ID instead?

As discussed, page ID comes with its own randomness vagaries. Also, since T208755 now calls for a gradual rollout, the simple even vs. odd page ID solution is out of the question anyway. Combining page_random with page_id using a suitable hash function (and hoping that the result will be random enough for our purposes) might be a solution, albeit not a pretty one.
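
To make the hash idea concrete, here is a minimal sketch, assuming SHA-256 as the "suitable hash function" and a hypothetical helper name `hashed_random`; nothing like this exists in the current implementation. It maps a (page_id, page_random) pair to a deterministic value in [0, 1) that could stand in for page_random when bucketing.

```python
import hashlib


def hashed_random(page_id: int, page_random: float) -> float:
    """Map (page_id, page_random) to a deterministic value in [0, 1).

    Hypothetical helper: SHA-256 is an assumed choice of hash.
    """
    # Note: the textual form of page_random must be identical wherever this
    # is computed, otherwise the same page could land in different buckets.
    key = f"{page_id}:{page_random!r}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep the top 53 bits so the quotient is exactly representable as a float.
    top53 = int.from_bytes(digest[:8], "big") >> 11
    return top53 / (1 << 53)


if __name__ == "__main__":
    # Made-up page IDs and page_random values, for illustration only.
    for page_id, page_random in ((12, 0.000123), (12345, 0.000123), (12345, 0.7)):
        print(page_id, page_random, hashed_random(page_id, page_random))
```

Because the output depends only on stored values, a page would keep its bucket as the sampling rate is increased, which is the property the gradual rollout needs. Whether the result is "random enough" would still have to be checked empirically.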

In any case, once the above diagnosis is confirmed, we should file a task for replacing these pre-2006 RAND()-based values with wfRandom() output. Their non-randomness looks like a clear bug with non-negligible long-term consequences.

[1]
SELECT bucket, COUNT(*) AS frequency FROM (
  SELECT FLOOR(page_random * 100) / 100 AS bucket
  FROM wmf_raw.mediawiki_page
  WHERE wiki_db = 'enwiki'
  AND snapshot = '2018-09'
  AND page_namespace = 0
  AND page_id > 5000000
) AS buckets
GROUP BY bucket
ORDER BY bucket LIMIT 10000;

bucket	frequency
0.0	122028
0.01	121397
0.02	122179
0.03	122417
0.04	122392
0.05	122321
0.06	121682
0.07	121489
0.08	121873
0.09	121954
0.1	122264
0.11	122114
0.12	121835
0.13	121878
0.14	121783
0.15	121767
0.16	121677
0.17	122601
0.18	121797
0.19	122147
0.2	121682
0.21	122329
0.22	121824
0.23	122239
0.24	122021
0.25	121814
0.26	122085
0.27	121446
0.28	121494
0.29	121787
0.3	121722
0.31	121985
0.32	121987
0.33	121624
0.34	122042
0.35	121940
0.36	121811
0.37	121827
0.38	122309
0.39	122401
0.4	121304
0.41	121588
0.42	122095
0.43	121891
0.44	121609
0.45	122360
0.46	122565
0.47	121763
0.48	121991
0.49	121992
0.5	121727
0.51	122006
0.52	121660
0.53	121486
0.54	121987
0.55	121206
0.56	121678
0.57	121759
0.58	121648
0.59	122148
0.6	121536
0.61	121851
0.62	122088
0.63	121968
0.64	121845
0.65	121256
0.66	121985
0.67	121795
0.68	121282
0.69	122168
0.7	122133
0.71	122245
0.72	121611
0.73	122030
0.74	121374
0.75	121622
0.76	121723
0.77	121664
0.78	122007
0.79	122039
0.8	121316
0.81	121901
0.82	121577
0.83	121523
0.84	122023
0.85	121544
0.86	122103
0.87	121657
0.88	121437
0.89	121488
0.9	121942
0.91	121473
0.92	121371
0.93	120990
0.94	121903
0.95	121451
0.96	121968
0.97	120949
0.98	121821
0.99	121803
Time taken: 195.485 seconds, Fetched: 100 row(s)

[2]
SELECT bucket, COUNT(*) AS frequency FROM (
  SELECT FLOOR(page_random * 100) / 100 AS bucket
  FROM wmf_raw.mediawiki_page
  WHERE wiki_db = 'enwiki'
  AND snapshot = '2018-09'
  AND page_namespace = 0
  AND page_id < 3000000
) AS buckets
GROUP BY bucket
ORDER BY bucket LIMIT 10000;

bucket	frequency
0.0	28925
0.01	24084
0.02	22576
0.03	21647
0.04	20829
0.05	19893
0.06	19623
0.07	19020
0.08	18168
0.09	18082
0.1	17457
0.11	17494
0.12	17341
0.13	16941
0.14	16918
0.15	16454
0.16	16340
0.17	16320
0.18	15810
0.19	15807
0.2	15716
0.21	15614
0.22	15335
0.23	15151
0.24	14907
0.25	15041
0.26	14715
0.27	14708
0.28	14637
0.29	14399
0.3	14497
0.31	14000
0.32	14028
0.33	13922
0.34	13958
0.35	13870
0.36	13559
0.37	13504
0.38	13305
0.39	13517
0.4	13459
0.41	13336
0.42	13315
0.43	12986
0.44	13045
0.45	13035
0.46	13086
0.47	12634
0.48	12836
0.49	12877
0.5	12593
0.51	12442
0.52	12462
0.53	12267
0.54	12413
0.55	12494
0.56	12065
0.57	12111
0.58	12312
0.59	11913
0.6	11931
0.61	11758
0.62	11775
0.63	11704
0.64	11836
0.65	11780
0.66	11858
0.67	11738
0.68	11854
0.69	11521
0.7	11461
0.71	11443
0.72	11374
0.73	11495
0.74	11198
0.75	11444
0.76	11234
0.77	11204
0.78	11180
0.79	11086
0.8	11002
0.81	11007
0.82	11230
0.83	11135
0.84	10999
0.85	10758
0.86	10847
0.87	11007
0.88	10739
0.89	10685
0.9	10790
0.91	10647
0.92	10639
0.93	10645
0.94	10570
0.95	10521
0.96	10419
0.97	10296
0.98	10433
0.99	10399

ovasileva added a comment. Edited Nov 6 2018, 3:25 PM

In any case, once the above diagnosis is confirmed, we should file a task for replacing these pre-2006 RAND()-based values with wfRandom() output. Their non-randomness looks like a clear bug with non-negligible long-term consequences.

@Tbayer - just to confirm - are you suggesting that instead of removing them from the sample, we randomize them manually? @mpopov - any thoughts here?

In any case, once the above diagnosis is confirmed, we should file a task for replacing these pre-2006 RAND()-based values with wfRandom() output. Their non-randomness looks like a clear bug with non-negligible long-term consequences.

@Tbayer - just to confirm - are you suggesting that instead of removing them from the sample, we randomize them manually? @mpopov - any thoughts here?

This is about fixing it directly in the database (once and for all). I guess the query for this would be fairly straightforward, but, like any mass change of values, probably requires some scheduling work and giving DBAs a heads-up.

Jdlrobson set the point value for this task to 0. Nov 8 2018, 6:24 PM

Doesn't look like this needs estimating? Please remove the 0 if it does!

Change 472798 had a related patch set uploaded (by Niedzielski; owner: Stephen Niedzielski):
[mediawiki/extensions/Wikibase@master] Hygiene: add PageSplitTester test to be used

https://gerrit.wikimedia.org/r/472798

Still blocked on T208909. The database updater script is merged but execution of said script is still pending. @phuedx and @jcrespo to start Monday.

Change 472798 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Hygiene: add PageSplitTester test to be used

https://gerrit.wikimedia.org/r/472798

Niedzielski removed Niedzielski as the assignee of this task. Nov 13 2018, 7:31 PM
Niedzielski updated the task description.
Niedzielski updated the task description. Nov 13 2018, 7:48 PM
Niedzielski updated the task description. Nov 13 2018, 7:51 PM

@Tbayer, if you have time, please review this prior to launch. If not, I think we should be ok.

@Tbayer, if you have time, please review this prior to launch. If not, I think we should be ok.

The Quarry queries linked under "Results" (with the ORDER BY RAND(...) trick we discussed earlier today) look fine to me.

Regarding the "suspiciously fast" queries that spot-check the condition page_random REGEXP '^0\\.[0-9]+$': the regex does seem to work as intended, see e.g. https://quarry.wmflabs.org/query/31223.
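
As a quick illustration of what that pattern accepts, here is a small Python sketch. It assumes the doubled backslash in the SQL literal is just string escaping, so the effective pattern is `^0\.[0-9]+$`; the helper name and the sample strings are made up for the example, and Python's `re` is only a stand-in for the database's regex engine.

```python
import re

# Effective pattern: a literal "0.", then one or more decimal digits, nothing else.
PAGE_RANDOM_PATTERN = re.compile(r"0\.[0-9]+")


def looks_like_unit_fraction(text: str) -> bool:
    """True if `text` is written as 0.<digits>, i.e. a plain decimal below 1."""
    # fullmatch anchors both ends, mirroring the ^ and $ in the SQL condition.
    return PAGE_RANDOM_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    # Made-up sample strings, for illustration only.
    for sample in ("0.5", "0.503291847261", "0", "1", "1.0", ".5", "5.0E-4", "0.5x"):
        print(f"{sample!r}: {looks_like_unit_fraction(sample)}")
```

Note that a value rendered in scientific notation (such as the "5.0E-4" sample) would not match, so how the column is converted to text matters for this kind of check.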

Tbayer closed this task as Resolved. Nov 14 2018, 8:17 AM
Tbayer updated the task description.
Niedzielski updated the task description. Nov 14 2018, 5:35 PM

Removed frwiki, which is excluded from the test. Updated the queries to exclude redirect pages, which are misleading to evaluate.