== Task description
The `page_random` values on a set of very old pages (those created between October 11, 2004 and July 7, 2005) are not uniformly distributed. These values must be repopulated in the `page` table. Thus we need to run a query like the following, for each project:
```lang=sql
USE ptwiki;
UPDATE page
JOIN (
SELECT DISTINCT rev_page
FROM revision
WHERE rev_parent_id = 0
AND rev_timestamp < '20050708000000'
AND rev_timestamp > '20041011000000') AS unrandompages
ON page.page_id = unrandompages.rev_page
SET page.page_random = RAND();
```
//(strawman example, not yet fully vetted or tested for production, there might be more performant ways to do this)//
== Task detail
Given that [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/ebbccf1845abbd1030ce8e3c7d27c2014f2b66ce/includes/installer/MysqlUpdater.php#903 | a prior fix to this field used `RAND()` ]] we believe it will be adequate to use again when backfilling these values. Additionally, a change to MysqlUpdater.php is not considered necessary given the lifespan of the original bug. However, documentation of the event should be shared. The update to the data must be completed and replicated prior to launching T208755.
== Original task investigation notes
In T5946, [[ https://gerrit.wikimedia.org/g/mediawiki/core/+/3e0c8a42d75e0112d46ab6a06c093c94cd73ec12/includes/installer/MysqlUpdater.php#896 | `page_random` values were revised ]] using [[ https://dev.mysql.com/doc/refman/8.0/en/mathematical-functions.html#function_rand | `RAND()` ]]. [[ https://gerrit.wikimedia.org/g/mediawiki/core/+/3e0c8a42d75e0112d46ab6a06c093c94cd73ec12/includes/page/WikiPage.php#1315 | Newly insert pages today use `wfRandom()` ]], [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/a5352157c7209084c85ac1bacc6c4b06f8c423f0/includes/GlobalFunctions.php#268 | which is implemented using ]] two calls to [[ http://php.net/manual/en/function.mt-rand.php | `mt_rand()` ]]. @tbayer's analysis has shown that while the distribution of `wfRandom()` appears to be adequate for pages created after T5946 [1][3], the backfill using `RAND()` appears to have produced nonuniform values for the older pages (-2005) [2][4]; these values must be revised again to offer pseudo-randomness.
```lang=sql,lines=10,name=post-fix[1]
# all enwiki pages with ID > 5,000,000 (which dates from spring 2006, i.e. well after the fix).
SELECT bucket, COUNT(*) AS frequency FROM (
SELECT FLOOR(page_random * 100) / 100 AS bucket
FROM wmf_raw.mediawiki_page
WHERE wiki_db = 'enwiki'
AND snapshot = '2018-09'
AND page_namespace = 0
AND page_id > 5000000
) AS buckets
GROUP BY bucket
ORDER BY bucket LIMIT 10000;
bucket frequency
0.0 122028
0.01 121397
0.02 122179
0.03 122417
0.04 122392
0.05 122321
0.06 121682
0.07 121489
0.08 121873
0.09 121954
0.1 122264
0.11 122114
0.12 121835
0.13 121878
0.14 121783
0.15 121767
0.16 121677
0.17 122601
0.18 121797
0.19 122147
0.2 121682
0.21 122329
0.22 121824
0.23 122239
0.24 122021
0.25 121814
0.26 122085
0.27 121446
0.28 121494
0.29 121787
0.3 121722
0.31 121985
0.32 121987
0.33 121624
0.34 122042
0.35 121940
0.36 121811
0.37 121827
0.38 122309
0.39 122401
0.4 121304
0.41 121588
0.42 122095
0.43 121891
0.44 121609
0.45 122360
0.46 122565
0.47 121763
0.48 121991
0.49 121992
0.5 121727
0.51 122006
0.52 121660
0.53 121486
0.54 121987
0.55 121206
0.56 121678
0.57 121759
0.58 121648
0.59 122148
0.6 121536
0.61 121851
0.62 122088
0.63 121968
0.64 121845
0.65 121256
0.66 121985
0.67 121795
0.68 121282
0.69 122168
0.7 122133
0.71 122245
0.72 121611
0.73 122030
0.74 121374
0.75 121622
0.76 121723
0.77 121664
0.78 122007
0.79 122039
0.8 121316
0.81 121901
0.82 121577
0.83 121523
0.84 122023
0.85 121544
0.86 122103
0.87 121657
0.88 121437
0.89 121488
0.9 121942
0.91 121473
0.92 121371
0.93 120990
0.94 121903
0.95 121451
0.96 121968
0.97 120949
0.98 121821
0.99 121803
Time taken: 195.485 seconds, Fetched: 100 row(s)
```
```lang=sql,lines=10,name=pre-fix[2]
# all enwiki pages with ID < 3,000,000 (i.e created until fall 2005)
SELECT bucket, COUNT(*) AS frequency FROM (
SELECT FLOOR(page_random * 100) / 100 AS bucket
FROM wmf_raw.mediawiki_page
WHERE wiki_db = 'enwiki'
AND snapshot = '2018-09'
AND page_namespace = 0
AND page_id < 3000000
) AS buckets
GROUP BY bucket
ORDER BY bucket LIMIT 10000;
bucket frequency
0.0 28925
0.01 24084
0.02 22576
0.03 21647
0.04 20829
0.05 19893
0.06 19623
0.07 19020
0.08 18168
0.09 18082
0.1 17457
0.11 17494
0.12 17341
0.13 16941
0.14 16918
0.15 16454
0.16 16340
0.17 16320
0.18 15810
0.19 15807
0.2 15716
0.21 15614
0.22 15335
0.23 15151
0.24 14907
0.25 15041
0.26 14715
0.27 14708
0.28 14637
0.29 14399
0.3 14497
0.31 14000
0.32 14028
0.33 13922
0.34 13958
0.35 13870
0.36 13559
0.37 13504
0.38 13305
0.39 13517
0.4 13459
0.41 13336
0.42 13315
0.43 12986
0.44 13045
0.45 13035
0.46 13086
0.47 12634
0.48 12836
0.49 12877
0.5 12593
0.51 12442
0.52 12462
0.53 12267
0.54 12413
0.55 12494
0.56 12065
0.57 12111
0.58 12312
0.59 11913
0.6 11931
0.61 11758
0.62 11775
0.63 11704
0.64 11836
0.65 11780
0.66 11858
0.67 11738
0.68 11854
0.69 11521
0.7 11461
0.71 11443
0.72 11374
0.73 11495
0.74 11198
0.75 11444
0.76 11234
0.77 11204
0.78 11180
0.79 11086
0.8 11002
0.81 11007
0.82 11230
0.83 11135
0.84 10999
0.85 10758
0.86 10847
0.87 11007
0.88 10739
0.89 10685
0.9 10790
0.91 10647
0.92 10639
0.93 10645
0.94 10570
0.95 10521
0.96 10419
0.97 10296
0.98 10433
0.99 10399
```
[3] https://quarry.wmflabs.org/query/31022 (ptwiki pages created on or after Jan 29, 2006)
[4] https://quarry.wmflabs.org/query/31021 (ptwiki pages created on or before Dec 8, 2005)