Task description
The page_random values on a set of very old pages (those created between October 11, 2004 and July 7, 2005) are not uniformly distributed. These values must be repopulated in the page table. Thus we need to run a query like the following, for each project:
USE ptwiki; UPDATE page JOIN ( SELECT DISTINCT rev_page FROM revision WHERE rev_parent_id = 0 AND rev_timestamp < '20050708000000' AND rev_timestamp > '20041011000000') AS unrandompages ON page.page_id = unrandompages.rev_page SET page.page_random = RAND();
(strawman example, not yet fully vetted or tested for production, there might be more performant ways to do this)
Task detail
Given that [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/ebbccf1845abbd1030ce8e3c7d27c2014f2b66ce/includes/installer/MysqlUpdater.php#903 | a prior fix to this field used RAND() ]] we believe it will be adequate to use again when backfilling these values. Additionally, a change to MysqlUpdater.php is not considered necessary given the lifespan of the original bug. However, documentation of the event should be shared. The update to the data must be completed and replicated prior to launching T208755.
Original task investigation notes
In T5946, [[ https://gerrit.wikimedia.org/g/mediawiki/core/+/3e0c8a42d75e0112d46ab6a06c093c94cd73ec12/includes/installer/MysqlUpdater.php#896 | page_random values were revised ]] using [[ https://dev.mysql.com/doc/refman/8.0/en/mathematical-functions.html#function_rand | RAND() ]]. [[ https://gerrit.wikimedia.org/g/mediawiki/core/+/3e0c8a42d75e0112d46ab6a06c093c94cd73ec12/includes/page/WikiPage.php#1315 | Newly insert pages today use wfRandom() ]], which is implemented using two calls to [[ http://php.net/manual/en/function.mt-rand.php | mt_rand() ]]. @Tbayer's analysis has shown that while the distribution of wfRandom() appears to be adequate for pages created after T5946 [1][3], the backfill using RAND() appears to have produced nonuniform values for the older pages (-2005) [2][4]; these values must be revised again to offer pseudo-randomness.
# all enwiki pages with ID > 5,000,000 (which dates from spring 2006, i.e. well after the fix). SELECT bucket, COUNT(*) AS frequency FROM ( SELECT FLOOR(page_random * 100) / 100 AS bucket FROM wmf_raw.mediawiki_page WHERE wiki_db = 'enwiki' AND snapshot = '2018-09' AND page_namespace = 0 AND page_id > 5000000 ) AS buckets GROUP BY bucket ORDER BY bucket LIMIT 10000; bucket frequency 0.0 122028 0.01 121397 0.02 122179 0.03 122417 0.04 122392 0.05 122321 0.06 121682 0.07 121489 0.08 121873 0.09 121954 0.1 122264 0.11 122114 0.12 121835 0.13 121878 0.14 121783 0.15 121767 0.16 121677 0.17 122601 0.18 121797 0.19 122147 0.2 121682 0.21 122329 0.22 121824 0.23 122239 0.24 122021 0.25 121814 0.26 122085 0.27 121446 0.28 121494 0.29 121787 0.3 121722 0.31 121985 0.32 121987 0.33 121624 0.34 122042 0.35 121940 0.36 121811 0.37 121827 0.38 122309 0.39 122401 0.4 121304 0.41 121588 0.42 122095 0.43 121891 0.44 121609 0.45 122360 0.46 122565 0.47 121763 0.48 121991 0.49 121992 0.5 121727 0.51 122006 0.52 121660 0.53 121486 0.54 121987 0.55 121206 0.56 121678 0.57 121759 0.58 121648 0.59 122148 0.6 121536 0.61 121851 0.62 122088 0.63 121968 0.64 121845 0.65 121256 0.66 121985 0.67 121795 0.68 121282 0.69 122168 0.7 122133 0.71 122245 0.72 121611 0.73 122030 0.74 121374 0.75 121622 0.76 121723 0.77 121664 0.78 122007 0.79 122039 0.8 121316 0.81 121901 0.82 121577 0.83 121523 0.84 122023 0.85 121544 0.86 122103 0.87 121657 0.88 121437 0.89 121488 0.9 121942 0.91 121473 0.92 121371 0.93 120990 0.94 121903 0.95 121451 0.96 121968 0.97 120949 0.98 121821 0.99 121803 Time taken: 195.485 seconds, Fetched: 100 row(s)
# all enwiki pages with ID < 3,000,000 (i.e created until fall 2005) SELECT bucket, COUNT(*) AS frequency FROM ( SELECT FLOOR(page_random * 100) / 100 AS bucket FROM wmf_raw.mediawiki_page WHERE wiki_db = 'enwiki' AND snapshot = '2018-09' AND page_namespace = 0 AND page_id < 3000000 ) AS buckets GROUP BY bucket ORDER BY bucket LIMIT 10000; bucket frequency 0.0 28925 0.01 24084 0.02 22576 0.03 21647 0.04 20829 0.05 19893 0.06 19623 0.07 19020 0.08 18168 0.09 18082 0.1 17457 0.11 17494 0.12 17341 0.13 16941 0.14 16918 0.15 16454 0.16 16340 0.17 16320 0.18 15810 0.19 15807 0.2 15716 0.21 15614 0.22 15335 0.23 15151 0.24 14907 0.25 15041 0.26 14715 0.27 14708 0.28 14637 0.29 14399 0.3 14497 0.31 14000 0.32 14028 0.33 13922 0.34 13958 0.35 13870 0.36 13559 0.37 13504 0.38 13305 0.39 13517 0.4 13459 0.41 13336 0.42 13315 0.43 12986 0.44 13045 0.45 13035 0.46 13086 0.47 12634 0.48 12836 0.49 12877 0.5 12593 0.51 12442 0.52 12462 0.53 12267 0.54 12413 0.55 12494 0.56 12065 0.57 12111 0.58 12312 0.59 11913 0.6 11931 0.61 11758 0.62 11775 0.63 11704 0.64 11836 0.65 11780 0.66 11858 0.67 11738 0.68 11854 0.69 11521 0.7 11461 0.71 11443 0.72 11374 0.73 11495 0.74 11198 0.75 11444 0.76 11234 0.77 11204 0.78 11180 0.79 11086 0.8 11002 0.81 11007 0.82 11230 0.83 11135 0.84 10999 0.85 10758 0.86 10847 0.87 11007 0.88 10739 0.89 10685 0.9 10790 0.91 10647 0.92 10639 0.93 10645 0.94 10570 0.95 10521 0.96 10419 0.97 10296 0.98 10433 0.99 10399
[3] https://quarry.wmflabs.org/query/31022 (ptwiki pages created on or after Jan 29, 2006)
[4] https://quarry.wmflabs.org/query/31021 (ptwiki pages created on or before Dec 8, 2005)