
[Bug] Update old nonuniformly distributed page_random values
Closed, Resolved · Public · 13 Story Points

Description

Task description

The page_random values on a set of very old pages (those created between October 11, 2004 and July 7, 2005) are not uniformly distributed. These values must be repopulated in the page table. Thus we need to run a query like the following, for each project:

USE ptwiki; 
UPDATE page
JOIN (
  SELECT DISTINCT rev_page
  FROM revision
  WHERE rev_parent_id = 0
  AND rev_timestamp < '20050708000000'
  AND rev_timestamp > '20041011000000') AS unrandompages
ON page.page_id = unrandompages.rev_page
SET page.page_random = RAND();

(strawman example, not yet fully vetted or tested for production, there might be more performant ways to do this)
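A production run would batch updates on the primary key rather than issuing one giant UPDATE. As an illustrative sketch (not vetted production code), here is a Python helper that generates range-bounded variants of the strawman statement; the id bounds and batch size are placeholders, and the timestamp-based restriction to the affected pages is elided:

```python
def batched_update_sql(min_id, max_id, batch_size=1000):
    """Split the strawman UPDATE into page_id-range batches so each
    statement touches at most batch_size rows."""
    stmts = []
    start = min_id
    while start <= max_id:
        end = min(start + batch_size, max_id + 1)
        stmts.append(
            "UPDATE page SET page_random = RAND() "
            f"WHERE page_id >= {start} AND page_id < {end}"
        )
        start += batch_size
    return stmts

# Placeholder bounds; the real run would target only the affected pages.
stmts = batched_update_sql(1200000, 2200000, batch_size=1000)
```

Each generated statement ranges over the primary key, so the server can use the clustered index and keep per-transaction lock times short.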

Task detail

Given that a prior fix to this field used RAND(), we believe it will be adequate to use again when backfilling these values. Additionally, a change to MysqlUpdater.php is not considered necessary given the lifespan of the original bug. However, documentation of the event should be shared. The update to the data must be completed and replicated prior to launching T208755.

Original task investigation notes

In T5946, page_random values were revised using RAND(). Newly inserted pages today use wfRandom(), which is implemented using two calls to mt_rand(). @Tbayer's analysis has shown that while the distribution of wfRandom() appears to be adequate for pages created after T5946 [1][3], the backfill using RAND() appears to have produced nonuniform values for the older pages (pre-2005) [2][4]; these values must be revised again to restore a uniform pseudorandom distribution.

post-fix[1]
# all enwiki pages with ID > 5,000,000 (which dates from spring 2006, i.e. well after the fix).
SELECT bucket, COUNT(*) AS frequency FROM (
  SELECT FLOOR(page_random * 100) / 100 AS bucket
  FROM  wmf_raw.mediawiki_page 
  WHERE wiki_db = 'enwiki'
  AND snapshot = '2018-09'
  AND page_namespace = 0
  AND page_id > 5000000
) AS buckets
GROUP BY bucket
ORDER BY bucket LIMIT 10000;

bucket	frequency
0.0	122028
0.01	121397
0.02	122179
0.03	122417
0.04	122392
0.05	122321
0.06	121682
0.07	121489
0.08	121873
0.09	121954
0.1	122264
0.11	122114
0.12	121835
0.13	121878
0.14	121783
0.15	121767
0.16	121677
0.17	122601
0.18	121797
0.19	122147
0.2	121682
0.21	122329
0.22	121824
0.23	122239
0.24	122021
0.25	121814
0.26	122085
0.27	121446
0.28	121494
0.29	121787
0.3	121722
0.31	121985
0.32	121987
0.33	121624
0.34	122042
0.35	121940
0.36	121811
0.37	121827
0.38	122309
0.39	122401
0.4	121304
0.41	121588
0.42	122095
0.43	121891
0.44	121609
0.45	122360
0.46	122565
0.47	121763
0.48	121991
0.49	121992
0.5	121727
0.51	122006
0.52	121660
0.53	121486
0.54	121987
0.55	121206
0.56	121678
0.57	121759
0.58	121648
0.59	122148
0.6	121536
0.61	121851
0.62	122088
0.63	121968
0.64	121845
0.65	121256
0.66	121985
0.67	121795
0.68	121282
0.69	122168
0.7	122133
0.71	122245
0.72	121611
0.73	122030
0.74	121374
0.75	121622
0.76	121723
0.77	121664
0.78	122007
0.79	122039
0.8	121316
0.81	121901
0.82	121577
0.83	121523
0.84	122023
0.85	121544
0.86	122103
0.87	121657
0.88	121437
0.89	121488
0.9	121942
0.91	121473
0.92	121371
0.93	120990
0.94	121903
0.95	121451
0.96	121968
0.97	120949
0.98	121821
0.99	121803
Time taken: 195.485 seconds, Fetched: 100 row(s)
pre-fix[2]
# all enwiki pages with ID < 3,000,000 (i.e. created until fall 2005)
SELECT bucket, COUNT(*) AS frequency FROM (
  SELECT FLOOR(page_random * 100) / 100 AS bucket
  FROM  wmf_raw.mediawiki_page 
  WHERE wiki_db = 'enwiki'
  AND snapshot = '2018-09'
  AND page_namespace = 0
  AND page_id < 3000000
) AS buckets
GROUP BY bucket
ORDER BY bucket LIMIT 10000;

bucket	frequency
0.0	28925
0.01	24084
0.02	22576
0.03	21647
0.04	20829
0.05	19893
0.06	19623
0.07	19020
0.08	18168
0.09	18082
0.1	17457
0.11	17494
0.12	17341
0.13	16941
0.14	16918
0.15	16454
0.16	16340
0.17	16320
0.18	15810
0.19	15807
0.2	15716
0.21	15614
0.22	15335
0.23	15151
0.24	14907
0.25	15041
0.26	14715
0.27	14708
0.28	14637
0.29	14399
0.3	14497
0.31	14000
0.32	14028
0.33	13922
0.34	13958
0.35	13870
0.36	13559
0.37	13504
0.38	13305
0.39	13517
0.4	13459
0.41	13336
0.42	13315
0.43	12986
0.44	13045
0.45	13035
0.46	13086
0.47	12634
0.48	12836
0.49	12877
0.5	12593
0.51	12442
0.52	12462
0.53	12267
0.54	12413
0.55	12494
0.56	12065
0.57	12111
0.58	12312
0.59	11913
0.6	11931
0.61	11758
0.62	11775
0.63	11704
0.64	11836
0.65	11780
0.66	11858
0.67	11738
0.68	11854
0.69	11521
0.7	11461
0.71	11443
0.72	11374
0.73	11495
0.74	11198
0.75	11444
0.76	11234
0.77	11204
0.78	11180
0.79	11086
0.8	11002
0.81	11007
0.82	11230
0.83	11135
0.84	10999
0.85	10758
0.86	10847
0.87	11007
0.88	10739
0.89	10685
0.9	10790
0.91	10647
0.92	10639
0.93	10645
0.94	10570
0.95	10521
0.96	10419
0.97	10296
0.98	10433
0.99	10399

[3] https://quarry.wmflabs.org/query/31022 (ptwiki pages created on or after Jan 29, 2006)
[4] https://quarry.wmflabs.org/query/31021 (ptwiki pages created on or before Dec 8, 2005)

Event Timeline

Tbayer updated the task description. (Show Details)Nov 7 2018, 12:06 AM

cur_random values were reset with RAND() by Brion in May 2003, as described at the top of https://en.wikipedia.org/wiki/Wikipedia:Village_pump/Archive_E . But this is not the problem since page_id<1200000 does not show the skewed distribution, meaning that it started in approximately November 2004.

MariaDB [enwiki]> select FLOOR(page_random * 10) / 10 as bucket,count(*) as f from page where page_id between 1000000 and 1200000 group by bucket;
+--------+-------+
| bucket | f     |
+--------+-------+
| 0.0000 | 12993 |
| 0.1000 | 12885 |
| 0.2000 | 12896 |
| 0.3000 | 12999 |
| 0.4000 | 12934 |
| 0.5000 | 13182 |
| 0.6000 | 12908 |
| 0.7000 | 13096 |
| 0.8000 | 13161 |
| 0.9000 | 13048 |
+--------+-------+

MariaDB [enwiki]> select FLOOR(page_random * 10) / 10 as bucket,count(*) as f from page where page_id between 1200000 and 1400000 group by bucket;
+--------+-------+
| bucket | f     |
+--------+-------+
| 0.0000 | 24150 |
| 0.1000 | 17113 |
| 0.2000 | 14540 |
| 0.3000 | 12887 |
| 0.4000 | 11638 |
| 0.5000 | 10635 |
| 0.6000 |  9767 |
| 0.7000 |  9227 |
| 0.8000 |  8511 |
| 0.9000 |  7934 |
+--------+-------+

It's unlikely to be related to T5946. If the UPDATE query was applied in production the same way as in the core updater, with WHERE page_random=0, then it wouldn't have affected enough rows to cause this. If the update was done on all rows, then that wouldn't explain the fact that it started in November 2004. More likely it is some unrelated bug that was introduced in November 2004.

I think it's very unlikely to be a bug in MySQL's RAND().

The upper limit is around page_id 2200000, i.e. July 2005, so we should look for a fix in MW core at around that time.

MariaDB [enwiki]> select FLOOR(page_random * 10) / 10 as bucket,count(*) as f from page where page_id between 2100000 and 2200000 group by bucket;
+--------+------+
| bucket | f    |
+--------+------+
| 0.0000 | 9110 |
| 0.1000 | 7302 |
| 0.2000 | 6777 |
| 0.3000 | 6144 |
| 0.4000 | 5917 |
| 0.5000 | 5646 |
| 0.6000 | 5325 |
| 0.7000 | 5114 |
| 0.8000 | 5020 |
| 0.9000 | 4938 |
+--------+------+
10 rows in set (0.05 sec)

MariaDB [enwiki]> select FLOOR(page_random * 10) / 10 as bucket,count(*) as f from page where page_id between 2200000 and 2300000 group by bucket;
+--------+------+
| bucket | f    |
+--------+------+
| 0.0000 | 5931 |
| 0.1000 | 6025 |
| 0.2000 | 6102 |
| 0.3000 | 6183 |
| 0.4000 | 6210 |
| 0.5000 | 6053 |
| 0.6000 | 6108 |
| 0.7000 | 5948 |
| 0.8000 | 5904 |
| 0.9000 | 6137 |
+--------+------+
10 rows in set (0.06 sec)

It looks like the bug started in 2a03980093b5168a834bbf65e820c5400e29b21b (October 2004), where Wil changed the random number function to be "more random" by multiplying two random numbers together, producing a distribution heavily skewed towards zero. It was fixed in May 2005 in 133b12c9dc98e4f1abe5a1ccb3138612f3837176 and 4d556548eb8f2d020d99b813758fc50706a1821a.
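The diagnosis above (multiplying two uniform random numbers) is easy to reproduce with a quick simulation; this is an illustrative Python sketch, not code from the task:

```python
import random

random.seed(0)
N = 100_000
# The product of two uniform [0, 1) draws -- effectively what the
# October 2004 "more random" change produced.
skewed = [random.random() * random.random() for _ in range(N)]
below = sum(1 for x in skewed if x < 0.1) / N
# Analytically P(XY < t) = t * (1 - ln t); for t = 0.1 that is ~0.33,
# versus 0.10 for a truly uniform value: heavily skewed toward zero.
print(round(below, 2))
```

About a third of the values land in the lowest decile, matching the shape of the pre-fix bucket counts above.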

Niedzielski updated the task description. (Show Details)Nov 7 2018, 1:35 AM
Niedzielski updated the task description. (Show Details)Nov 7 2018, 3:08 AM
Krinkle edited projects, added MediaWiki-General; removed SEO.Nov 7 2018, 3:28 AM
Krinkle added a subscriber: Krinkle.

@Niedzielski Should this issue be considered blocking T208789, or is this okay to resolve at a later time?

@Krinkle, resolving this task would greatly reduce concerns for our upcoming SEO A/B test. Looping in @ovasileva and @phuedx for decisioning :]

daniel added a comment.Nov 7 2018, 8:56 AM

wfRandom currently uses:

$max = mt_getrandmax() + 1;
$rand = number_format( ( mt_rand() * $max + mt_rand() ) / $max / $max, 12, '.', '' );

This should work, though it seems odd to me. The following seems much simpler, and less likely to run into issues with degrading precision in high IEEE float values:

$max = mt_getrandmax();
$rand =  number_format( ( mt_rand() + mt_rand()  ) / ( 2*$max ), 12, '.', '' );

Anyway. Looking at the documentation of mt_rand, I found this:

7.2.0 	mt_rand() has received a bug fix for a modulo bias bug. This means that sequences generated with a specific seed may differ from PHP 7.1 on 64-bit machines.
7.1.0 	rand() has been made an alias of mt_rand(). 
7.1.0 	mt_rand() has been updated to use the fixed, correct, version of the Mersenne Twister algorithm. To fall back to the old behaviour, use mt_srand() with MT_RAND_PHP as the second parameter.

And for mt_srand():

MT_RAND_PHP 	Uses an incorrect Mersenne Twister implementation which was used as the default up till PHP 7.1.0. This mode is available for backward compatibility.

I couldn't find a bug report offhand, but perhaps that "incorrect Mersenne Twister implementation" was the issue? Or the problem is indeed in degrading float precision due to multiplying with mt_getrandmax(). Though I couldn't confirm this in a brief experiment.

wfRandom currently uses:

$max = mt_getrandmax() + 1;
$rand = number_format( ( mt_rand() * $max + mt_rand() ) / $max / $max, 12, '.', '' );

This should work, though it seems odd to me. The following seems much simpler, and less likely to run into issues with degrading precision in high IEEE float values:

$max = mt_getrandmax();
$rand =  number_format( ( mt_rand() + mt_rand()  ) / ( 2*$max ), 12, '.', '' );

That's almost as bad as Wil's idea. That would produce a triangular distribution, like throwing two dice and adding the results. https://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distribution

The point of my code in wfRandom(), which you quoted, is to produce a number uniformly distributed on [0,1) with 53-bit precision. In the case that $max is 2^31, mt_rand() * $max + mt_rand() is an integer uniformly distributed between 0 and 2^62 - 1, although it is truncated to 53 bits of precision on a 32-bit platform. Then you divide it by 2^62 to produce a floating point number between 0 and 1, which then necessarily has 53-bit precision.
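The difference between the two constructions can be checked empirically. A Python sketch mirroring the PHP, with $max taken as 2^31 (an assumption; mt_getrandmax() is platform-dependent):

```python
import random

random.seed(0)
MAX = 2 ** 31  # stand-in for mt_getrandmax() + 1
N = 100_000

def combined():
    # Two 31-bit draws packed into one 62-bit integer, scaled to [0, 1):
    # uniform, as the wfRandom() construction intends.
    return (random.randrange(MAX) * MAX + random.randrange(MAX)) / MAX / MAX

def summed():
    # Sum of two uniforms: triangular (Irwin-Hall), peaked at 0.5.
    return (random.randrange(MAX) + random.randrange(MAX)) / (2 * MAX)

# Fraction of draws falling in the middle fifth, [0.4, 0.6):
mid_combined = sum(1 for _ in range(N) if 0.4 <= combined() < 0.6) / N
mid_summed = sum(1 for _ in range(N) if 0.4 <= summed() < 0.6) / N
print(round(mid_combined, 2), round(mid_summed, 2))
```

The combined form yields about 0.20 in the middle fifth (uniform), while the summed form yields about 0.36 (triangular peak at 0.5).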

I couldn't find a bug report offhand, but perhaps that "incorrect Mersenne Twister implementation" was the issue? Or the problem is indeed in degrading flat precision due to multiplying with mt_getrandmax(). Though I coouldn't confirm this in a brief experiment.

No, as I said in T208909#4727045, the issue was Wil's catastrophically awful attempt at combining two random numbers, producing a non-uniform distribution. I killed it with fire at first sight, rashly, without considering what damage it had inflicted on the database in the 9 months it was deployed. There was no bug in RAND() or mt_rand() or any precision issue.

It looks like the bug started in 2a03980093b5168a834bbf65e820c5400e29b21b (October 2004), where Wil changed the random number function to be "more random" by multiplying two random numbers together, producing a distribution heavily skewed towards zero. It was fixed in May 2005 in 133b12c9dc98e4f1abe5a1ccb3138612f3837176 and 4d556548eb8f2d020d99b813758fc50706a1821a.

So based on this, it seems we will need to update the page_random field in less than 600,000 rows of the English Wikipedia's page table[1], and less than 1.5 million rows across all projects[2].

[1]
SELECT COUNT(DISTINCT rev_page)
FROM wmf_raw.mediawiki_revision
WHERE wiki_db = 'enwiki'
AND snapshot = '2018-09'
AND rev_parent_id = 0
AND rev_timestamp < '20050601000000'
AND rev_timestamp > '20041011000000' -- https://phabricator.wikimedia.org/T208909#4727045
;

_c0
584236
[2]
SELECT COUNT(DISTINCT rev_page)
FROM wmf_raw.mediawiki_revision
WHERE snapshot = '2018-09'
AND rev_parent_id = 0
AND rev_timestamp < '20050601000000'
AND rev_timestamp > '20041011000000' -- https://phabricator.wikimedia.org/T208909#4727045
;
_c0
1476802

(Still facepalming about "more random"... oh man.)

tstarling added a comment.EditedNov 7 2018, 11:41 AM

AND rev_timestamp < '20050601000000'

We didn't have CI back then, we deployed major versions after months of delay. I think the correct end date is the deployment of MediaWiki 1.5, i.e. 2005-07-07 per https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_4#July_6 . This agrees with the query results I posted previously.

daniel added a comment.Nov 7 2018, 1:28 PM

That's almost as bad as Wil's idea. That would produce a triangular distribution, like throwing two dice and adding the results. https://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distribution

Ha, of course, sorry. Stochastics was never my strong suit. Your solution puts random things into more bits, which is what we want. I still worry that it may cause issues if mt_getrandmax() gets very large, but I suppose it's not going to exceed 2^64 soon.

No, as I said in T208909#4727045, the issue was Wil's catastrophically awful attempt at combining two random numbers, producing a non-uniform distribution. I killed it with fire at first sight, rashly, without considering what damage it had inflicted on the database in the 9 months it was deployed. There was no bug in RAND() or mt_rand() or any precision issue.

Oh, I thought that bad random numbers were still produced after your fix, and we were looking for reasons. Never mind, then.

phuedx added a project: DBA.Nov 7 2018, 6:43 PM
Niedzielski renamed this task from [Bug] Old page_random values are nonuniformly distributed to [Bug] Update old nonuniformly distributed page_random values.Nov 7 2018, 8:21 PM
Niedzielski updated the task description. (Show Details)
Tbayer updated the task description. (Show Details)Nov 7 2018, 8:37 PM
Tbayer updated the task description. (Show Details)Nov 8 2018, 9:03 AM
jcrespo added a subscriber: jcrespo.Nov 8 2018, 5:21 PM

I was asked to participate on this task explicitly:

I have no idea what this task is about but please do not think of running the query in the description AS IS. You need to create batching to update 100-1000 rows in each transaction and select (update) based on the primary key. Not sure if there is somewhere else other than page that stores supposedly random values. This is such a common task that we should have "perPageBatched" and "perRevisionBatched" skeletons/classes precreated in mediawiki/maintenance.

Other than batching and waiting to prevent service downtime, I don't have any further input on this; mediawiki platform owners (core, readers, or not sure who) have the decision on how, where and when, and if they decide to run such a script, as long as lag is not created (preventable with batching + waitforreplica() functions) they don't need my or anyone's permission. Sorry, there is not much I can add to the conversation.

Update from the airport for the practical bits: send a patch with a foreachwiki PHP script (check other past cases with similar tasks) so many people can review it and +1 it. If it takes less than 1 hour to run, !log on IRC #wikimedia-operations and run it during its own deployment window; if it is longer, write it on the [[wikitech:Deployments]] page in the "week of:" section, as recommended under "Long running tasks/scripts". Make sure no other long-running tasks such as T166733 are writing to the same tables, to prevent locking (e.g. deadlocks). Even if the reporter does not have access to production, once it is deployed into production, anyone with server access will be able to run it (not only ops: releng and several developers, too).

I will return to work on Monday.

Thanks for the feedback, @jcrespo. @pmiazga and @phuedx are working on a new PHP MySQL script per your advice. We’d greatly appreciate your and your colleagues’ review of it and have some additional questions:

  1. We don’t know how long this script will take to run. Our initial draft query demonstrates page_random repopulation using MySQL’s RAND(). Since we’re taking the trouble to write a maintenance script, we intend to use the preferred wfRandom() MediaWiki function instead, which matches the implementation for newly inserted pages today. We estimate that about 1.5 million rows will be changed across projects, ~0.6 million of those on enwiki. In your experience, how long does a change like this take?
  2. Can you help us identify parties we should coordinate with based on your experience making similar changes? In your comment you mentioned #wikimedia-operations if the update was quick, but per the previous bullet, we’re unsure. Does ~1.5 million rows seem like a long-running change? Would it be necessary or advisable to divide this change into enwiki and non-enwiki updates? Lastly, are Anomie’s T166733 and T188132 scripts running constantly or intermittently? They seem to be scheduled for weeks! Would they interfere with a simultaneous update to page_random?
  3. When would a change to page_random appear on the replicas?
  4. How should we structure our write waits for the slaves? After each write or after a batch write? Is there a specific example precedent we should follow? We’re digging into examples on our end but would appreciate any additional detail you can provide.
  5. Our initial draft query uses page ID, the primary key. Could you clarify your feedback on select / update?

Thank you so much for your help!!

Jdlrobson set the point value for this task to 13.Nov 8 2018, 10:20 PM
Jdlrobson added a subscriber: Jdlrobson.

Team has async estimated this as a super risky 13. https://wikimedia.slack.com/archives/G8QAPHCTT/p1541701428235000

Change 472596 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/core@master] DNM: Provide a script to repopuplate page_random column

https://gerrit.wikimedia.org/r/472596

I have no idea what this task is about but please do not think of running the query in the description AS IS. You need to create batching to update 100-1000 rows in each transaction and select (update) based on the primary key. Not sure if there is somewhere else other than page that stores supposedly random values. This is such a common task that we should have "perPageBatched" and "perRevisionBatched" skeletons/classes precreated in mediawiki/maintenance.

For the record there is https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/maintenance/runBatchedQuery.php which is related.

Is Slack something Wikimedia uses now? That link appears to require a Foundation Google login.

AND rev_timestamp < '20050601000000'

We didn't have CI back then, we deployed major versions after months of delay. I think the correct end date is the deployment of MediaWiki 1.5, i.e. 2005-07-07 per https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_4#July_6 . This agrees with the query results I posted previously.

Thanks, so here is the updated estimate for the total number of rows (pages) affected across all projects: 1.62 million, instead of <1.5 million. (The query posted in the task yesterday already reflected the larger timespan.)

SELECT COUNT(DISTINCT rev_page)
FROM wmf_raw.mediawiki_revision
WHERE snapshot = '2018-09'
AND rev_parent_id = 0
AND rev_timestamp < '20050708000000'
AND rev_timestamp > '20041011000000' -- https://phabricator.wikimedia.org/T208909#4727045
;

_c0
1620550
Time taken: 383.897 seconds, Fetched: 1 row(s)

Thanks for the feedback, @jcrespo. @pmiazga and @phuedx are working on a new PHP MySQL script per your advice. We’d greatly appreciate your and your colleagues’ review of it and have some additional questions:

[...]

  2. Can you help us identify parties we should coordinate with based on your experience making similar changes? In your comment you mentioned #wikimedia-operations if the update was quick, but per the previous bullet, we’re unsure. Does ~1.5 million rows seem like a long-running change? Would it be necessary or advisable to divide this change into enwiki and non-enwiki updates? Lastly, are Anomie’s T166733 and T188132 scripts running constantly or intermittently? They seem to be scheduled for weeks! Would they interfere with a simultaneous update to page_random?

FWIW, it seems that @Anomie's updates are not operating on the page table that we are concerned with here, but on different tables (namely image for T188132, and revision, archive, logging, ipblocks, image, oldimage, filearchive, protected_titles and recentchanges for T166733, according to T166733#4709400 - @Anomie, can you confirm?).

I understand @jcrespo's warning in T208909#4732780 was about writing to the same tables.

Thanks for the feedback, @jcrespo. @pmiazga and @phuedx are working on a new PHP MySQL script per your advice. We’d greatly appreciate your and your colleagues’ review of it and have some additional questions:

  1. We don’t know how long this script will take to run. Our initial draft query demonstrates page_random repopulation using MySQL’s RAND(). Since we’re taking the trouble to write a maintenance script, we intend to use the preferred wfRandom() MediaWiki function instead, which matches the implementation for newly inserted pages today. We estimate that about 1.5 million rows will be changed across projects, ~0.6 million of those on enwiki. In your experience, how long does a change like this take?

Yes, this would be great to know - I have no idea myself either, but looking at the above mentioned case in T166733#4709400 , it seems that batched updating of 16.8 million rows (across 9 tables) took 11 hours there.

Banyek added a subscriber: Banyek.Nov 9 2018, 11:40 AM
Anomie added a comment.EditedNov 9 2018, 3:22 PM
  1. We don’t know how long this script will take to run. [...] In your experience, how long does a change like this take?

In my experience it tends to depend rather heavily on how efficient your "select one batch of rows" query is, and on what operations exactly you need to run for each batch, and how big you can make each batch while keeping lock times (e.g. from transactions) within acceptable limits.

Usually what I do is run whatever script on the test wikis and mediawikiwiki and extrapolate the timing from there. Another option is to load a dump into a database of your own and hope differences in machine power and replication waiting don't make too much of a difference.

  3. When would a change to page_random appear on the replicas?

Typically within seconds of updating it on the master.

  4. How should we structure our write waits for the slaves? After each write or after a batch write? Is there a specific example precedent we should follow? We’re digging into examples on our end but would appreciate any additional detail you can provide.

Each batch, generally. If you wind up using Maintenance::beginTransaction() and Maintenance::commitTransaction(), the latter does that for you. If your update doesn't need transactions, you'd call LoadBalancerFactory's waitForReplication() or wfWaitForSlaves() yourself after each batch.

  5. Our initial draft query uses page ID, the primary key. Could you clarify your feedback on select / update?

The process should look something like this:

  1. Select a batch of rows with something like SELECT page_id, ... FROM page WHERE ... ORDER BY ... LIMIT 1000 or SELECT page_id, ... FROM page WHERE page_id >= X AND page_id < X+1000 AND ..., or some other query that efficiently fetches one batch of rows.
  2. Update each row like UPDATE page SET ... WHERE page_id = Y.
  3. Wait for replicas as mentioned above.
  4. Repeat the above for the next batch, until you run out of batches.
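The loop described above can be sketched as follows; this is Python pseudocode for the control flow only, and update_batch / wait_for_replicas are hypothetical stand-ins for the per-row UPDATEs and the MediaWiki replication wait:

```python
def batch_ranges(min_id, max_id, batch_size):
    """Yield half-open [start, end) primary-key ranges covering
    min_id..max_id, batch_size ids at a time (step 1 above)."""
    start = min_id
    while start <= max_id:
        yield start, min(start + batch_size, max_id + 1)
        start += batch_size

def run(min_id, max_id, batch_size, update_batch, wait_for_replicas):
    for start, end in batch_ranges(min_id, max_id, batch_size):
        # Step 2: update rows with start <= page_id < end.
        update_batch(start, end)
        # Step 3: wait for replicas after each batch, not each row.
        wait_for_replicas()
```

Waiting once per batch bounds replication lag while keeping the number of round trips manageable.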

FWIW, it seems that @Anomie's updates are not operating on the page table that we are concerned with here, but on different tables (namely image for T188132, and revision, archive, logging, ipblocks, image, oldimage, filearchive, protected_titles and recentchanges for T166733, according to T166733#4709400 - @Anomie, can you confirm?).

That's correct. Also the updates are almost done, only enwiki.logging and enwiki.protected_titles remain to be processed.

@Anomie, thank you for this thoughtful feedback. It's very much appreciated!

That's correct. Also the updates are almost done, only enwiki.logging and enwiki.protected_titles remain to be processed.

What do you think of the approach we've taken? Do you have any concerns with the work you're doing if we run this script?

Anomie added a comment.Nov 9 2018, 4:57 PM

I don't see anything where your stuff would interact with my stuff.

ovasileva raised the priority of this task from Normal to High.Nov 9 2018, 11:21 PM

Change 472596 merged by jenkins-bot:
[mediawiki/core@master] Provide a script to reset the page_random column

https://gerrit.wikimedia.org/r/472596

@Tgr saved the day and merged the updater script. @phuedx and @jcrespo to execute script Monday.

Change 472958 had a related patch set uploaded (by Phuedx; owner: Pmiazga):
[mediawiki/core@wmf/1.33.0-wmf.3] Provide a script to reset the page_random column

https://gerrit.wikimedia.org/r/472958

Change 472958 merged by Phuedx:
[mediawiki/core@wmf/1.33.0-wmf.3] Provide a script to reset the page_random column

https://gerrit.wikimedia.org/r/472958

Mentioned in SAL (#wikimedia-operations) [2018-11-12T13:14:57Z] <phuedx@deploy1001> Synchronized php-1.33.0-wmf.3/autoload.php: SWAT: [[gerrit:472958|Provide a script to reset the page_random column (T208909)]] (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2018-11-12T13:16:45Z] <phuedx@deploy1001> Synchronized php-1.33.0-wmf.3/maintenance/resetPageRandom.php: SWAT: [[gerrit:472958|Provide a script to reset the page_random column (T208909)]] (duration: 00m 53s)

The script is currently running against all small wikis.

… and now against all medium wikis.

… and now against all large wikis.

Niedzielski reassigned this task from phuedx to Tbayer.Nov 12 2018, 7:27 PM

Thank you, @phuedx! @Tbayer to resolve. Moving to sign off.

Repeating query [4] from the task description, the distribution on ptwiki pages created on or before Dec 8, 2005 looks plausible now: https://quarry.wmflabs.org/query/31152
Will do the other checks by tomorrow evening PST (note that the Hive query [2] can't be re-run directly right now as it depends on the monthly Data Lake snapshot, but of course we can run it in MySQL/MariaDB elsewhere).

Tbayer closed this task as Resolved.Nov 14 2018, 7:27 AM

Checked that the following sets of pages look quite uniformly distributed now:
enwiki: https://quarry.wmflabs.org/query/31221 (a version of [2] from the task description that actually completes on Quarry)
enwiki: https://quarry.wmflabs.org/query/31220
Commons: https://quarry.wmflabs.org/query/31218
dewiki: https://quarry.wmflabs.org/query/31219

As promised, here are the notes that I took whilst running the resetPageRandom.php script on 12th November:

  • @Marostegui recommends a batch size of 200 (the default)
  • Run the script on testwiki, test2wiki, and mediawikiwiki first, looking for errors, e.g.
cd /srv/mediawiki-staging/php
mwscript maintenance/resetPageRandom.php --wiki testwiki --from=20041011000000 --to=20050708000000 --batch-size=200
  • Run the script on small wikis, i.e. foreachwikiindblist small.dblist maintenance/resetPageRandom.php --from=20041011000000 --to=20050708000000 --batch-size=200
  • Run the script on medium wikis
    • The set of medium wikis contains the most affected wiki, bgwiktionary, per P7803
  • Run the script on large wikis
  • Finally, run the script on zhwikivoyage and zhwiktionary, which don't occur in {small,medium,large}.dblist
  • It took ~6 hours for the script to run on all wikis

Random thought related to this: part of the fun of "random page" was getting really weird things, and this bug might have helped with that. The older the page, the higher the probability it's a more common topic (Britney Spears was added before Free energy principle). Not sure how everyone feels about messing with "random page", but we could for example build a service that picks a random page, making it less likely to be picked if it has high pageviews.

@Milimetric, for better or worse, I believe the RESTBase endpoints (e.g., https://en.wikipedia.org/api/rest_v1/page/random/title) used by the native apps contain some special logic to strongly favor "interesting" articles, where "interesting" means pages with a minimum amount of content and an image, IIRC, so it wouldn't be without precedent.

Finally, run the script on zhwikivoyage and zhwiktionary, which don't occur in {small,medium,large}.dblist

That seems like a bug. Could you follow up on that in another task? I remember there being a structure unit test in wmf-config that is meant to protect against such an (invalid) state.

Anomie added a comment.Dec 3 2018, 3:35 PM

Finally, run the script on zhwikivoyage and zhwiktionary, which don't occur in {small,medium,large}.dblist

That seems like a bug. Could you follow up on that in another task? I remember there being a structure unit test in wmf-config that is meant to protect against such an (invalid) state.

I see zhwikivoyage in medium.dblist and zhwiktionary in large.dblist.

phuedx added a comment.Dec 3 2018, 3:52 PM

PEBCAK error! Great! I was likely looking at the wrong list(s) when I was trying to find 'em.