Random pages are not truly random
Closed, Resolved | Public | BUG REPORT

Description

According to the research at https://colinmorris.github.io/blog/unpopular-wiki-articles, the current implementation of random page is not actually random: the gaps between the random values assigned to articles are non-uniform, which skews the probability that any given page will be returned by random page. Could this process be made more random?

Proposed solution:
On page updates (caused by edits, purges, or reparses triggered by template changes), we flip a coin and, 10% of the time, change the value of page_random at the same time as changing page_touched.

Event Timeline

Also related to T22208, which reports duplicate page_random values. It turns out that there are currently more than 1000 groups of articles on English Wikipedia with duplicate page_random. In other words, the gap is 0 and some of those articles will never be picked. An obvious workaround would be to use a maintenance script or something similar to periodically regenerate page_random. With that, the distribution is still skewed within any one period, but in the long run each article has a fair chance of being picked.
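To make the effect of gaps and duplicates concrete, here is a minimal Python sketch. It assumes the selection works by drawing a random number r and returning the first page whose page_random is >= r, wrapping around to the lowest value if nothing matches. The page titles, the values, the tie-breaking between duplicates, and the function name `pick_random_page` are all made up for illustration; this is not the MediaWiki code.

```python
import bisect
import random
from collections import Counter


def pick_random_page(pages, r):
    """Return the title of the first page whose page_random is >= r.

    `pages` is a list of (page_random, title) tuples sorted by page_random.
    If r is larger than every stored value, wrap around to the first page.
    """
    keys = [value for value, _ in pages]
    index = bisect.bisect_left(keys, r)
    return pages[index % len(pages)][1]


# Hypothetical page_random values, chosen to show uneven gaps.
pages = sorted([
    (0.10, "A"),
    (0.15, "B"),  # tiny gap after A, so B is rarely picked
    (0.60, "C"),  # large gap after B, so C is picked often
    (0.60, "D"),  # duplicate of C: gap of 0, never picked in this sketch
    (0.90, "E"),
])

rng = random.Random(42)
counts = Counter(pick_random_page(pages, rng.random()) for _ in range(100_000))

for title in "ABCDE":
    print(title, counts[title])
```

In this model each page's chance of being picked equals the size of the gap below its page_random value, so C is returned far more often than B, and D is never returned at all.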

A simple way is to regularly reset page_random value for each page.

On Commons, there are 140M pages; updating all of them would take at least a week and is not really feasible.

A simpler approach would be to flip a coin whenever page_touched is updated and, on a hit, update page_random in the same transaction. Let me check how hard it would be to implement this.
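A rough Python sketch of that idea, assuming a page is just a dict with page_touched and page_random fields. The function name `touch_page`, the `reroll_probability` parameter, and the in-memory dict are hypothetical stand-ins; the real change lives in MediaWiki core and writes both columns in one database transaction.

```python
import random


def touch_page(page, now, rng, reroll_probability=0.10):
    """Update page_touched and, with the given probability, page_random too.

    Returns True if page_random was regenerated on this touch.
    """
    page["page_touched"] = now
    if rng.random() < reroll_probability:
        # Coin flip came up: assign a fresh random value in the same update.
        page["page_random"] = rng.random()
        return True
    return False


rng = random.Random(0)
page = {"page_touched": 0, "page_random": 0.5}

# Simulate many updates (edits, purges, reparses) of the same page.
rerolls = sum(touch_page(page, now, rng) for now in range(1, 10_001))
print(f"page_random regenerated on {rerolls} of 10000 touches")
```

Because the re-roll piggybacks on a write that is already happening, it needs no extra pass over the table, which is the point of the approach compared with resetting every row.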

Change 984298 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Title: Update page_random at random while updating page_touched

https://gerrit.wikimedia.org/r/984298

Change 984298 merged by jenkins-bot:

[mediawiki/core@master] Title: Update page_random at random while updating page_touched

https://gerrit.wikimedia.org/r/984298

This should probably be mentioned in Tech News. Besides that, while we can't really fix the randomness issues, this mitigates the problem to the best degree possible without hurting storage or speed. Shall we close this?

Re: Tech News - What wording and link(s) would you suggest as the content? I'd guess something like this, but perhaps there are better or more accurate ways to phrase it? Please propose wording, or add directly to the next issue (frozen for translations in ~26 hours).

Recent changes

  • The way that Random pages are selected has been updated. This will reduce the problem of some pages having a lower chance of appearing.

Links-wise, it could just link to this task, but if so, perhaps we could update the Task Description with a summary of what was changed?

Hi, I wrote something, please reword or change mercilessly. I'll edit the description now.

Jdforrester-WMF assigned this task to Ladsgroup.
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Anything left to do here?

Theoretically this should be fixed (eventually).

Change 989552 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/core@master] Title: Actually update page_random 10% of the time

https://gerrit.wikimedia.org/r/989552

Change 989552 merged by jenkins-bot:

[mediawiki/core@master] Title: Bump page_random updates from ~9% to 10%

https://gerrit.wikimedia.org/r/989552
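
The thread does not show the code behind the "~9% to 10%" change. One common way to end up at roughly 9% when aiming for 10% is an inclusive integer range with one value too many, since 1/11 is about 9.09%. The Python sketch below only illustrates that arithmetic; it is an assumption about the kind of off-by-one involved, not a description of the actual patch, and `hit_probability` is a hypothetical helper.

```python
from fractions import Fraction


def hit_probability(low, high):
    """Chance that a uniform integer drawn from [low, high] equals one fixed value."""
    outcomes = high - low + 1  # both bounds are inclusive
    return Fraction(1, outcomes)


eleven_outcomes = hit_probability(0, 10)  # 0..10 inclusive has 11 values
ten_outcomes = hit_probability(1, 10)     # 1..10 inclusive has 10 values

print(f"0..10 inclusive: {float(eleven_outcomes):.4f}")
print(f"1..10 inclusive: {float(ten_outcomes):.4f}")
```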