Page MenuHomePhabricator

Drop page_random
Closed, DeclinedPublic

Description

Currently, every page gets a random number assigned to it which is stored in page table and it powers Special:Random by querying the table and ordering based on page_random.

Problems:

  • It takes a lot of space, each row takes 8 bytes for the field and eight bytes for the index, that easily translates to at least 2GB in commons for example (innodb appends page_id to index so it's more)
  • Taking a lot of space is fine (specially if it's "just" 2GB) if it's useful, but this is not really "data" and shouldn't be stored, it's literally just some random numbers. It conceptually doesn't belong there.
  • The current implementation is not really fair or harmonious and tends to favor some articles over the other. It's much worse in smaller wikis but it's even big enough to be measured in English Wikipedia. This explains it well in more depth: http://colinmorris.github.io/blog/unpopular-wiki-articles
  • It's making the schema of page table overly complex with one extra field and one extra index just for one usecase, confusing devs.

Proposed replacement when someone hits Special:Random:

  • Query max page_id, let's say it's 104000
  • Flip a virtual coin to get a random number between 0 and 1: You get 0.14626670939472441
  • By multiplying the above numbers, you get 15211.73777705134, rounded to 15212, where to start the search
  • Get the search range: 3000/log10(max page_id + 2), in this case it'll be 597.96293887481 and rounded to 598 rows
  • If search range is bigger than max page_id, just query the whole table (not the case in this example)
  • If not, then query page table with the main conditions (not redirect, in main ns, etc.) AND page_id being below and above the starting point by half of the search range (14913 < page_id < 15511). If the low point goes below zero, pick zero, if high point goes above max, leave it as is.
  • Take all of the results and pick a page_id at random from the list.
    • If no results have been returned, flip a coin again and try up to five times, if all fails, return an error to the user.

Notes:

  • 3000 above is chosen intentionally, it's multiply of 1000*log10(1000) to make sure the search range doesn't go above 1000 rows
    • With 3000, the max search size will be 1000 rows but minimum currently is at 160 rows (for commons) and it realistically can't go below 100 rows (needing 10^30 pages).
  • From looking at English Wikipedia, the ratio of articles per page_id is 1:10. So with search range of roughly ~160, you'll get around 16 articles to pick one at random. Assuming perfect randomness and using binomial distribution, the chance of hitting zero articles with ten percent article distribution and draw of 160 pages is 4.77e-08 and chance of getting it zero five times is ... basically zero. In reality it might be not that small but still negligible.

Event Timeline

  • Special:Random supports users to choose a random page in specific namespace, and if some namespaces have few page, this method will failed.
  • Some wikis (e.g. roa-rupwiki, chrwiktionary) previously has many (more than 100 thousands) main-namespace pages, most since deleted. So there will be much more page_id than main namespace pages, and the distribution of extent page_id is not uniform.

See also: T200703: Special:RandomInCategory does not return all pages with equal probability

Consider me uneducated in maths... Why do we need both a start an end range? Intuitively I would expect something like "WHERE page_id >= 15211 LIMIT N" to suffice and guaranteed to return a result, where N could be 1 if articles were never deleted and perfectly distributed, so instead we have an N like 500 and then pick a random one in PHP from the returned set.

Consider me uneducated in maths... Why do we need both a start an end range? Intuitively I would expect something like "WHERE page_id >= 15211 LIMIT N" to suffice and guaranteed to return a result, where N could be 1 if articles were never deleted and perfectly distributed, so instead we have an N like 500 and then pick a random one in PHP from the returned set.

That is to avoid unfairness towards early numbers. otherwise pages with ids such as 1 or 10 will have a unfair disadvantage.

I'll basically echo what @Bugreporter said. The problem is, you don't know anything about the density of existing pages in different parts of the page_id space. Say if the search range is 1000, and there is 1 page between page_id 1 and 1000, and 1000 pages between page_id 1,000,000 and 1,001,000. Then the page in the low range gets picked 1000 times as often as the pages in the high range. The weight is inversely proportional to the density of returnable pages in page_id space.

It's possible to select a page without this density bias, just by reducing the range to 1. Instead of picking a random page_id and querying for pages with a nearby page_id, pick 1000 random page_ids and do SELECT page_id FROM page WHERE page_id IN (... many random numbers...). Then return the page_id which was nearest to the start of the list.

Of course this is slower, and will still fail when the density is low enough.

Another approach is to permanently enumerate the pages. Have a separate table which maps a new dense ID to the page_id. Keep the dense IDs dense by either assigning newly created pages to gaps created by page deletion, or by periodically compacting the table, reassigning all the IDs. This procedure has to be repeated for all possible search criteria. So if there are 100 pages in a category, you have to enumerate them with dense IDs 1 to 100.

In my opinion, page_random is actually a decent trade-off between performance and randomness.

Fair, we could potentially address those via using t-digest to store the distribution somewhere but it'd be a lot of work. It's not worth doing.