Currently, every page gets a random number assigned to it which is stored in page table and it powers Special:Random by querying the table and ordering based on page_random.
Problems:
- It takes a lot of space, each row takes 8 bytes for the field and eight bytes for the index, that easily translates to at least 2GB in commons for example (innodb appends page_id to index so it's more)
- Taking a lot of space is fine (specially if it's "just" 2GB) if it's useful, but this is not really "data" and shouldn't be stored, it's literally just some random numbers. It conceptually doesn't belong there.
- The current implementation is not really fair or harmonious and tends to favor some articles over the other. It's much worse in smaller wikis but it's even big enough to be measured in English Wikipedia. This explains it well in more depth: http://colinmorris.github.io/blog/unpopular-wiki-articles
- It's making the schema of page table overly complex with one extra field and one extra index just for one usecase, confusing devs.
Proposed replacement when someone hits Special:Random:
- Query max page_id, let's say it's 104000
- Flip a virtual coin to get a random number between 0 and 1: You get 0.14626670939472441
- By multiplying the above numbers, you get 15211.73777705134, rounded to 15212, where to start the search
- Get the search range: 3000/log10(max page_id + 2), in this case it'll be 597.96293887481 and rounded to 598 rows
- If search range is bigger than max page_id, just query the whole table (not the case in this example)
- If not, then query page table with the main conditions (not redirect, in main ns, etc.) AND page_id being below and above the starting point by half of the search range (14913 < page_id < 15511). If the low point goes below zero, pick zero, if high point goes above max, leave it as is.
- Take all of the results and pick a page_id at random from the list.
- If no results have been returned, flip a coin again and try up to five times, if all fails, return an error to the user.
Notes:
- 3000 above is chosen intentionally, it's multiply of 1000*log10(1000) to make sure the search range doesn't go above 1000 rows
- With 3000, the max search size will be 1000 rows but minimum currently is at 160 rows (for commons) and it realistically can't go below 100 rows (needing 10^30 pages).
- From looking at English Wikipedia, the ratio of articles per page_id is 1:10. So with search range of roughly ~160, you'll get around 16 articles to pick one at random. Assuming perfect randomness and using binomial distribution, the chance of hitting zero articles with ten percent article distribution and draw of 160 pages is 4.77e-08 and chance of getting it zero five times is ... basically zero. In reality it might be not that small but still negligible.