
Special:RandomInCategory does not return all pages with equal probability
Open, Needs Triage, Public

Description

Background on the issue

RandomInCategory uses the cl_timestamp field to draw a random page. Because this field is not uniformly distributed, additional logic tries to make all entries likely to be drawn, but for small- to medium-size categories (think hundreds of pages, not tens of thousands) this does not seem to work.
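A deliberately simplified sketch of the timestamp-offset technique (hypothetical timestamps and a toy selection function, not the actual MediaWiki code) shows where the bias comes from: a page's chance of being selected is proportional to the timestamp gap before it, not 1/N.

```python
import random

# Hypothetical cl_timestamp values (in seconds): three pages added in a
# quick burst, then one page added much later. Not real data.
timestamps = [1, 2, 3, 100]

def random_in_category(ts):
    """Toy model of the offset approach: pick a uniform random offset and
    return the first entry at or after it, wrapping around if needed."""
    offset = random.uniform(0, max(ts))
    for t in ts:
        if t >= offset:
            return t
    return ts[0]

random.seed(0)
counts = {t: 0 for t in timestamps}
for _ in range(10_000):
    counts[random_in_category(timestamps)] += 1

# The page after the long gap dominates: it is drawn roughly 97% of the
# time, while a uniform draw would give each of the four pages 25%.
print(counts)
```

This is why the results look "very much non-random" in practice: pages added right after a burst of other additions are almost never drawn.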

Impacted use cases

On enwiki: The draft review process on en uses Category:Pending AfC submissions to keep track of all the drafts awaiting review, and Special:RandomInCategory to let reviewers pick the next draft to work on. The problem is, the results are very much non-random. See this Village Pump thread for my analysis of the problem.

On Commons: I (DerHexer) use RandomInCategory for some of my Commons categories of images I uploaded, bookmarked in my browser (e.g. this, even more obvious in smaller categories like this). Contrary to what I expected, I do not get random images: some appear very often, sometimes even in a similar order when I click the bookmark, and sometimes it opens only subcategories but no files. How can this happen? The function is very helpful and I would very much like to promote it on-wiki, but with it this broken I am a bit hesitant.

Event Timeline

DerHexer created this task. Jul 30 2018, 3:23 PM
Restricted Application added a subscriber: Aklapper. Jul 30 2018, 3:23 PM
DerHexer updated the task description. Jul 30 2018, 3:23 PM
Huji renamed this task from Special:RandomInCategory doesn't show random results to Special:RandomInCategory does not return all pages with equal probability. Sun, Aug 18, 7:53 PM
Huji updated the task description. Sun, Aug 18, 7:56 PM
Huji added a comment. Sun, Aug 18, 8:11 PM

Is there a reason we are using cl_timestamp and not page_random here? The code has been using cl_timestamp since 2013 at least. In comparison, Randompage and Randomrootpage both use page_random.

Huji updated the task description. Sun, Aug 18, 8:12 PM

> Is there a reason we are using cl_timestamp and not page_random here? The code has been using cl_timestamp since 2013 at least. In comparison, Randompage and Randomrootpage both use page_random.

Efficiency on large categories. Using page_random means a filesort: the DB has to load every page in the category, sort them, and then get the right page. Some categories have millions of pages in them (e.g. license cats on Commons), so that's not OK. ORDER BY RAND() would probably be equivalent to page_random here.

The original page was a bit of a compromise meant to be useful for backlog pages. It's pretty severely non-random, but was hoped to be random enough for backlog categories.

Some potential options here, which may or may not be practical:

  • Add a new field (or a separate table) with cl_random. Unclear if it is worth it given the obscurity of the feature.
  • Come up with a better sampling algorithm that still operates in constant time without changing the data structure (probably impossible to make "good", but it could probably be less terrible).
  • If the category is small (based on the category table), say < 200 entries, ORDER BY rand().
  • Use CirrusSearch to generate the results.

I'm not very familiar with CirrusSearch, but I suspect that would be the most likely direction to successfully go in.

Bawolff added a comment (edited). Sun, Aug 18, 10:10 PM

P.S. For historical background see T27931, where the suggestion to use the search backend also came up (probably more practical now that we moved from Lucene to Cirrus).

T67366 was about moving all of Special:Random to Cirrus. It was reverted due to load issues. However, RandomInCategory is probably less of an issue there, as it gets far fewer hits than the main random page and sorts through far fewer articles. But I'm not really familiar with how Elastic scales, so I don't really know.

Restricted Application added a project: Discovery-Search. Sun, Aug 18, 10:11 PM

Forgive me for what may be a naive question, since I'm not familiar with the codebase, but regarding:

> Come up with a better sampling algorithm that still operates in constant time

how can this be constant time? At best, aren't you walking an index tree, so log(n)?

Huji added a comment. Mon, Aug 19, 12:07 AM

> Is there a reason we are using cl_timestamp and not page_random here? The code has been using cl_timestamp since 2013 at least. In comparison, Randompage and Randomrootpage both use page_random.

> Efficiency on large categories. Using page_random means a filesort: the DB has to load every page in the category, sort them, and then get the right page. Some categories have millions of pages in them (e.g. license cats on Commons), so that's not OK. ORDER BY RAND() would probably be equivalent to page_random here.

@Bawolff How about using the existing approach for large categories, and one based on page_random for smaller categories? Determining category size has to be efficient, as we are doing it every time a category page is shown (for pagination purposes).

> Forgive me for what may be a naive question, since I'm not familiar with the codebase, but regarding:
>
> > Come up with a better sampling algorithm that still operates in constant time
>
> how can this be constant time? At best, aren't you walking an index tree, so log(n)?

Well, technically nothing is constant time (because there is a DB query behind everything, and that query likely uses an index which would be log(n) at best). However, assuming the DB has a way to sort things efficiently already, then I believe specifying an offset to retrieve a particular row in that data set would be an O(1) operation. And offsets are exactly what we use here.

Yes, I was speaking a bit informally. I meant a constant number of queries, each reading one-ish rows, which would technically be O(log n) due to the logarithmic cost of looking up an arbitrary item in a B-tree.

Bawolff added a comment (edited). Mon, Aug 19, 2:42 AM

> @Bawolff How about using the existing approach for large categories, and one based on page_random for smaller categories? Determining category size has to be efficient, as we are doing it every time a category page is shown (for pagination purposes).

I think that might be acceptable (that's mostly what I meant by the third option). It would need the approval of a DBA, I believe. I have no idea where the cut-off should be; 5000 might be a reasonable number, as that is usually the largest list we allow to be shown on one page. Note: the category article counts (from the category table) are sometimes inaccurate for categories with > 200 entries.

If that proves to be acceptable, for larger categories it might be possible to combine the methods: use the existing heuristic to pick a starting point, and then randomly select one of the 5000 articles after that starting point. E.g. something along the lines of:

SELECT c.cl_from
FROM (SELECT cl_from FROM categorylinks WHERE cl_to = "CAT NAME HERE" AND cl_timestamp > "RANDOM OFFSET HERE" ORDER BY cl_timestamp LIMIT 5000) c
ORDER BY rand() LIMIT 1;

Huji added a project: DBA. Mon, Aug 19, 2:29 PM

I agree that a DBA review would be necessary.

It would be helpful if you could come up with some specific example queries we could run on production, check their query plans, etc. That would help us understand whether they are viable or not :-)

Marostegui moved this task from Triage to In progress on the DBA board. Mon, Aug 19, 3:22 PM
Marostegui moved this task from In progress to Blocked external/Not db team on the DBA board.
Bawolff added a comment (edited). Tue, Aug 20, 1:57 AM

> It would be helpful if you could come up with some specific example queries we could run on production, check their query plans, etc. That would help us understand whether they are viable or not :-)

So the plan would be: first do SELECT cat_pages FROM category WHERE cat_name = "Foo" and use one of two different methods depending on the size of the category. Currently I'm assuming the cut-off would be around 5000 pages, but we need advice on where an appropriate cut-off would be.

If the category is small (say Category:Articles_with_unsourced_statements_from_December_2017, which is the worst case on enwiki), we would do the following filesorting query:

SELECT cl_from 
FROM categorylinks
WHERE cl_to = "Articles_with_unsourced_statements_from_December_2017" and cl_type <> 'subcat'
ORDER BY rand() LIMIT 1;

It has the following explain (Note: this is run from toolforge replicas, and may not represent production):

id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	SIMPLE 	categorylinks 	ref 	cl_timestamp, cl_sortkey 	cl_sortkey 	257 	const 	9482 	Using where; Using index; Using temporary; Using filesort

And the following handler status change (adjusted to subtract initial values):

+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 5007  |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 1     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 5008  |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 5007  |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+

For the large category case (Category:Noindexed_pages):

SELECT c.cl_from
FROM  (SELECT cl_from FROM categorylinks where cl_timestamp > '2013-07-03 4:15:21' and cl_to = "Noindexed_pages" and cl_type <> 'subcat' order by cl_timestamp limit 5000) c
ORDER BY rand() limit 1;

Which has explain of:

id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	PRIMARY 	 	ALL 	 	 	 	 	5000 	Using temporary; Using filesort
2 	DERIVED 	categorylinks 	range 	cl_timestamp, cl_sortkey 	cl_timestamp 	261 	 	3715998 	Using index condition; Using where; Using filesort

[Which honestly is kind of confusing to me. The ICP seems to be for the cl_type condition, so that would have to use the cl_sortkey index. Can ICP apply to indexes other than the chosen index? If so, I did not know that. Additionally, it's claiming the subquery filesorts for some reason, but it doesn't seem to filesort when I actually run it, per the handler stats.]

MariaDB [enwiki_p]> SHOW STATUS like "Hand%";
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 5020  |
| Handler_icp_match          | 5020  |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 5018  |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 1     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 10002 |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 10000 |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+

Some other notes: the category table is sometimes inaccurate for categories with > 200 entries.

I didn't do the join to the page table to get the actual page name, as the handler stats seem to indicate it would do that for every row instead of just the final one, which is the only one we need. So I thought it'd be better to do that in a separate query.

This generally assumes there aren't any categories containing something like a million subcategories. Generally true, but there are edge cases, like Category:Commons_category_link_is_on_Wikidata. The subcat filter is probably not critical, or we could apply it after the subquery with the 5000 limit, if that's an issue.

The other main thing is that it's unclear what the cut-off should be between large and small categories (and what the LIMIT should be for the subquery in the big-category case). I currently have 5000. I think a number somewhere between 200 and 5000 would make sense, but I'm not sure what the appropriate number would be.
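Putting the pieces of the plan above together, here is a rough sketch of the branching logic in Python against an in-memory SQLite fixture. The table names follow the MediaWiki schema, but the cut-off value, the reduced column set, and SQLite's RANDOM() (standing in for MariaDB's rand()) are all assumptions for illustration, not production code.

```python
import sqlite3

CUTOFF = 5000  # assumed cut-off between strategies; the real value needs DBA input

def random_page_in_category(conn, cat, random_offset):
    """Sketch of the proposed branching: small categories get a true random
    sort; large ones sample a bounded window after a random timestamp offset."""
    (size,) = conn.execute(
        "SELECT cat_pages FROM category WHERE cat_name = ?", (cat,)
    ).fetchone()
    if size <= CUTOFF:
        # Small category: a filesort over a few thousand rows is cheap.
        sql = """SELECT cl_from FROM categorylinks
                 WHERE cl_to = ? AND cl_type <> 'subcat'
                 ORDER BY RANDOM() LIMIT 1"""
        return conn.execute(sql, (cat,)).fetchone()[0]
    # Large category: the inner LIMIT bounds the scan to ~5000 rows.
    sql = """SELECT c.cl_from FROM
                 (SELECT cl_from FROM categorylinks
                  WHERE cl_to = ? AND cl_timestamp > ? AND cl_type <> 'subcat'
                  ORDER BY cl_timestamp LIMIT 5000) c
             ORDER BY RANDOM() LIMIT 1"""
    return conn.execute(sql, (cat, random_offset)).fetchone()[0]

# Tiny in-memory fixture with only the columns the queries touch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (cat_name TEXT PRIMARY KEY, cat_pages INTEGER);
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT,
                                cl_type TEXT, cl_timestamp TEXT);
""")
conn.execute("INSERT INTO category VALUES ('Small_cat', 3)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, 'Small_cat', 'page', ?)",
    [(1, "2018-01-01"), (2, "2018-01-02"), (3, "2018-01-03")],
)
conn.execute("INSERT INTO category VALUES ('Big_cat', 999999)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, 'Big_cat', 'page', ?)",
    [(i, f"2018-01-{i:02d}") for i in range(10, 20)],
)

page = random_page_in_category(conn, "Small_cat", "2017-01-01")
big_page = random_page_in_category(conn, "Big_cat", "2017-01-01")
```

Note that the branch is chosen from cat_pages, so the inaccuracy of category counts above 200 entries mentioned earlier only shifts which strategy is picked near the cut-off; it doesn't break either query.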

Thanks.

Thanks for providing the examples!
I have tested your examples in the worst possible scenario: old hardware with spinning disks, a cold buffer pool, the categorylinks table not optimized (which is a realistic situation), and on enwiki.
Also, keep in mind that the EXPLAIN output can differ from the actual query plan the optimizer chooses.

From the tests I have done:

SELECT cl_from 
FROM categorylinks
WHERE cl_to = "Articles_with_unsourced_statements_from_December_2017" and cl_type <> 'subcat'
ORDER BY rand();

This query seems to be using cl_sortkey and seems to be pretty fast, although it does a full scan, so the LIMIT doesn't really matter in this case: the table will be scanned and _then_ the results will be returned:

root@db2055.codfw.wmnet[enwiki]> FLUSH STATUS; pager cat > /dev/null; SELECT cl_from FROM categorylinks WHERE cl_to = "Articles_with_unsourced_statements_from_December_2017" and cl_type <> 'subcat' ORDER BY rand();  nopager; SHOW STATUS like 'Hand%';
Query OK, 0 rows affected (0.03 sec)

PAGER set to 'cat > /dev/null'
5006 rows in set (0.10 sec)

PAGER set to stdout
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 5006  |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 5006  |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 5007  |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 5006  |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
26 rows in set (0.04 sec)

root@db2055.codfw.wmnet[enwiki]> FLUSH STATUS; pager cat > /dev/null; SELECT cl_from FROM categorylinks WHERE cl_to = "Articles_with_unsourced_statements_from_December_2017" and cl_type <> 'subcat' ORDER BY rand() limit 1000;  nopager; SHOW STATUS like 'Hand%';
Query OK, 0 rows affected (0.03 sec)

PAGER set to 'cat > /dev/null'
1000 rows in set (0.04 sec)

PAGER set to stdout
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 5006  |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 1000  |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 5007  |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 5006  |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
26 rows in set (0.03 sec)

So the query retrieves 5006 rows either way, meaning the LIMIT doesn't really make any difference in this case.
The query itself is pretty fast, so it is not a super big deal as the category is pretty small. With LIMIT 1000 it takes 0.04s and with LIMIT 5000 it takes 0.10s, which is more than double the time, so if we extrapolate, we should probably stick to LIMIT 1000 with this particular query.

If we move to the second query which is bigger, we see a similar behaviour:

SELECT c.cl_from
FROM  (SELECT cl_from FROM categorylinks where cl_timestamp > '2013-07-03 4:15:21' and cl_to = "Noindexed_pages" and cl_type <> 'subcat' order by cl_timestamp limit 5000) c
ORDER BY rand() limit 1;

This second query seems to be correctly choosing the cl_timestamp index, and the ORDER BY plus LIMIT in the subquery avoids the full scan, which otherwise happens:

root@db2055.codfw.wmnet[enwiki]> FLUSH STATUS; pager cat > /dev/null; SELECT c.cl_from FROM  (SELECT cl_from FROM categorylinks where cl_timestamp > '2013-07-03 4:15:21' and cl_to = "Noindexed_pages" and cl_type <> 'subcat' order by cl_timestamp) c ORDER BY rand();  nopager; SHOW STATUS like 'Hand%';
Query OK, 0 rows affected (0.03 sec)

PAGER set to 'cat > /dev/null'

^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100): Query execution was interrupted
PAGER set to stdout
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 0      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 0      |
| Handler_icp_attempts       | 283132 |
| Handler_icp_match          | 283132 |
| Handler_mrr_init           | 1      |
| Handler_mrr_key_refills    | 0      |
| Handler_mrr_rowid_refills  | 281    |
| Handler_prepare            | 0      |
| Handler_read_first         | 0      |
| Handler_read_key           | 1      |
| Handler_read_last          | 0      |
| Handler_read_next          | 283127 |
| Handler_read_prev          | 0      |
| Handler_read_retry         | 0      |
| Handler_read_rnd           | 282972 |
| Handler_read_rnd_deleted   | 0      |
| Handler_read_rnd_next      | 0      |
| Handler_rollback           | 1      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_tmp_update         | 0      |
| Handler_tmp_write          | 267978 |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+

A LIMIT 1000 for this query results in 0.05s, and a LIMIT 5000 results in 0.13s, which is, again, fast, but if we extrapolate it is almost 3 times as much, so maybe also stick to LIMIT 1000 for this one too.
Would that make sense from your side?

Thank you!

A limit of 1000 should work fine for this use case.

Cool!
I will remove the DBA tag for now, but will remain subscribed in case I need to help further!