Page MenuHomePhabricator

CheckUser pagers paginate on non-unique column, causing missing results
Open, Needs TriagePublic

Description

Background

This applies to CheckUserGetActionsPager, CheckUserGetUsersPager, CompareService, and TimelineService, but not CheckUserGetIPsPager since pagination is turned off pending T315612.

The above classes create queries that paginate on timestamp, which isn't necessarily unique. This causes results to go missing if the page break is between two entries with the same timestamp.

Once temporary accounts are enabled, it will become quite common for two entries to have the same timestamp: the first edit and the temporary account creation resulting from that edit. It therefore seems worth fixing this soon.

What needs doing

The pagers should paginate on the combination of (timestamp, id). This should be possible since all the tables (cuc, cule, cupe) have an index on the timestamp field and all of our indexes implicitly index by the named columns plus the primary key.

Event Timeline

My concern is that the query may be slower. I have a vague memory of removing cuc_id from the paging columns because in some situations it was making the query much slower (but can't find where this is, so I'm probably wrong with this).

Applying this change won't be as easy as changing ::getIndexField to return an array. We would need to allow callers to ::getIndexField to support more than one field as the code in AbstractCheckUserPager::reallyDoQuery assumes that the method will return a string.

We could probably avoid most of the complication if we sort by IDs after the results are combined, but this would necessarily always display events from cu_private_event over those in cu_changes (as cuc_id is always likely to be bigger).

Dreamy_Jazz moved this task from CheckUser to General / Unsorted on the CheckUser board.

This also applies to the CheckUser API AFAIK.

Testing example queries for the TimelinePager where there are at least two account targets and one account has made loads of edits, the query gets unacceptably slow. For example:

SELECT comment_text,comment_data FROM ((SELECT cuc_namespace AS `namespace`,cuc_title AS `title`,cuc_actiontext AS `actiontext`,cuc_timestamp AS `timestamp`,cuc_minor AS `minor`,cuc_page_id AS `page_id`,cuc_type AS `type`,cuc_this_oldid AS `this_oldid`,cuc_last_oldid AS `last_oldid`,cuc_ip AS `ip`,cuc_xff AS `xff`,cuc_agent AS `agent`,cuc_id AS `id`,cuc_user_actor.actor_user AS `user`,cuc_user_actor.actor_name AS `user_text`,comment_text,comment_data FROM `cu_changes` FORCE INDEX (cuc_ip_hex_time) JOIN `actor` `cuc_user_actor` ON ((cuc_user_actor.actor_id=cuc_actor)) JOIN `comment` `comment_cuc_comment` ON ((comment_cuc_comment.comment_id=cuc_comment_id)) WHERE (cuc_ip_hex = 'X') ORDER BY cuc_timestamp DESC,cuc_id DESC LIMIT 501 ) UNION (SELECT cuc_namespace AS `namespace`,cuc_title AS `title`,cuc_actiontext AS `actiontext`,cuc_timestamp AS `timestamp`,cuc_minor AS `minor`,cuc_page_id AS `page_id`,cuc_type AS `type`,cuc_this_oldid AS `this_oldid`,cuc_last_oldid AS `last_oldid`,cuc_ip AS `ip`,cuc_xff AS `xff`,cuc_agent AS `agent`,cuc_id AS `id`,cuc_user_actor.actor_user AS `user`,cuc_user_actor.actor_name AS `user_text`,comment_text,comment_data FROM `cu_changes` FORCE INDEX (cuc_actor_ip_time) JOIN `actor` `cuc_user_actor` ON ((cuc_user_actor.actor_id=cuc_actor)) JOIN `comment` `comment_cuc_comment` ON ((comment_cuc_comment.comment_id=cuc_comment_id)) WHERE (actor_user = X OR actor_user = X) ORDER BY cuc_timestamp DESC,cuc_id DESC LIMIT 501 )) `a` ORDER BY timestamp DESC,id DESC LIMIT 501;

....

501 rows in set (33.114 sec)

The query runs in around 9 secs when removing cuc_id DESC. Removing the FORCE INDEX statement did nothing as the query optimiser still chose the index that was forced. This suggests that the forced index needs to also include the ID column.