Page MenuHomePhabricator

Lower the limit for rows returned in Special:Investigate compare check
Closed, ResolvedPublic2 Estimated Story Points

Description

The query for the compare check performs a subquery for each target and aggregates the resulting rows. The subqueries are limited so that no more than 100k rows will be aggregated.

In {T248588}, we found that the query was slower than expected in extreme circumstances (large cu_changes table, checking prolific users/IPs).

Lowering the limit to 25,000 rows brought the time taken into a more acceptable range, and would still be an improvement on the existing CheckUser tool, which has a limit of 5000.

We should control the limit with a config so that we can test different limits and so that we can use different limits on different wikis. 25,000 would probably be a good default to start with.

What we should do:

  • Remove the $limit parameter from CompareService constructor
  • Instead use a new config
  • Set the config to 25000 by default in extension.json

Event Timeline

Niharika moved this task from Untriaged to Cards ready to be discussed on the Anti-Harassment board.
Niharika raised the priority of this task from Low to Medium.May 13 2020, 3:58 PM

Todo: Change this task to make this limit into a config and lower it.

ARamirez_WMF set the point value for this task to 2.May 13 2020, 4:28 PM

Todo: Change this task to make this limit into a config and lower it.

Done.

Huji added a subscriber: Huji.

@Tchanders and @Niharika let's use T257641 to address the "config var" part. This task can then focus on what the number should be for Special:Investigate.

@Tchanders and @Niharika let's use T257641 to address the "config var" part. This task can then focus on what the number should be for Special:Investigate.

Cool, thanks for filing it.

Niharika lowered the priority of this task from Medium to Low.Aug 24 2020, 10:31 PM

@Huji Thanks for pointing out T257641.

We need a separate config for the "IPs & User agents" tab in Special:Investigate, since it will need to be higher. This is because the query is done differently: subqueries fetch the relevant rows from the table, which are then grouped by the username, IP and user agent by the main query. Since this grouping is inefficient, we need to limit the number of rows that we group - this is the limit that we'd like to lower from 100k to 25k. The total number of rows after the grouping will usually be much lower than this limit. (It is these grouped rows that are returned to the pager, so no more than the pager limit will actually be returned in one go).

If all of this sounds inefficient it's an unfortunate side-effect of maintaining both Special:CheckUser and Special:Investigate. The database tables are only designed to be queried efficiently from Special:CheckUser.

Change 651837 had a related patch set uploaded (by Tchanders; owner: Tchanders):
[mediawiki/extensions/CheckUser@master] Lower the limit for rows grouped in CompareService query

https://gerrit.wikimedia.org/r/651837

Change 651837 merged by jenkins-bot:
[mediawiki/extensions/CheckUser@master] Lower the limit for rows grouped in CompareService query

https://gerrit.wikimedia.org/r/651837