Page MenuHomePhabricator

CheckUser "get edits" is sorted incorrectly for users with many edits, which may be causing a performance issue for those queries
Closed, ResolvedPublic

Description

Hello,

A user noted on the checkuser mailing list that, when performing "get edits" for a user who has a lot of recent edits, that the results were grouped by IP address rather than simply sorted by date. This occurs when the estimated row count (which is highly inaccurate) is greater than 5000. This performs a query that is ORDER BY cuc_ip ASC, cuc_timestamp DESC. It has always been this way.

This alternate ordering frankly doesn't make any sense - the 5000 most recent edits are far more useful than the 5000 first edits sorted by IP address lexicographically. (Yes, 100.100.100.100 comes /before/ 2.0.0.0 in this sorting.) I thought it might be a performance decision - maybe someone didn't want to sort a large number of rows by timestamp - but cuc_timestamp has an index, and cuc_ip does not. From a performance standpoint, this seems to be making a slow query even slower, though perhaps there is some DBA magic that I'm not getting.

If possible, I would like this query to match the normal "get edits" case, that is, 'ORDER BY' => 'cuc_timestamp DESC',. It would also be necessary to modify userEditsRequestLimitExceeded() by removing the parts of that function which break up the list by IP address.

If that is not possible, then I suspect that it should at least be sorting by cuc_ip_hex, and not cuc_ip.

Event Timeline

I am guessing that the motivation here was that if a user has edited from 3 IP addresses, and one of them has 6000 edits and happens to be the last used IP, by fetching the last 5000 results you will get very limited knowledge about the user (the UAs associated with one IP), but by grouping the results by IP, you will get some more details (the UAs associated with other IPs). In other words, because the "get IPs" feature does not return UAs and the only way to get them is through "get edits", this may have been a hacky effort to make sure that you get a broader combination of IP-UA pairs in such scenarios.

I think we should stop doing this though. The results of "get edits" should always be the last 5000 edits (and I agree with you that it will be better from a performance perspective). Instead, the "get IPs" view should be modified to not just list each IP, but also list the UAs associated with it. I had a T170508 opened for this, though eventually I decided to close it because, at the time, the cost-benefit trade-off did not seem appealing. I think we should revisit that though.

Other potentially related tasks are T255312 and T147894. The latter, in particular, is a key task that has been hanging there for a long time and needs some love.

I happened to see this 5000 lines limitation relatively frequently, probably a few times a year.

The most frustrating case is: a user made over 5000 edits from the same IP, and used another account from this same IP. In this case there is no good way to match UAs (and it really happened, it was an unapproved bot of a known sockmaster).

In my view, the ideal behaviour would be:

  • if I am checking an IP and it has over 5000 edits, group by users and make sure all users are shown, even if some show only a part of the total number of lines
  • if I am checking an account and it has over 5000 edits, group by IP and make sure all IPs are shown, again it's fine if some show only a part of the total.

This way I would probably have the best chances of getting all UAs.

Alternatives are:

  1. just show all edits in chronological order, and no luck if old ones are the ones I need. Possibly add an option of looking at edits that are NOT most recent, e.g. 'latest month' and 'older than 1 month'
  2. add UAs to 'get IPs' for a registered user.

Thanks

Claiming, I will develop and test a patch to make get_edits get the 5000 most recent edits, and to format the results in the same way as in the normal (fewer than 5000 edits) case.

@ST47: just wanted to make sure you know that you can temporarily modify the code to reduce the limit to, say, 50, and that may make it much easier for you to test your patch?

@NickK: "if I am checking an account and it has over 5000 edits, group by IP and make sure all IPs are shown, again it's fine if some show only a part of the total." is kind of the current behavior, and is contested in this task.

Change 611028 had a related patch set uploaded (by Gerrit Patch Uploader; owner: [[mw:User:ST47]]):
[mediawiki/extensions/CheckUser@master] T257538: Consistent sorting for "get edits"

https://gerrit.wikimedia.org/r/611028

@Huji, I think it's actually easier to do this one first, since it deletes a bunch of duplicate code. I will look at T257641 as well, though.

@NickK, It isn't very easy to make sure that all IPs are shown, likely impossible in a single query. Modifying "get IPs" to also show user agents (i.e., T170508) is a better way to get what you want.

@ST47 I am okay with you submitting a single patch that does both. But I am not okay with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/611028 in its current form, where it still keeps "5000" hard-coded. If you don't want your patch to address the 5000 issue, then I am happy to submit another patch that does so, and then you can modify your patch to be based on mine. If you don't feel comfortable with rebasing your patch to depend on mine, I am happy to do that for you as well.

@Huji No problem, I do have a patch for that one as well. I have some doubts about how to handle changing the system message, but I'll ask you over in that ticket.

I don't think the gerrit patch uploader takes Depends-On into account, as I am getting a merge conflict when trying to upload the rebased version of this patch.

@ST47 I am happy to address those issues by submitting a new patch set on each of your patches. I don't mean to be too forward; I hope you are okay with me making those changes. With that assumption, please expect a new patch set on each of them within the next few hours.

I didn't see that you had edited your comment to say that you were working on a new patch set until I had already uploaded a new version of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/611399 to fix the "is_null" lint error and the message documentation error.

It is even better. I'd rather stay in my "code review" capacity and have the ability to +2 this ultimately.

@NickK: "if I am checking an account and it has over 5000 edits, group by IP and make sure all IPs are shown, again it's fine if some show only a part of the total." is kind of the current behavior, and is contested in this task.

It is very roughly kind of. It tries to sort by IP, but there is no guarantee all IPs would be shown. It looks like it shows first IPs lexicographically, which means this approach will fail if the first IP lexicographically is the one with most edits.

@ST47 I cleaned up the commit message for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/611028 a bit.

Once the other patch for T257641 is actually merged (should be in a half hour), you need to rebase 611028. Also, you need to update the documentation for the interface message checkuser-too-many in the qqq.json file. At that point we would be able to review it again.

Huji triaged this task as Medium priority.Jul 11 2020, 12:27 PM

I don't know what you mean about checkuser-too-many changing. That is used when a range check for "get edits" or "get users" returns too many results, so the special page shows a list of individual IP addresses with edits. That message was never shown when running "get edits" on a user account or a single IP address, "checkuser-limited" was.