Page MenuHomePhabricator

Temporary account IP reveal: CheckUserTemporaryAccountsByIPLookup::getTempAccountsFromIPAddress uses a slow query
Open, Needs TriagePublic

Description

Summary

The query to cu_changes in CheckUserTemporaryAccountsByIPLookup::getTempAccountsFromIPAddress isn't efficient and can trigger TransactionProfiler warnings. We should look at how this query performance can be improved, as it is called when viewing Special:Contributions and Special:IPContributions

Background

  • Logstash is reporting that the query performed by CheckUserTemporaryAccountsByIPLookup::getTempAccountsFromIPAddress isn't efficient, as it has triggered TransactionProfiler warnings
  • Pasting that query into production English Wikipedia and running an EXPLAIN produces the following
    • The "Using filesort" suggests that the query isn't efficient enough
describe SELECT  actor_name,MAX(cuc_timestamp) AS `timestamp`  FROM `cu_changes` JOIN `actor` ON ((actor_id=cuc_actor))   WHERE (actor_name LIKE '~2%' ESCAPE '`') AND ((cuc_ip_hex >= 'v6-X' AND cuc_ip_hex <= 'v6-X'))  GROUP BY actor_name ORDER BY timestamp DESC LIMIT 100;
+------+-------------+------------+--------+---------------------------------------------------------+-----------------+---------+-----------------------------+-------+--------------------------------------------------------+
| id   | select_type | table      | type   | possible_keys                                           | key             | key_len | ref                         | rows  | Extra                                                  |
+------+-------------+------------+--------+---------------------------------------------------------+-----------------+---------+-----------------------------+-------+--------------------------------------------------------+
|    1 | SIMPLE      | cu_changes | range  | cuc_ip_hex_time,cuc_actor_ip_time,cuc_actor_ip_hex_time | cuc_ip_hex_time | 258     | NULL                        | 74040 | Using index condition; Using temporary; Using filesort |
|    1 | SIMPLE      | actor      | eq_ref | PRIMARY,actor_name                                      | PRIMARY         | 8       | enwiki.cu_changes.cuc_actor | 1     | Using where                                            |
+------+-------------+------------+--------+---------------------------------------------------------+-----------------+---------+-----------------------------+-------+--------------------------------------------------------+
  • Actually running the query on production showed that the query took 2.2 seconds to run
    • Repeating the query again makes it run in 0.2 seconds, which suggests that the replica MariaDB instance is storing some kind of data that caches the expensive part of the operation
  • However, I could not get the query to run slowly on the special page on production English Wikipedia
    • Given the above point, it's possible the replica DBs are storing some kind of data to make the query faster over multiple calls

Acceptance criteria

  • The query to cu_changes is more efficient, so that TransctionProfiler warnings are not created and ideally that the EXPLAIN does not indicate that filesorts are being used

Event Timeline

I'm not quite sure where the number of rows is coming from. Some related numbers:

  • There are about 35000 rows in cu_changes on that range
  • There are about 2500 temp account rows in cu_changes on that range

Could it refer to the fact that it's having to sort multiple rows belonging to the same actor to find the maximum timestamp, then sort those maximum timestamps in descending order?

I'd guess this is happening:

  • Use the cuc_ip_hex_time index to find the rows from that IP range (fast)
  • Pick out the rows that fit the temp user condition
  • Sort the rows for each temp user by time, to find the max timestamp for each, then sort those in descending order (slow)

What we could do

The tables aren't really designed to query by user and IP and sort by timestamp - hence it being awkward to make query.

I had a look at what Special:CheckUser does when looking up users by IP (range) and it just fetches all the matching rows up to the limit, then post processes them into individual user info in PHP - so you never find data beyond the limit. If that is acceptable for Special:CheckUser, would it be acceptable here?

I had a look at what Special:CheckUser does when looking up users by IP (range) and it just fetches all the matching rows up to the limit, then post processes them into individual user info in PHP - so you never find data beyond the limit. If that is acceptable for Special:CheckUser, would it be acceptable here?

Probably. However:

  • Special:CheckUser there are paging links which makes it possible to see there are more results to fetch
    • This means that a user can see that there is more data, so there isn't this context lost
    • When they page to the next page it selects the data that was beyond the end of the limit (so the data never gets truncated in a way which the CheckUser wouldn't be then able to see)
  • For this count, there would be no paging links
    • Therefore, any truncated data would be impossible to see and there would be no indication that the data was truncated
      • Maybe this can be solved by adding an indicator that the total count is not accurate?

Some other approaches to consider:

  • We could use ::estimateRowCount to get a count which is more efficent? That is used by other CheckUser interfaces to first get a rough number and then make it more accurate if the rough number isn't too high
  • We could see about making the SQL query count the users for us, instead of having it return the data to then be simply counted in PHP? I could see the optimiser finding a better query plan if it knows it doesn't need to return the actual names of the users / associated timestamps

We could also narrow the maximum range for this type of query - for example only allow single IPs (but treat IPv6 as /64).

Temp accounts don't generally seem to have a very high number of edits:

image.png (566×817 px, 46 KB)

We could also narrow the maximum range for this type of query - for example only allow single IPs (but treat IPv6 as /64).

Sure, though AFAICS the problematic query mentioned in the task description was for a /64 IPv6 range

We could also narrow the maximum range for this type of query - for example only allow single IPs (but treat IPv6 as /64).

Sure, though AFAICS the problematic query mentioned in the task description was for a /64 IPv6 range

The example on logstash looks like a /43

We could also narrow the maximum range for this type of query - for example only allow single IPs (but treat IPv6 as /64).

Sure, though AFAICS the problematic query mentioned in the task description was for a /64 IPv6 range

The example on logstash looks like a /43

Ah, I had missed that. I looked at the referrer field it looks like and assumed that it was the URL (which is a /64)

I've found a /32 on English Wikipedia which has the query take 30 seconds to run when I try it and caused a request timeout exception for a same IP range for another user as evidenced here:

image.png (289×996 px, 54 KB)

cc @Tchanders

Change #1264561 had a related patch set uploaded (by Mpostoronca; author: Mpostoronca):

[mediawiki/extensions/CheckUser@master] CheckUserTemporaryAccountsByIPLookup: Use two-step query to avoid filesort

https://gerrit.wikimedia.org/r/1264561

Change #1264562 had a related patch set uploaded (by Mpostoronca; author: Mpostoronca):

[mediawiki/extensions/CheckUser@master] CheckUserTemporaryAccountsByIPLookup: Reject IPv6 ranges wider than /64

https://gerrit.wikimedia.org/r/1264562

Change #1264562 abandoned by Mpostoronca:

[mediawiki/extensions/CheckUser@master] CheckUserTemporaryAccountsByIPLookup: Reject IPv6 ranges wider than /64

https://gerrit.wikimedia.org/r/1264562

Change #1269043 had a related patch set uploaded (by Mpostoronca; author: Mpostoronca):

[mediawiki/extensions/CheckUser@master] CheckUserTemporaryAccountsByIPLookup: Use two-step query to avoid filesort

https://gerrit.wikimedia.org/r/1269043

Change #1269043 abandoned by Mpostoronca:

[mediawiki/extensions/CheckUser@master] CheckUserTemporaryAccountsByIPLookup: Use two-step query to avoid filesort

https://gerrit.wikimedia.org/r/1269043