
Provide a strategy for testing the performance of queries needed to show the list of user-agents for each IP
Closed, Resolved · Public

Description

In this task, I will provide a set of queries to identify some of the "worst case scenario" examples for the queries introduced in the parent task, so that our DBA team can estimate the performance of the query in those extreme cases. Hopefully, the outcome of this task is either that the queries run efficiently enough already, or a specific recommendation for a new index is made.

Some of the new queries use GROUP BY cuc_agent or have a WHERE clause that uses cuc_agent. This further signifies the need for T147894: Create index for agent_id columns in the CheckUser result tables to be addressed. That is, if the queries below show that an index is not needed, then we can possibly close T147894 too (note that the patch associated with that task has already been turned down).
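
For reference, a minimal sketch of the kind of index T147894 asks for is below; the index name, column choice, and any prefix length are assumptions and would be up to the DBAs:

-- Sketch only: the index name and columns are assumptions, not a reviewed schema change
ALTER TABLE cu_changes ADD INDEX cuc_agent_time (cuc_agent, cuc_timestamp);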

Event Timeline

Huji triaged this task as High priority. Dec 16 2018, 9:04 PM
Huji created this task.
Step 1: Distinct UAs used by an IP

Identify a user-IP combination that has many user-agent values associated with it:

SELECT
  cuc_user,
  cuc_ip,
  COUNT(DISTINCT cuc_agent) AS cnt
FROM cu_changes
GROUP BY
  cuc_user,
  cuc_ip
ORDER BY cnt DESC;

Let's say the top row has cuc_user = 2 and cuc_ip = 22.33.55.66. Now check the performance of the following query, which closely resembles the actual query that would be run by the new code:

SELECT
  cuc_agent,
  COUNT(*) AS cnt
FROM cu_changes
USE INDEX (cuc_user_ip_time) -- also try without this hint and assess the difference
WHERE
  cuc_user = 2 -- replace with actual user ID
  AND cuc_ip = '22.33.55.66' -- replace with actual IP address
  AND cuc_timestamp > '20180101000000'
  AND cuc_timestamp < '20190101000000'
GROUP BY cuc_agent
ORDER BY cuc_agent
LIMIT 5001;
Step 2: Estimated edits by a UA

Identify a user-agent that has many edits across different IPs, and pick the most frequent of those IPs:

SELECT
  cuc_agent,
  cuc_ip_hex,
  COUNT(*) AS cnt
FROM cu_changes
GROUP BY
  cuc_agent,
  cuc_ip_hex
ORDER BY cnt DESC;

Let's say the top row has cuc_agent = SOMEUSERAGENT and cuc_ip_hex = C0A8000A. Now check the performance of the following two queries, which closely resemble the actual queries that would be run by the new code:

SELECT COUNT(*)
FROM cu_changes
WHERE
  cuc_agent = 'SOMEUSERAGENT' -- replace with actual user-agent value
  AND cuc_timestamp > '20180101000000'
  AND cuc_timestamp < '20190101000000';

SELECT COUNT(*)
FROM cu_changes
WHERE
  cuc_agent = 'SOMEUSERAGENT' -- replace with actual user-agent value
  AND cuc_ip_hex = 'C0A8000A' -- replace with actual IP HEX value
  AND cuc_timestamp > '20180101000000'
  AND cuc_timestamp < '20190101000000';

In all of the queries above, the timestamps can be modified to even more closely resemble the actual query (typically, the larger and smaller timestamps would be "now" and "three months ago", respectively).
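
As a sketch of how that rolling three-month window could be expressed directly in the test queries (assuming MariaDB's NOW() and DATE_FORMAT(); the hard-coded timestamps above work just as well):

SELECT
  cuc_agent,
  COUNT(*) AS cnt
FROM cu_changes
WHERE
  cuc_user = 2               -- replace with actual user ID
  AND cuc_ip = '22.33.55.66' -- replace with actual IP address
  -- compute the bounds in the cu_changes timestamp format (YYYYMMDDHHMMSS)
  AND cuc_timestamp > DATE_FORMAT(NOW() - INTERVAL 3 MONTH, '%Y%m%d%H%i%s')
  AND cuc_timestamp < DATE_FORMAT(NOW(), '%Y%m%d%H%i%s')
GROUP BY cuc_agent
ORDER BY cuc_agent
LIMIT 5001;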

Huji updated the task description.

@jcrespo I created this while you were away. Just making sure it has not gone unnoticed, and also making sure this is what you would want from me in order to complete the DBA review for CheckUser-related sign-offs.

root@db1106.eqiad.wmnet[enwiki]> EXPLAIN SELECT   cuc_agent,   COUNT(*) AS cnt FROM cu_changes USE INDEX (cuc_user_ip_time)  WHERE   cuc_user = 2    AND cuc_ip = '22.33.55.66'    AND cuc_timestamp > '20180101000000'   AND cuc_timestamp < '20190101000000' GROUP BY cuc_agent ORDER BY cuc_agent LIMIT 5001\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cu_changes
         type: range
possible_keys: cuc_user_ip_time
          key: cuc_user_ip_time
      key_len: 278
          ref: NULL
         rows: 1
        Extra: Using index condition; Using temporary; Using filesort
1 row in set (0.00 sec)

root@db1106.eqiad.wmnet[enwiki]> flush status; pager cat > /dev/null; SELECT   cuc_agent,   COUNT(*) AS cnt FROM cu_changes USE INDEX (cuc_user_ip_time)  WHERE   cuc_user = 2    AND cuc_ip = '22.33.55.66'    AND cuc_timestamp > '20180101000000'   AND cuc_timestamp < '20190101000000' GROUP BY cuc_agent ORDER BY cuc_agent LIMIT 5001; nopager; show status like 'Hand%';
Query OK, 0 rows affected (0.00 sec)

PAGER set to 'cat > /dev/null'
Empty set (0.00 sec)

PAGER set to stdout
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 1     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
26 rows in set (0.17 sec)

But obviously this is not a good example; I am running:

select cuc_agent, cuc_ip, count(*) FROM cu_changes GROUP BY cuc_agent, cuc_ip ORDER by count(*) DESC limit 10;

To get the pathological one. This is not a fast query, as it will have to read the whole 12TB of enwiki (and I guess wikidatawiki will be worse).

To confirm: is the cu_changes table 12TB for enwiki?

Sorry, 12GB :-) We should only get tables so big (>5TB) on the external store servers (content) :-)

Please give me some time; the analytics I need to gather will not be fast, but I am on it.

I had a minor heart attack ;) *jk*

I will remain patient. If there is anything else I can do to help, please let me know.

So, first of all, the USE INDEX hint is not necessary; the right index is chosen without it.

The plan for the first query is:

           id: 1
  select_type: SIMPLE
        table: cu_changes
         type: range
possible_keys: cuc_timestamp,cuc_user_time,cuc_user_ip_time
          key: cuc_user_ip_time
      key_len: 278
          ref: NULL
         rows: <high number>
        Extra: Using index condition; Using temporary; Using filesort
1 row in set (0.00 sec)

+----------------------------+-----------------------------------------+
| Variable_name              | Value                                   |
+----------------------------+-----------------------------------------+
| Handler_commit             | 1                                       |
| Handler_delete             | 0                                       |
| Handler_discover           | 0                                       |
| Handler_external_lock      | 0                                       |
| Handler_icp_attempts       | <high number>                           |
| Handler_icp_match          | <high number>                           |
| Handler_mrr_init           | 0                                       |
| Handler_mrr_key_refills    | 0                                       |
| Handler_mrr_rowid_refills  | 0                                       |
| Handler_prepare            | 0                                       |
| Handler_read_first         | 0                                       |
| Handler_read_key           | <high number>                           |
| Handler_read_last          | 0                                       |
| Handler_read_next          | <high number>                           |
| Handler_read_prev          | 0                                       |
| Handler_read_retry         | 0                                       |
| Handler_read_rnd           | <number equal to the rows returned>     |
| Handler_read_rnd_deleted   | 0                                       |
| Handler_read_rnd_next      | <number equal to the rows returned + 1> |
| Handler_rollback           | 0                                       |
| Handler_savepoint          | 0                                       |
| Handler_savepoint_rollback | 0                                       |
| Handler_tmp_update         | <high number>                           |
| Handler_tmp_write          | 5                                       |
| Handler_update             | 0                                       |
| Handler_write              | 0                                       |
+----------------------------+-----------------------------------------+
26 rows in set (0.14 sec)

That high number can be close to 10 million in the worst-case scenario (a Wikidata bot), all of which the host has to read and then sort in a temporary table, only to finally return just a few rows (normally fewer than 10).

I don't think it is possible to get real-time data on a 36GB table (wikidatawiki), no matter which index is used; most of the time the data will be very cold and will have to be read from disk and sorted manually. To do column-based analytics like this, you would need a column-based engine (with the downside that point selects would probably be quite slow).

I can see four alternatives:

  • Set up a specific data store we replicate to, and produce those specific reports outside of MySQL. This is how pageviews are calculated, and it requires a large amount of engineering.
  • Emulate a column-based store by generating reports every X amount of time and storing them pre-calculated in MySQL (think of a small table with {user, ip, list of user agents}; see the sketch after this list). This is similar to how special pages get updated.
  • Generate that table at edit time as a post-edit hook, by checking whether the user agent is in the list of existing ones and inserting it if not, potentially also tracking the number of times/last time it was used; that could cause contention for users with high edit rates, and it would need to be designed carefully (it is possible, just not easy). This is similar to how site statistics are calculated.
  • Restrict the querying to a 24-hour window and use cuc_timestamp (not even this would work).
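
A minimal sketch of what the pre-calculated table in the second and third options could look like; all table and column names here are made up for illustration, and the exact types and columns would need review:

-- Hypothetical summary table (names and types are illustrative only)
CREATE TABLE cu_useragent_summary (
  cuas_user      INT UNSIGNED NOT NULL,
  cuas_ip_hex    VARBINARY(35) NOT NULL,
  cuas_agent     VARBINARY(255) NOT NULL,
  cuas_count     INT UNSIGNED NOT NULL DEFAULT 1,
  cuas_last_used BINARY(14) NOT NULL,
  PRIMARY KEY (cuas_user, cuas_ip_hex, cuas_agent)
);

-- For the "at edit time" variant: one upsert per edit from the hook
INSERT INTO cu_useragent_summary
  (cuas_user, cuas_ip_hex, cuas_agent, cuas_count, cuas_last_used)
VALUES
  (2, 'C0A8000A', 'SOMEUSERAGENT', 1, '20190101000000')
ON DUPLICATE KEY UPDATE
  cuas_count = cuas_count + 1,
  cuas_last_used = VALUES(cuas_last_used);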

Let me know your thoughts before going with the analysis of the second query.

I don't like any of those options; they all cost too much (engineering cost, code review time, etc.) for little benefit. There is a completely different solution: rewrite the CU tool such that once it loads the IP list, it uses the CU API to fetch detailed edits for each user or IP and parses out the UA frequencies in JavaScript code run on the client side. It is much easier to engineer, it shifts the workload mostly to the client side (there will be tens or hundreds of CU queries over the API, but they will be queries that are already optimized), and the only negative side effect I can think of is that it spams the CU log.

Please go ahead and analyze the second query; even if we end up pursuing the API-based approach, it might still be a good idea to offload some of the work to additional API calls that are optimized to answer specific questions (e.g. the estimated number of edits by an IP/UA pair in a given time period).

Mentioned in SAL (#wikimedia-operations) [2019-02-07T09:23:17Z] <jynus> running alter table on db2055 for perforamance testing T212092

So for the second query, the problems are similar: they are "GROUP BYs without a WHERE*" (* or the WHERE clause is too open), which means the query has to scan as many rows as there are occurrences, which depends on the number of edits. For the top editor on the English Wikipedia, that is:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cu_changes
         type: ALL
possible_keys: cuc_timestamp
          key: NULL
      key_len: NULL
          ref: NULL
         rows: <high number>
        Extra: Using where
1 row in set (0.04 sec)

The high number is close to several million. A full table scan is done, because there is no index on cuc_agent. I created one on a test host:

ALTER TABLE cu_changes ADD INDEX (cuc_agent, cuc_timestamp);
+----------+
| COUNT(*) |
+----------+
|   149135 |
+----------+
1 row in set (0.18 sec)

root@db2055.codfw.wmnet[enwiki]> show status like 'hand%';
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 0      |
| Handler_icp_attempts       | 0      |
| Handler_icp_match          | 0      |
| Handler_mrr_init           | 0      |
| Handler_mrr_key_refills    | 0      |
| Handler_mrr_rowid_refills  | 0      |
| Handler_prepare            | 0      |
| Handler_read_first         | 0      |
| Handler_read_key           | 1      |
| Handler_read_last          | 0      |
| Handler_read_next          | 149135 |
| Handler_read_prev          | 0      |
| Handler_read_retry         | 0      |
| Handler_read_rnd           | 0      |
| Handler_read_rnd_deleted   | 0      |
| Handler_read_rnd_next      | 0      |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_tmp_update         | 0      |
| Handler_tmp_write          | 0      |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+
26 rows in set (0.09 sec)

So as long as there is not a disproportionate number of edits for a single agent, it should work.

I did not do a lot of testing for the second query, but that one would require a separate index; cuc_ip_hex_time cannot be used efficiently due to the way ranges work (despite ICP). You would want a specific (agent, ip, time) or (ip, agent, time) index. Without that change, it would take quite a lot of extra dives, so I would prefer the extra index rather than the 1-second query on an idle host (much longer on a regular one).
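
A sketch of the kind of index meant above; the index name and column order are placeholders (either (agent, ip, time) or (ip, agent, time) would do, and only one would be needed):

-- Hypothetical index; name and column order are placeholders, not a reviewed schema change
ALTER TABLE cu_changes
  ADD INDEX cuc_agent_ip_time (cuc_agent, cuc_ip_hex, cuc_timestamp);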

Note that a change to the query, such as switching the IP check to a range (LIKE) or something else, would have an impact on performance, and the indexes would not be used efficiently.