Page MenuHomePhabricator

Do test queries for range contributions to gauge performance of using different tables
Closed, ResolvedPublic2 Story Points

Description

Let's test some different possible queries for getting range contributions and see what the performance is like. Specifically I'd like to know if using the recentchanges table (as @jcrespo suggested) will provide a noticeable improvement over joining against the revision table. Since we don't actually have our IP-hex table available yet, let's use the CheckUser table for that data here.

Tests are being done on production due to the sensitivity of data on db2048. The following tables have to be dropped before this can be closed (while binlog is disabled for the session):
ip_changes
ip_changes2

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added a project: Security. · View Herald TranscriptJan 30 2017, 5:29 AM
kaldari changed the visibility from "Custom Policy" to "Custom Policy".Jan 30 2017, 6:06 AM
Aklapper renamed this task from Do test queries for range contributions to guage performance of using different tables to Do test queries for range contributions to gauge performance of using different tables.Jan 30 2017, 11:19 AM

MusikAnimal set Security to Software security bug.

@MusikAnimal: If this is a Security issue, why did the Security tag get removed?

@MusikAnimal: If this is a Security issue, why did the Security tag get removed?

It's not a security issue, per se, rather we might be discussing some sensitive data. I wrote this in the "comments" when changing the visibility. I just didn't want to trouble the security team :) Is there a way to make tasks private without it auto-tagging as a security issue? Or perhaps this is considered misuse of custom policy, and we shouldn't use Phabricator for things like this?

jcrespo added a comment.EditedJan 30 2017, 6:03 PM

Tip: as a DBA I usually handle sensitive data but not software bugs. What I do is to create a paste with NDA-only access so sensitive data is not accessible (never passswords or things like that, which should never get out of the cluster), and refer it from the ticket, leaving it open. I can also change a ticket to be NDA-only without being a security issue, but probably that option is not available to everyone. I can help handling that, if that helps (specially if it is query-related).

Another tip: Using pager cat > /dev/nulls on the mysql command line client gets you the timing and allows you to run SHOW STATUS like 'Hand%' without showing the row results. Disable with nopager.

MusikAnimal added a comment.EditedJan 30 2017, 9:27 PM

@jcrespo Changing this to be non-security NDA-only seems most appropriate. Moving forward we will use pastes. I still think it is weird to even be discussing running queries on production cu_changes. It's in a work capacity so I know it's fine, but still :)

Anyhoo, I've found what I believe to be sufficiently "busy" IP ranges. If you want to help with this Jamie that'd be great. If these are too slow, maybe you know how to make them not slow... =P Otherwise @Niharika or @kaldari will run it at their convenience. Just a heads up that these ranges do contain data from registered accounts. Again we just want the run times, not the results.

I'm going to list out the full queries for each scenario with IPv4 and IPv6. The "more complex" queries are what they would look like if more filtering options were entered at Special:RangeContribs (I'm trying to cover all grounds), but since they narrow down the results maybe it will actually go faster than with no options. We'll see!

Joining with revision

Simple query

IPv4 /16 range

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND ((rev_deleted & 4) = 0) AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rev_timestamp DESC LIMIT 501;

IPv6 /32 range

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND ((rev_deleted & 4) = 0) AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rev_timestamp DESC LIMIT 501;

More complex query

IPv4 /16 range

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) INNER JOIN `change_tag` ON (ct_rev_id=rev_id) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND (rev_id = page_latest) AND (rev_parent_id = 0) AND (rev_minor_edit = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rev_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rev_timestamp DESC LIMIT 501;

IPv6 /32 range

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) INNER JOIN `change_tag` ON (ct_rev_id=rev_id) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND (rev_id = page_latest) AND (rev_parent_id = 0) AND (rev_minor_edit = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rev_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rev_timestamp DESC LIMIT 501;

Joining with recentchanges

Simple query

IPv4 /16 range

SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND ((rc_deleted & 4) = 0) AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rc_timestamp DESC LIMIT 501;

IPv6 /32 range

SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND ((rc_deleted & 4) = 0) AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rc_timestamp DESC LIMIT 501;

More complex query

IPv4 /16 range

SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) INNER JOIN `change_tag` ON (ct_rev_id=rc_this_oldid) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND (rc_this_oldid = page_latest) AND (rc_last_oldid = 0) AND (rc_minor = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rc_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rc_timestamp DESC LIMIT 501;

IPv6 /32 range

SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) INNER JOIN `change_tag` ON (ct_rev_id=rc_this_oldid) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND (rc_this_oldid = page_latest) AND (rc_last_oldid = 0) AND (rc_minor = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rc_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rc_timestamp DESC LIMIT 501;

@jcrespo Changing this to be non-security NDA-only seems most appropriate. Moving forward we will use pastes. I still think it is weird to even be discussing running queries on production cu_changes. It's in a work capacity so I know it's fine, but still :)

Anyhoo, I've found what I believe to be sufficiently "busy" IP ranges. If you want to help with this Jamie that'd be great. If these are too slow, maybe you know how to make them not slow... =P Otherwise @Niharika or @kaldari will run it at their convenience. Just a heads up that these ranges do contain data from registered accounts. Again we just want the run times, not the results.

I'm going to list out the full queries for each scenario with IPv4 and IPv6. The "more complex" queries are what they would look like if more filtering options were entered at Special:RangeContribs (I'm trying to cover all grounds), but since they narrow down the results maybe it will actually go faster than with no options. We'll see!

= Joining with revision =
=== Simple query ===
IPv4 /16 range

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND ((rev_deleted & 4) = 0) AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rev_timestamp DESC LIMIT 501;

Has been running for ~90 minutes now. I'll keep a check and let you know the exact runtime if/when it finishes. I didn't bother with the query below because it's gonna take even longer, I expect.

IPv6 /32 range

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND ((rev_deleted & 4) = 0) AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rev_timestamp DESC LIMIT 501;

=== More complex query ===
IPv4 /16 range

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) INNER JOIN `change_tag` ON (ct_rev_id=rev_id) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND (rev_id = page_latest) AND (rev_parent_id = 0) AND (rev_minor_edit = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rev_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rev_timestamp DESC LIMIT 501;

This seems like a dummy query. ct_tag='Foobar'?

IPv6 /32 range

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) INNER JOIN `change_tag` ON (ct_rev_id=rev_id) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND (rev_id = page_latest) AND (rev_parent_id = 0) AND (rev_minor_edit = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rev_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rev_timestamp DESC LIMIT 501;

Same as above.

= Joining with recentchanges =
=== Simple query ===
IPv4 /16 range

SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND ((rc_deleted & 4) = 0) AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rc_timestamp DESC LIMIT 501;

501 rows in set (5.11 sec)

IPv6 /32 range

SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND ((rc_deleted & 4) = 0) AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rc_timestamp DESC LIMIT 501;

501 rows in set (1 min 11.60 sec)

=== More complex query ===
IPv4 /16 range

SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) INNER JOIN `change_tag` ON (ct_rev_id=rc_this_oldid) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND (rc_this_oldid = page_latest) AND (rc_last_oldid = 0) AND (rc_minor = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rc_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rc_timestamp DESC LIMIT 501;

This seems like a dummy query. ct_tag='Foobar'?

IPv6 /32 range

SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) INNER JOIN `change_tag` ON (ct_rev_id=rc_this_oldid) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND (rc_this_oldid = page_latest) AND (rc_last_oldid = 0) AND (rc_minor = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rc_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rc_timestamp DESC LIMIT 501;

Same as above.

Please do not leave queries running on active production servers, I have killed them. Wikiadmin user queries are not killed after 300 seconds.

MusikAnimal added a comment.EditedJan 31 2017, 4:10 PM

Thanks for checking. All of these run times are too long, in my opinion... Are there any improvements we can make to the queries themselves? I think I'm taking advantage of indexes but maybe not. For the record, I am able to search for both IP ranges more or less get instant results using Special:CheckUser on enwiki.

Please do not leave queries running on active production servers, I have killed them. Wikiadmin user queries are not killed after 300 seconds.

My apologies! I wasn't aware that they run indefinitely. For future, is there a better venue for long-running queries where they won't be killed?

MaxSem added a subscriber: MaxSem.Feb 1 2017, 7:40 PM

Query plans:

mysql:wikiadmin@db1083 [enwiki]> explain SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND ((rev_deleted & 4) = 0) AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rev_timestamp DESC LIMIT 501;
+------+--------------------+------------+------------+------------------------------+---------------------------+---------+--------------------------+-----------+--------------------------------------------------+
| id   | select_type        | table      | type       | possible_keys                | key                       | key_len | ref                      | rows      | Extra                                            |
+------+--------------------+------------+------------+------------------------------+---------------------------+---------+--------------------------+-----------+--------------------------------------------------+
|    1 | PRIMARY            | revision   | ALL        | NULL                         | NULL                      | NULL    | NULL                     | 742066494 | Using where; Using temporary; Using filesort     |
|    1 | PRIMARY            | page       | eq_ref     | PRIMARY                      | PRIMARY                   | 4       | enwiki.revision.rev_page |         1 |                                                  |
|    1 | PRIMARY            | cu_changes | hash_range | cuc_ip_hex_time              | #hash#$hj:cuc_ip_hex_time | 4:258   | enwiki.revision.rev_id   |     10026 | Using where; Using join buffer (flat, BNLH join) |
|    2 | DEPENDENT SUBQUERY | change_tag | ref        | ct_rev_id,change_tag_rev_tag | ct_rev_id                 | 5       | enwiki.revision.rev_id   |         1 | Using where; Using index                         |
+------+--------------------+------------+------------+------------------------------+---------------------------+---------+--------------------------+-----------+--------------------------------------------------+

mysql:wikiadmin@db1083 [enwiki]> explain SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND ((rev_deleted & 4) = 0) AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rev_timestamp DESC LIMIT 501;
+------+--------------------+------------+------------+------------------------------+---------------------------+---------+--------------------------+-----------+--------------------------------------------------+
| id   | select_type        | table      | type       | possible_keys                | key                       | key_len | ref                      | rows      | Extra                                            |
+------+--------------------+------------+------------+------------------------------+---------------------------+---------+--------------------------+-----------+--------------------------------------------------+
|    1 | PRIMARY            | revision   | ALL        | NULL                         | NULL                      | NULL    | NULL                     | 742066662 | Using where; Using temporary; Using filesort     |
|    1 | PRIMARY            | page       | eq_ref     | PRIMARY                      | PRIMARY                   | 4       | enwiki.revision.rev_page |         1 |                                                  |
|    1 | PRIMARY            | cu_changes | hash_range | cuc_ip_hex_time              | #hash#$hj:cuc_ip_hex_time | 4:258   | enwiki.revision.rev_id   |     51040 | Using where; Using join buffer (flat, BNLH join) |
|    2 | DEPENDENT SUBQUERY | change_tag | ref        | ct_rev_id,change_tag_rev_tag | ct_rev_id                 | 5       | enwiki.revision.rev_id   |         1 | Using where; Using index                         |
+------+--------------------+------------+------------+------------------------------+---------------------------+---------+--------------------------+-----------+--------------------------------------------------+

mysql:wikiadmin@db1083 [enwiki]> explain SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) INNER JOIN `change_tag` ON (ct_rev_id=rev_id) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND (rev_id = page_latest) AND (rev_parent_id = 0) AND (rev_minor_edit = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rev_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rev_timestamp DESC LIMIT 501;
+------+--------------------+------------+----------+-------------------------------------------------------+-----------------+---------+----------------------------------+-----------+------------------------------------------------------------------------+
| id   | select_type        | table      | type     | possible_keys                                         | key             | key_len | ref                              | rows      | Extra                                                                  |
+------+--------------------+------------+----------+-------------------------------------------------------+-----------------+---------+----------------------------------+-----------+------------------------------------------------------------------------+
|    1 | PRIMARY            | change_tag | ref      | ct_rev_id,change_tag_rev_tag,ct_tag,change_tag_tag_id | ct_tag          | 257     | const                            |         1 | Using where; Using index; Using temporary; Using filesort              |
|    1 | PRIMARY            | cu_changes | range    | cuc_ip_hex_time                                       | cuc_ip_hex_time | 258     | NULL                             |     10026 | Using index condition; Using where; Using join buffer (flat, BNL join) |
|    1 | PRIMARY            | revision   | hash_ALL | NULL                                                  | #hash#$hj       | 4       | enwiki.cu_changes.cuc_this_oldid | 742066828 | Using where; Using join buffer (incremental, BNLH join)                |
|    1 | PRIMARY            | page       | eq_ref   | PRIMARY,name_title                                    | PRIMARY         | 4       | enwiki.revision.rev_page         |         1 | Using where                                                            |
|    2 | DEPENDENT SUBQUERY | change_tag | ref      | ct_rev_id,change_tag_rev_tag                          | ct_rev_id       | 5       | enwiki.revision.rev_id           |         1 | Using where; Using index                                               |
+------+--------------------+------------+----------+-------------------------------------------------------+-----------------+---------+----------------------------------+-----------+------------------------------------------------------------------------+

mysql:wikiadmin@db1083 [enwiki]> explain SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `cu_changes` LEFT JOIN `revision` FORCE INDEX (usertext_timestamp) ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) INNER JOIN `change_tag` ON (ct_rev_id=rev_id) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND (rev_id = page_latest) AND (rev_parent_id = 0) AND (rev_minor_edit = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rev_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rev_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rev_timestamp DESC LIMIT 501;
+------+--------------------+------------+----------+-------------------------------------------------------+-----------------+---------+----------------------------------+-----------+------------------------------------------------------------------------+
| id   | select_type        | table      | type     | possible_keys                                         | key             | key_len | ref                              | rows      | Extra                                                                  |
+------+--------------------+------------+----------+-------------------------------------------------------+-----------------+---------+----------------------------------+-----------+------------------------------------------------------------------------+
|    1 | PRIMARY            | change_tag | ref      | ct_rev_id,change_tag_rev_tag,ct_tag,change_tag_tag_id | ct_tag          | 257     | const                            |         1 | Using where; Using index; Using temporary; Using filesort              |
|    1 | PRIMARY            | cu_changes | range    | cuc_ip_hex_time                                       | cuc_ip_hex_time | 258     | NULL                             |     51040 | Using index condition; Using where; Using join buffer (flat, BNL join) |
|    1 | PRIMARY            | revision   | hash_ALL | NULL                                                  | #hash#$hj       | 4       | enwiki.cu_changes.cuc_this_oldid | 742066861 | Using where; Using join buffer (incremental, BNLH join)                |
|    1 | PRIMARY            | page       | eq_ref   | PRIMARY,name_title                                    | PRIMARY         | 4       | enwiki.revision.rev_page         |         1 | Using where                                                            |
|    2 | DEPENDENT SUBQUERY | change_tag | ref      | ct_rev_id,change_tag_rev_tag                          | ct_rev_id       | 5       | enwiki.revision.rev_id           |         1 | Using where; Using index                                               |
+------+--------------------+------------+----------+-------------------------------------------------------+-----------------+---------+----------------------------------+-----------+------------------------------------------------------------------------+

mysql:wikiadmin@db1083 [enwiki]> explain SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND ((rc_deleted & 4) = 0) AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rc_timestamp DESC LIMIT 501;
+------+--------------------+---------------+--------+------------------------------+-----------------+---------+------------------------------------+-------+---------------------------------------------------------------------+
| id   | select_type        | table         | type   | possible_keys                | key             | key_len | ref                                | rows  | Extra                                                               |
+------+--------------------+---------------+--------+------------------------------+-----------------+---------+------------------------------------+-------+---------------------------------------------------------------------+
|    1 | PRIMARY            | cu_changes    | range  | cuc_ip_hex_time              | cuc_ip_hex_time | 258     | NULL                               | 10026 | Using index condition; Using where; Using temporary; Using filesort |
|    1 | PRIMARY            | recentchanges | ref    | rc_timestamp,rc_cur_id,tmp_1 | tmp_1           | 4       | enwiki.cu_changes.cuc_this_oldid   |     1 | Using where                                                         |
|    1 | PRIMARY            | page          | eq_ref | PRIMARY                      | PRIMARY         | 4       | enwiki.recentchanges.rc_cur_id     |     1 |                                                                     |
|    2 | DEPENDENT SUBQUERY | change_tag    | ref    | ct_rev_id,change_tag_rev_tag | ct_rev_id       | 5       | enwiki.recentchanges.rc_this_oldid |     1 | Using where; Using index                                            |
+------+--------------------+---------------+--------+------------------------------+-----------------+---------+------------------------------------+-------+---------------------------------------------------------------------+

mysql:wikiadmin@db1083 [enwiki]> explain SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND ((rc_deleted & 4) = 0) AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY rc_timestamp DESC LIMIT 501;
+------+--------------------+---------------+--------+------------------------------+-----------------+---------+------------------------------------+-------+---------------------------------------------------------------------+
| id   | select_type        | table         | type   | possible_keys                | key             | key_len | ref                                | rows  | Extra                                                               |
+------+--------------------+---------------+--------+------------------------------+-----------------+---------+------------------------------------+-------+---------------------------------------------------------------------+
|    1 | PRIMARY            | cu_changes    | range  | cuc_ip_hex_time              | cuc_ip_hex_time | 258     | NULL                               | 51040 | Using index condition; Using where; Using temporary; Using filesort |
|    1 | PRIMARY            | recentchanges | ref    | rc_timestamp,rc_cur_id,tmp_1 | tmp_1           | 4       | enwiki.cu_changes.cuc_this_oldid   |     1 | Using where                                                         |
|    1 | PRIMARY            | page          | eq_ref | PRIMARY                      | PRIMARY         | 4       | enwiki.recentchanges.rc_cur_id     |     1 |                                                                     |
|    2 | DEPENDENT SUBQUERY | change_tag    | ref    | ct_rev_id,change_tag_rev_tag | ct_rev_id       | 5       | enwiki.recentchanges.rc_this_oldid |     1 | Using where; Using index                                            |
+------+--------------------+---------------+--------+------------------------------+-----------------+---------+------------------------------------+-------+---------------------------------------------------------------------+

mysql:wikiadmin@db1083 [enwiki]> explain SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) INNER JOIN `change_tag` ON (ct_rev_id=rc_this_oldid) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND (rc_this_oldid = page_latest) AND (rc_last_oldid = 0) AND (rc_minor = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rc_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rc_timestamp DESC LIMIT 501;
+------+--------------------+---------------+--------+-------------------------------------------------------+-----------------+---------+------------------------------------+-------+------------------------------------------------------------------------+
| id   | select_type        | table         | type   | possible_keys                                         | key             | key_len | ref                                | rows  | Extra                                                                  |
+------+--------------------+---------------+--------+-------------------------------------------------------+-----------------+---------+------------------------------------+-------+------------------------------------------------------------------------+
|    1 | PRIMARY            | change_tag    | ref    | ct_rev_id,change_tag_rev_tag,ct_tag,change_tag_tag_id | ct_tag          | 257     | const                              |     1 | Using where; Using index; Using temporary; Using filesort              |
|    1 | PRIMARY            | recentchanges | ref    | rc_timestamp,rc_cur_id,tmp_1                          | tmp_1           | 4       | enwiki.change_tag.ct_rev_id        |     1 | Using index condition; Using where                                     |
|    1 | PRIMARY            | page          | eq_ref | PRIMARY,name_title                                    | PRIMARY         | 4       | enwiki.recentchanges.rc_cur_id     |     1 | Using where                                                            |
|    1 | PRIMARY            | cu_changes    | range  | cuc_ip_hex_time                                       | cuc_ip_hex_time | 258     | NULL                               | 10026 | Using index condition; Using where; Using join buffer (flat, BNL join) |
|    2 | DEPENDENT SUBQUERY | change_tag    | ref    | ct_rev_id,change_tag_rev_tag                          | ct_rev_id       | 5       | enwiki.recentchanges.rc_this_oldid |     1 | Using where; Using index                                               |
+------+--------------------+---------------+--------+-------------------------------------------------------+-----------------+---------+------------------------------------+-------+------------------------------------------------------------------------+

mysql:wikiadmin@db1083 [enwiki]> explain SELECT rc_id,rc_cur_id,rc_timestamp,rc_comment,rc_user_text,rc_minor,rc_deleted,rc_old_len,rc_new_len,rc_this_oldid,rc_namespace,rc_title,rc_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rc_this_oldid) AS `ts_tags` FROM `cu_changes` LEFT JOIN `recentchanges` ON (cuc_this_oldid = rc_this_oldid) INNER JOIN `page` ON (page_id = rc_cur_id) INNER JOIN `change_tag` ON (ct_rev_id=rc_this_oldid) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND (rc_this_oldid = page_latest) AND (rc_last_oldid = 0) AND (rc_minor = 0) AND (page_namespace != '13' AND page_namespace != '12') AND ((rc_deleted & 4) = 0) AND ct_tag = 'Foobar' AND (rc_timestamp > '20161101212048') AND (cuc_type IN (0, 1)) ORDER BY cuc_ip_hex DESC, rc_timestamp DESC LIMIT 501;
+------+--------------------+---------------+--------+-------------------------------------------------------+-----------------+---------+------------------------------------+-------+------------------------------------------------------------------------+
| id   | select_type        | table         | type   | possible_keys                                         | key             | key_len | ref                                | rows  | Extra                                                                  |
+------+--------------------+---------------+--------+-------------------------------------------------------+-----------------+---------+------------------------------------+-------+------------------------------------------------------------------------+
|    1 | PRIMARY            | change_tag    | ref    | ct_rev_id,change_tag_rev_tag,ct_tag,change_tag_tag_id | ct_tag          | 257     | const                              |     1 | Using where; Using index; Using temporary; Using filesort              |
|    1 | PRIMARY            | recentchanges | ref    | rc_timestamp,rc_cur_id,tmp_1                          | tmp_1           | 4       | enwiki.change_tag.ct_rev_id        |     1 | Using index condition; Using where                                     |
|    1 | PRIMARY            | page          | eq_ref | PRIMARY,name_title                                    | PRIMARY         | 4       | enwiki.recentchanges.rc_cur_id     |     1 | Using where                                                            |
|    1 | PRIMARY            | cu_changes    | range  | cuc_ip_hex_time                                       | cuc_ip_hex_time | 258     | NULL                               | 51040 | Using index condition; Using where; Using join buffer (flat, BNL join) |
|    2 | DEPENDENT SUBQUERY | change_tag    | ref    | ct_rev_id,change_tag_rev_tag                          | ct_rev_id       | 5       | enwiki.recentchanges.rc_this_oldid |     1 | Using where; Using index                                               |
+------+--------------------+---------------+--------+-------------------------------------------------------+-----------------+---------+------------------------------------+-------+------------------------------------------------------------------------+
jcrespo added a comment.EditedFeb 1 2017, 7:56 PM

No need to apologize.

We have some anlytics stores for very long running queries (dbstore1002), they have the content, but highly compressed, so ok for data checks, but not 100% production experience.

For production testing, any of the passive slaves (currently, the ones on codfw) and specifically, the vslow/dump servers (eqiad or codfw) are the best choice. If you think it is going to take some time, I can even reserve some slaves for testing (I have some plans to do that permanently, on codfw). If you use codfw slaves, it is preferred to run queries from the terbium equivalent (wasat.codfw.wmnet).

You can also connect as wikiadmin and kill them manually if needed with the KILL command (SHOW PROCESSLIST to show them running). They impact performance because all subsequent writes have store its UNDO data for consistent view, so normally we do not allow long running queries, except for the wikiadmin user.

Focusing on revision, I think I ran into some query caching which does not help for testing, but these went pretty fast (a few seconds at most):

SELECT rev_id FROM cu_changes LEFT JOIN revision ON (cuc_this_oldid = rev_id) INNER JOIN page ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN '3F8F0000' AND '3F8FFFFF') AND (rev_timestamp > '20161101212048') LIMIT 51;

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND (rev_timestamp > '20161101212048') LIMIT 51;

The first is a very basic query for an IPv4 range, the second gets all the data we need for an IPv6 range. The one thing missing is ordering by rev_timestamp. I guess it's not surprising this slows it down – significantly:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-26008800000000000000000000000000' AND 'v6-26008800FFFFFFFFFFFFFFFFFFFFFFFF') AND (rev_timestamp > '20161101212048') ORDER BY rev_timestamp DESC LIMIT 51;

51 rows in set (50.79 sec)

We obviously have to order by timestamp... any ideas from you kind SQL experts? For reference the indexes on revision are here.

These test queries also only fetch 51 rows. We need 501 to get a max sized page.


However, looking at recent checks on enwiki, many queries went plenty fast:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-20018003A0AC00000000000000000000' AND 'v6-20018003A0ACFFFFFFFFFFFFFFFFFFFF') AND (rev_timestamp > '20161101212048') ORDER BY rev_timestamp DESC LIMIT 51;

26 rows in set (0.29 sec)

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-26056001E29879000000000000000000' AND 'v6-26056001E2987900FFFFFFFFFFFFFFFF') AND (rev_timestamp > '20161101212048') ORDER BY rev_timestamp DESC LIMIT 51;

51 rows in set (0.31 sec)


Others did not:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-2A032880200000000000000000000000' AND 'v6-2A0328803FFFFFFFFFFFFFFFFFFFFFFF') AND (rev_timestamp > '20161101212048') ORDER BY rev_timestamp DESC LIMIT 51;

51 rows in set (42.36 sec)

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-24050204B00000000000000000000000' AND 'v6-24050204B1FFFFFFFFFFFFFFFFFFFFFF') AND (rev_timestamp > '20161101212048') ORDER BY rev_timestamp DESC LIMIT 51;

51 rows in set (25.05 sec)

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-24050205600000000000000000000000' AND 'v6-2405020563FFFFFFFFFFFFFFFFFFFFFF') AND (rev_timestamp > '20161101212048') ORDER BY rev_timestamp DESC LIMIT 51;

51 rows in set (18.16 sec)

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-26020306000000000000000000000000' AND 'v6-26020306FFFFFFFFFFFFFFFFFFFFFFFF') AND (rev_timestamp > '20161101212048') ORDER BY rev_timestamp DESC LIMIT 51;

/32 range. Never finished after 15 minutes...


It's the wider ranges with more edits that slow things down, which is a no-brainer. Before we look at recentchanges I want to see if anyone knows how to improve the above queries. Going off of revision is most ideal, but queries taking nearly a minute to complete aren't really that acceptable. We might want to limit the IPv6 range to say, /48. /32 is ridiculous and no one should ever be blocking a range that wide.

Unless im missing something, these queries seem to include logged in users?

For the query you have above:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-26020306000000000000000000000000' AND 'v6-26020306FFFFFFFFFFFFFFFFFFFFFFFF') AND (rev_timestamp > '20161101212048') ORDER BY rev_timestamp DESC LIMIT 51;

It may be a bit faster by instead doing:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-26020306000000000000000000000000' AND 'v6-26020306FFFFFFFFFFFFFFFFFFFFFFFF') AND cuc_user = 0 AND AND (cuc_type IN (0, 1)) AND (cuc_timestamp > '20161101212048')  ORDER BY cuc_timestamp DESC LIMIT 51;

The difference being:

  • Sorting on cuc_timestamp instead of rev_timestamp
  • Add the date cut off via cuc_timestamp instead of rev_timestamp
  • Adding the cuc_user = 0 condition (That's partially for logical correctness, my understanding is you only want logged out users here). This could help by eliminating rows before filesorting. On the other hand, it might make the cuc_user_ip_time be used, which could have other affects on query performance. (In which cases it'd be better to just do a like query over the non-hex ip field)
  • cuc_type = 0,1 only.

Hopefully this will make the query not use a join buffer and reduce the number of rows it has to sort (by removing logged in users). The query is still not what I would deem "efficient" though.

Depending on the distribution of data, it could actually possibly be more efficient to give

Actually, for the above example, the most efficient thing to do (for ipv6 when the cidr range is a multiple of 16) would be just rev_user_text like "2602:306:%" dito for any ipv4 multiple of 8.

In the cases where they arent nice multiples, due to the way the indexes on cu_changes are, it might still be more efficient to add a like condition on cuc_ip for the leftmost prefix that is constant, so combined with cuc_user can use the cuc_user_ip_time index. Maybe

16)

@Bawolff: For these test queries, we're only using the cuc_this_oldid and cuc_ip_hex fields from the CheckUser table because we want to simulate not relying on the CheckUser tables at all, but having a new ip_hex table that just maps revision_ids to IP hex values. Ideally, we don't want to rely on the CheckUser extension at all.

That being said, it looks like none of these test queries are that efficient, so I think we need to back up. Instead of comparing using the recentchanges table against using the revision table, I think we need to compare using the recentchanges table against fully using the cu_changes table. If there's a significant difference in speed, we should just build Special:RangeContributions as part of the CheckUser extension. I'm going to close this ticket for now and open a new one to do that.

kaldari closed this task as Resolved.Feb 9 2017, 5:10 PM
kaldari claimed this task.
kaldari moved this task from Needs Review/Feedback to Q1 2018-19 on the Community-Tech-Sprint board.

@Bawolff You are absolutely correct about those query improvements. My last example which never finished after 15 minutes took only 0.28 seconds (!!!) when I went off of cuc_timestamp. I believe this mainly because cu_changes is a significantly smaller table, also with indexing on cuc_timestamp. Our table would be replicating revision indefinitely, and even though it's only IP edits/page creations, it would eventually grow larger than the regularly pruned cu_changes. Furthermore, we weren't going to store any timestamps, instead only including the foreign key rev_id and an IP hex column.

So maybe, just maybe, we could get away with replicating rev_timestamp in our table, too? I think that might make a world of a difference. How often would this information change and need to be backfilled into our table? Pinging @Legoktm as I think had some reservations about duplicating data beyond the foreign key.

Id actually be suprised if the index on cuc_timestamp came into play. That should only happen if the time range was small enough that filtering by timestamp was more efficient than filtering by ip address (mysql can only use indexes for one range in a query except in certain obscure circumstances that dont apply here). In any case you can check which index was used by running EXPLAIN.

For reducing the number of rows the db has to look at, one option is further denormalization, and have some sort of partial date (e.g. just the year or how many six month periods have passed) and then index on (year, ip_hex, timestamp, rev_id) and make separate queries for each year you need.

Rev_timestamp never changes. The only thing that can happen is deletion moves the entire revision row to the archive table. If there is concern about size of table, storing it as an int instead of typical mediawiki timestamp will reduce space needed by a bit.

So maybe, just maybe, we could get away with replicating rev_timestamp in our table, too?

Fwiw, its unclear how much of the speed improvements were eliminating registered users (cuc_user=0) from the filesort vs eliminating rows earlier in the checkuser table before doing the join (using cuc_timestamp for the cut off, cuc_type in (0,1) ).Its possible you may get enough of a speed up just from eliminating logged in users that you wouldnt have to denormalize the timestamp.

Actually, for the above example, the most efficient thing to do (for ipv6 when the cidr range is a multiple of 16) would be just rev_user_text like "2602:306:%" dito for any ipv4 multiple of 8.

We plan to support wildcard searches as well, which indeed go lightning fast when querying revision by itself. I don't think I'll use our new table at all if the user requests a wildcard search.

MusikAnimal added a comment.EditedFeb 13 2017, 10:59 PM

Fwiw, its unclear how much of the speed improvements were eliminating registered users (cuc_user=0) from the filesort vs eliminating rows earlier in the checkuser table before doing the join (using cuc_timestamp for the cut off, cuc_type in (0,1) ).Its possible you may get enough of a speed up just from eliminating logged in users that you wouldnt have to denormalize the timestamp.

cuc_user = 0 and cuc_type IN (0,1) definitely helps, and I will include it in my tests below because that would seemingly be closer to the runtime of querying a table that is IPs and revisions-only. So here's the query I used on cuc_timestamp, similar to what you proposed:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM cu_changes LEFT JOIN `revision` ON (cuc_this_oldid = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (cuc_ip_hex BETWEEN 'v6-26020306000000000000000000000000' AND 'v6-26020306FFFFFFFFFFFFFFFFFFFFFFFF') AND cuc_user = 0 AND (cuc_type IN (0, 1)) AND (cuc_timestamp > '20161101212048')  ORDER BY cuc_timestamp DESC LIMIT 51;

Result: 51 rows in set (2.52 sec)

I'm guessing this used the cuc_ip_hex_time timestamp, on cuc_ip_hex and cuc_timestamp. Here's the EXPLAIN:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1PRIMARYcu_changesindexcuc_timestamp,cuc_ip_hex_time,cuc_user_time,cuc_user_ip_timecuc_timestamp16NULL2111Using where
1PRIMARYrevisioneq_refPRIMARY,rev_id,page_timestamprev_id4enwiki.cu_changes.cuc_this_oldid1
1PRIMARYpageeq_refPRIMARYPRIMARY4enwiki.revision.rev_page1
2DEPENDENT SUBQUERYchange_tagrefct_rev_id,change_tag_rev_tagct_rev_id5enwiki.revision.rev_id1Using where; Using index

The same exact query except using rev_timestamp > '20161101212048' again never finished after ~10 minutes. So whether or not you take into account narrowing down by cuc_user and cuc_type, it seems going off of the timestamp in the replicated table provides a huge performance improvement.


Going back over some of the other test ranges, using the same query as above:

RangeUsing rev_timestampUsing cuc_timestamp
2a03:2880:2000::/3528.50 sec4.39 sec
2405:205:6000::/3816.73 sec10.24 sec
2405:204:b000::/3935.71 sec0.10 sec
2001:8003:a0ac::/481.72 sec0.03 sec
2605:6001:e298:7900::/640.46 sec0.03 sec

To test these queries, I ran the rev_timestamp version first, followed by the cuc_timestamp version. Then an hour later, when the mysterious caching goes away, I did it in the opposite order. Similar runtimes were found, so I don't think one was faster than the other because it was ran second.

Conclusion

I personally am now convinced the best solution is for our new table to replicate rev_id and rev_timestamp, along with our IP hex column. With this we can query any time range (as opposed to last 30 days through recentchanges), and we can put this feature into core (as opposed to piggybacking off of the CheckUser extension).

The only outstanding question in my mind is what happens when our new table grows say, 2 or 3 times as big as cu_changes. Do you think we will see the same performance? I suspect we would given we impose the same limitations: 500 max results per page, /32 for a IPv6 range, etc. Most will be checking /64, and we might even limit our tool to be up to /48. Blocks on ranges wider than that are rare and frankly discouraged.

Bawolff added a comment.EditedFeb 15 2017, 6:51 PM

I'm guessing this used the cuc_ip_hex_time timestamp, on cuc_ip_hex and cuc_timestamp. Here's the EXPLAIN:

The explain output says its ignoring the cuc_ip_hex_time and the cuc_user_time index, and is instead just using cuc_timestamp. Which seems not the best index. Possibly mysql thinks that filtering using the cuc_ip_hex_time or the cuc_user_time index won't filter out enough rows to make it worth while.

*edit*: I'm not sure why I didn't come to this conclusion earlier, but mysql probably choses the cuc_timestamp on big IP ranges to avoid the filesort.

To explain the explain output:
The possible keys column is all the keys that mysql considered. The key column is the key it actually used. The key_length column allows you to tell how much of the key was used (So in the case of a multi-part key, you can tell if its using the entire thing, or just the first bit of it). The extra column tells you how its using the key ("Using where" for example, means that its filtering out any db rows that don't match the query). The thing in that column to watch out for is lines like "using filesort" or "using temporary" - if you see those, the query is probably going to be super slow (broadly speaking. There's exceptions of course).

Actually, for the above example, the most efficient thing to do (for ipv6 when the cidr range is a multiple of 16) would be just rev_user_text like "2602:306:%" dito for any ipv4 multiple of 8.

We plan to support wildcard searches as well, which indeed go lightning fast when querying revision by itself. I don't think I'll use our new table at all if the user requests a wildcard search.

I mean, we should possibly automatically convert 10.0.0.0/24 into "10.0.0.%" in cases where we can cleanly convert

I mean, we should possibly automatically convert 10.0.0.0/24 into "10.0.0.%" in cases where we can cleanly convert

Makes perfect sense! I've created T158230. Thank you :)

@jcrespo @Bawolff So here's what I've got for a proposed schema, modeled after revision and cu_changes:

--
-- Every time an edit by a logged out user is saved,
-- a row is created in ip_changes. This stores
-- the IP as a hex representation so that we can more
-- easily find edits within an IP range.
--
CREATE TABLE /*_*/ip_changes (
  -- Unique ID to identify each IP change
  ipc_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Foreign key to the revision table
  ipc_rev_id int unsigned NOT NULL DEFAULT '0',

  -- The timestamp of the revision
  ipc_rev_timestamp binary(14) NOT NULL DEFAULT '',

  -- Hex representation of the IP address
  ipc_hex varbinary(255) DEFAULT NULL,

) /*$wgDBTableOptions*/;

CREATE INDEX /*i*/ipc_rev_timestamp ON /*_*/ip_changes (ipc_rev_timestamp);
CREATE INDEX /*i*/ipc_hex_time ON /*_*/ip_changes (ipc_hex,ipc_rev_timestamp);

I could just as happily go with ip_revision, if that is a better name, then prefix everything with ipv_ instead of ipc_. I also wasn't sure if it was helpful to have our own primary key for this table, or if we should just use ipc_rev_id as the unique identifier (serving both as primary and foreign).

As far as the actual schema goes, do you think this is a good start? We can optimize queries as we go along.

kaldari changed the visibility from "Custom Policy" to "Public (No Login Required)".Feb 22 2017, 6:40 PM
MusikAnimal reopened this task as Open.Mar 1 2017, 8:02 AM
MusikAnimal added a project: DBA.

Shouldn't ipc_hex be NOT NULL? Also, are we going to have an identical table for archive?

@MusikAnimal Table design it 100% dependent of the queries that are going to be done, please add a summary of the final example queries that are made easier with that design, that is the right way to judge if it is correct or not.

Some good things:

  • The table has a primary key (required)

Questions:

  • Why do you have an extra arbitrary key as PK? Wouldn't ipc_rev_id be a candidate key, being unique and not null? Maybe you will have several entries for each revision, in which case, you should clarify that. Maybe you need to keep this even if the revision disappear (that may be the case of other similar tables)? However, again, only looking at the table, without more context, it seems this table will be mainly used alone or joining it to revision, in which case it makes sense to use rev_id as PK.
  • Why 255 bytes for the ip address? If that represents a number, ipv4 and ipv6 have, respectively, 32 and 128 bits. If with hexadecimal you mean a string with the hexadecimal representation (and you need to be inefficient because manipulation needs, which is ok but please explicitly say so, it should be enough with 13 (18 with useless separators) bytes for ipv4 and 33 (40 with separators) for ipv6 (more with scope identifier, etc., but the point really is much less than 255 bytes). If you want to store something else there, like ranges, please specify, or any other format, please clarify. It looks like you put varchar(255) arbitrarily, which is a really bad idea, specially for old versions of mysql, that we sadly have to support.

Again, I have not followed all the conversation, so I may be missing essential information here, I am just going by your last comment. All design decisions are ok as long as they are justified with queries.

If you need proper testing I may be able to create a mockup in production with production data- but only if it has been tested on non-production first.

Why 255 bytes for the ip address? If that represents a number, ipv4 and ipv6 have, respectively, 32 and 128 bits. If with hexadecimal you mean a string with the hexadecimal representation (and you need to be inefficient because manipulation needs, which is ok but please explicitly say so, it should be enough with 13 (18 with useless separators) bytes for ipv4 and 33 (40 with separators) for ipv6 (more with scope identifier, etc., but the point really is much less than 255 bytes). If you want to store something else there, like ranges, please specify, or any other format, please clarify. It looks like you put varchar(255) arbitrarily, which is a really bad idea, specially for old versions of mysql, that we sadly have to support.

Would you suggest doing something like 2 unsigned ints (for 128 bits), or as small a varchar as possible?

Would you suggest doing something like 2 unsigned ints (for 128 bits), or as small a varchar as possible?

[s/ints/longs/ I think they are caller bigints on mysql, cannot remember.]

Both are ok options, depends on what it is more practical for the queries, but not varbinary(255). varbinary(100) is much better than varbinary(255), and varbinary(50), better, for example. As small as possible. I have written about this before here: http://dba.stackexchange.com/questions/76469/mysql-varchar-length-and-performance/76470#76470 (the update I did doesn't apply to us because we are on both an old db version && MariaDB).

MusikAnimal added a comment.EditedMar 2 2017, 8:49 PM

Why 255 bytes for the ip address?

I went off of cu_changes which for some reason has it set to 255. 35 will do for us just fine, I believe :)

Also, are we going to have an identical table for archive?

Talked to you on IRC, but to repeat here, I think for this first iteration we're going to only add Special:RangeContribs, and not a Special:RangeDeletedContribs counterpart.

We also *might* bundle the range contribs functionality into the core Special:Contribs as opposed to having a separate Special page, but that's pending further investigation, and either way we need this new table as proposed below. The queries will be the same regardless of where the code lives.

Based on feedback here's what I've got for ip_changes:

--
-- Every time an edit by a logged out user is saved,
-- a row is created in ip_changes. This stores
-- the IP as a hex representation so that we can more
-- easily find edits within an IP range.
--
CREATE TABLE /*_*/ip_changes (
  -- Foreign key to the revision table, also serves as the unique primary key
  ipc_rev_id int unsigned NOT NULL PRIMARY KEY DEFAULT '0',

  -- The timestamp of the revision
  ipc_rev_timestamp binary(14) NOT NULL DEFAULT '',

  -- Hex representation of the IP address, as returned by IP::toHex()
  -- For IPv4 it will resemble: ABCD1234
  -- For IPv6: v6-ABCD1234000000000000000000000000
  -- BETWEEN is then used to identify revisions within a given range
  ipc_hex varbinary(35) NOT NULL DEFAULT '',

) /*$wgDBTableOptions*/;

CREATE INDEX /*i*/ipc_rev_timestamp ON /*_*/ip_changes (ipc_rev_timestamp);
CREATE INDEX /*i*/ipc_hex_time ON /*_*/ip_changes (ipc_hex,ipc_rev_timestamp);

Example query: Getting all relevant info about edits made after 2016-11-01 by IPs within 2602:0306::/32
This is the same info you would get for Special:Contributions, except within the given range.

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,
  rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,
  page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
  (SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags`
FROM `ip_changes`
LEFT JOIN `revision`
ON (ipc_rev_id = rev_id)
INNER JOIN `page`
ON (page_id = rev_page)
WHERE (ipc_hex BETWEEN 'v6-26020306000000000000000000000000' AND 'v6-26020306FFFFFFFFFFFFFFFFFFFFFFFF')
AND (ipc_rev_timestamp > '20161101000000')
ORDER BY ipc_rev_timestamp DESC LIMIT 51;

@jcrespo Any chance you could give this another looksee? Hopefully the one query is enough to illustrate what we'd be doing with this table. Other queries would vary only based on the provided filtering options, like they do at the normal Special:Contribs (date ranges, namespace restriction, etc.). The way we join and work off of the new table will be the same. Ten thousand thanks for your help!

jcrespo added a comment.EditedMar 8 2017, 8:52 PM

@MusikAnimal The best way to validate the query is to have some sample values (as similar as they will be in production), and run such a query in beta, labs, analytics, or your personal working space and get:

  • an EXPLAIN of the query, to observe the query plan before execution
  • FLUSH STATUS; SELECT...; SHOW STATUS like 'Hand%'; to show the performance and the query plan post-execution.

See if you can obtain those, the query is not precisely obvious to say if it is a good or a bad query.

@MusikAnimal The best way to validate the query is to have some sample values (as similar as they will be in production), and run such a query in beta, labs, analytics, or your personal working space and get:

  • an EXPLAIN of the query, to observe the query plan before execution
  • FLUSH STATUS; SELECT...; SHOW STATUS like 'Hand%'; to show the performance and the query plan post-execution.

    See if you can obtain those, the query is not precisely obvious to say if it is a good or a bad query.

We did thorough testing against the production cu_changes table, which has the same indexing (ipc_hex_time) and has the schema for the IP hex column and revision date. I am under the assumption this very closely mimics the queries we'd run on our new table. The only thing extra we had to do was get only edits and page creations cuc_type IN (0, 1) and filter out accounts cuc_user = 0, since our table will only be IP revisions. After doing this execution times were sufficiently fast, but I do not have handler stats to show you. I will try to get the new table on analytics and populated with enough data to do proper testing. For the handler stats, does it matter how much data is in the table? I guess I'll start with 1,000 rows or so and see how she flies.

DannyH moved this task from Ready for development to Bug backlog on the Community-Tech board.

@jcrespo Finally got around to this, below are my results. This test query was ran on my local Vagrant, where there are maybe 1,000 or so revisions. I'm hoping this is enough for you to do your analysis.

The query that was ran:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,
  rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,
  page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
  (SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags`
FROM `ip_changes`
LEFT JOIN `revision`
ON (ipc_rev_id = rev_id)
INNER JOIN `page`
ON (page_id = rev_page)
WHERE (ipc_hex BETWEEN 'C0000000' AND 'C00000FF')
AND (ipc_rev_timestamp > '20161101000000')
ORDER BY ipc_rev_timestamp DESC LIMIT 51;

Running EXPLAIN:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1PRIMARYpageALLPRIMARYNULLNULLNULL4Using temporary; Using filesort
1PRIMARYrevisionrefPRIMARY,rev_page_id,page_timestamp,page_user_timestamprev_page_id4wiki.page.page_id9Using where
1PRIMARYip_changeseq_refPRIMARY,ipc_rev_timestamp,ipc_hex_timePRIMARY4wiki.revision.rev_id1Using where
2DEPENDENT SUBQUERYchange_tagrefchange_tag_rev_tagchange_tag_rev_tag5wiki.revision.rev_id1Using where; Using index

And got nothing for the handler stats (I did run FLUSH STATUS; first):

Handler_commit0
Handler_delete0
Handler_discover0
Handler_prepare0
Handler_read_first0
Handler_read_key0
Handler_read_last0
Handler_read_next0
Handler_read_prev0
Handler_read_rnd0
Handler_read_rnd_next0
Handler_rollback0
Handler_savepoint0
Handler_savepoint_rollback0
Handler_update0
Handler_write0

A second query, identical to the first except for an IPv6 range:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,   rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,   page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,   (SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `ip_changes` LEFT JOIN `revision` ON (ipc_rev_id = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (ipc_hex BETWEEN 'v6-26020306000000000000000000000000' AND 'v6-26020306FFFFFFFFFFFFFFFFFFFFFFFF') AND (ipc_rev_timestamp > '20161101000000') ORDER BY ipc_rev_timestamp DESC LIMIT 51;

Now EXPLAIN suggests the index is being used:

1PRIMARYip_changesrangePRIMARY,ipc_rev_timestamp,ipc_hex_timeipc_hex_time51NULL1Using where; Using index; Using filesort
1PRIMARYrevisioneq_refPRIMARY,rev_page_id,page_timestamp,page_user_timestampPRIMARY4wiki.ip_changes.ipc_rev_id1Using where
1PRIMARYpageeq_refPRIMARYPRIMARY4wiki.revision.rev_page1
2DEPENDENT SUBQUERYchange_tagrefchange_tag_rev_tagchange_tag_rev_tag5wiki.revision.rev_id1Using where; Using index

And handler stats:

Handler_commit1
Handler_delete0
Handler_discover0
Handler_prepare0
Handler_read_first0
Handler_read_key1
Handler_read_last0
Handler_read_next0
Handler_read_prev0
Handler_read_rnd0
Handler_read_rnd_next0
Handler_rollback0
Handler_savepoint0
Handler_savepoint_rollback0
Handler_update0
Handler_write0

I will finally note that I'm not forcing any indexes like the normal Special:Contribs and CheckUser does. That may help, not sure.

jcrespo added a comment.EditedMar 22 2017, 5:56 PM

You have to run the query (not the explain).

FLUSH STATUS; SELECT ...; SHOW STATUS like 'Hand%';

The type:ALL on page and the filesort don't look good... But need the handlers to confirm.

Maybe it is the lack of rows- if there are just a few rows on the table or the conditions have low selectivity, the plans will not be the same, and they cannot be used as a proper test.

MusikAnimal added a comment.EditedMar 22 2017, 5:59 PM

You have to run the query (not the explain).

FLUSH STATUS; SELECT ...; SHOW STATUS like 'Hand%';

The type:ALL on page and the filesort don't look good... But need the handlers to confirm.

I did run the query (sorry for the confusion), like so:

FLUSH STATUS;

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_minor_edit,   rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,   page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,   (SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id) AS `ts_tags` FROM `ip_changes` LEFT JOIN `revision` ON (ipc_rev_id = rev_id) INNER JOIN `page` ON (page_id = rev_page) WHERE (ipc_hex BETWEEN 'v6-26020306000000000000000000000000' AND 'v6-26020306FFFFFFFFFFFFFFFFFFFFFFFF') AND (ipc_rev_timestamp > '20161101000000') ORDER BY ipc_rev_timestamp DESC LIMIT 51;

SHOW STATUS like 'Hand%';

Same for the IPv4 version.

If these results don't look, I can report back with what I get when forcing indexes, which is how it is done in both CheckUser and Special:Contribs.

jcrespo added a comment.EditedMar 22 2017, 6:02 PM

Then it is the rows- running that on 4/9 rows is not useful (there are 1000 million revision rows only on enwiki). Go to labs and copy some public revisions and pages, that is the easiest to get some sample data.

Compare the results you got, doing filesorts and full table scans with the ones on T156318#3023578. The test invalides your thesis, not validates it. Probably a fault on the test.

Then it is the rows- running that on 4/9 rows is not useful (there are 1000 million revision rows only on enwiki). Go to labs and copy some public revisions and pages, that is the easiest to get some sample data.

The issue with Labs is that I need the revisions in a queryable range (max /16 for IPv4, /32 for IPv6), and there's no way to do that :( ...hence why we want to create this tool :) Copying a month's worth of revision data might provide enough data to do proper testing, but I question if I can fit that much on my local.

Maybe this is what you meant, but what if I created the table on labsdb1003 so that I can JOIN it against revision? The thing is I need to populate the new table with a lot of data, which hopefully doesn't mean we run into issues like T133321 again.

Issues with other environments:

  • Analytics slaves: I asked about creating the new table and populating it, but they said that was not a good testing environment, and also I do not have rights to create tables.
  • Beta cluster: I see only a handful of IP edits over the past 30 days, so I don't know how helpful it would be.
  • My local: There are over 1,000 edits by random IPs within 192.0.0.0/24, which I generated using a script (P4725). From my experience it would be unusual to see that many edits within a range that small in production, all within a short timeframe. So in a way my local data should be testing the limits, at least in terms of the amount of data returned. However, all the edits are to a single page, so maybe spreading it across many pages would help?

Let me create a test table with you somewhere- but please give me a one-liner to set it up.- E.g.

CREATE TABLE /* this would be the mock table to test */ revision_2 like revision; ALTER TABLE revision_2 ADD INDEX blah blah ; INSERT INTO revision_2 SELECT * FROM revision WHERE rev_id < 1000; /* This is query #1 to test */SELECT ... ; /* This is query number 2 to test */ SELECT ...;

I will create it on a passive slave you have access to and delete it afterwards.

Let me create a test table with you somewhere- but please give me a one-liner to set it up.- E.g.

Awesome, thank you :) Here's a one-liner to create the table and add the indexing:

CREATE TABLE ip_changes (ipc_rev_id int unsigned NOT NULL PRIMARY KEY DEFAULT '0', ipc_rev_timestamp binary(14) NOT NULL DEFAULT '', ipc_hex varbinary(35) NOT NULL DEFAULT ''); CREATE INDEX ipc_rev_timestamp ON ip_changes (ipc_rev_timestamp); CREATE INDEX ipc_hex_time ON ip_changes (ipc_hex,ipc_rev_timestamp);

From here we can't do an INSERT with a SELECT subquery because we need the hexadecimal representation of the IPs. With MediaWiki I do this with use \IP; IP::toHex( '192.0.0.1' ). Is there any MediaWiki installation we could hook up or a way to use eval.php against the slave you create? If so I am happy to do the work of populating the new table.

@MusikAnimal that is very difficult. I need to disable the binary logs and use SUPER to write to a read-only slave, and that would require giving you root privileges.

Can you create a SELECT mysql expression for the table, just for the data creation? there are several helper functions: https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html and probably that and IF( and/or CONCAT( would be enough to generate the full table without having to import and export everything?

I have created the table on db2048.codfw.wmnet:enwiki , but obviously, it is empty.

For example:

root@db2048[enwiki]> SELECT HEX(INET_ATON('192.0.0.1'));
+-----------------------------+
| HEX(INET_ATON('192.0.0.1')) |
+-----------------------------+
| C0000001                    |
+-----------------------------+
1 row in set (0.02 sec)

@jcrespo Nice! It looks like SELECT HEX(INET6_ATON('192.168.0.1')); does exactly what we want, but it doesn't work on my local MySQL. If it works on the slave you created, I think you could do something like (I'm unable to test):

INSERT INTO ip_changes (ipc_rev_id, ipc_rev_timestamp, ipc_hex) SELECT rev_id, rev_timestamp, CONCAT(IF(LENGTH(rev_user_text) > 15, 'v6-', ''), HEX(INET6_ATON(rev_user_text))) AS ipc_hex FROM revision WHERE rev_timestamp > '20170222000000' AND rev_user = 0;

That will give you all IP edits over the last month. The hex for IPv6 is prefixed with v6-

Thanks, I may be going soon, but I will try to execute that and see what we get- and have it ready for tomorrow. There is one thing I last may need from you. Please remind us to delete the table when tests are finished- otherwise when the table is created everywhere, replication will fail (and that means an outage).

jcrespo added a comment.EditedMar 22 2017, 7:44 PM

It took less time than I thought- you can check the table now.

jcrespo moved this task from Triage to In progress on the DBA board.Mar 22 2017, 8:04 PM

I am in the process of getting prod db access and will get back to you. Rest assured I won't forget about this, though, we will get that table deleted when we are finished :) There a slew of types of queries I would like to test. I happen to know of some busy-ish ranges over the past month, too, so hopefully this go around we'll have some definitive answers on query performance.

@jcrespo So I've got prod db access now and have finally got back to this task... Apparently I need "RELOAD" privileges to flush the handler stats?

mysql:wikiadmin@db2048.codfw.wmnet [enwiki]> FLUSH STATUS;
ERROR 1227 (42000): Access denied; you need (at least one of) the RELOAD privilege(s) for this operation

Are you able to grant me these privileges, or should I provide you with some sample queries to test?

In what I believe to be good news, I tried some of the queries above that we were testing on cu_changes and they all ran very fast :) The one query that took ~42 seconds took only 0.38 seconds with ip_changes! I realize this does not give you the stats you need, but nonetheless I found it promising.

@jcrespo Sorry to bug you. I'm guessing I will not be given the necessary rights to FLUSH STATUSon db2048, so I have prepared a series of test queries for you. Please note these queries are not hand-written, but are constructed by the ContribsPager and Revision classes. The only changes I've made are to JOIN with ip_changes and add the BETWEEN clause.

Normal query at Special:Contributions, but for a very busy IPv6 range:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB90FFFFFFFFFFFFFFFFFFFFFFFF')
AND ((rev_deleted & 4) = 0)
ORDER BY rev_timestamp DESC LIMIT 51;

Same query but with more results, and limited to revisions within a certain date range. This query is not currently possible to do with Special:Contribs, but will be with T120733:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB90FFFFFFFFFFFFFFFFFFFFFFFF')
AND ((rev_deleted & 4) = 0)
AND rev_timestamp > '20170301000000'
AND rev_timestamp < '20170401000000'
ORDER BY rev_timestamp DESC LIMIT 501;

Another similar query, but with various options set through the Special:Contribs interface:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
INNER JOIN `change_tag` ON ((ct_rev_id=rev_id))
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB90FFFFFFFFFFFFFFFFFFFFFFFF')
AND (rev_id = page_latest)
AND (rev_parent_id = 0)
AND (rev_minor_edit = 0)
AND ((rev_deleted & 4) = 0)
AND ct_tag = 'Vandal'
AND (rev_timestamp>='20170301000000')
ORDER BY rev_timestamp LIMIT 501;

Another basic query but using a busy-ish IPv4 range:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN '05AC0000' AND '05ACFFFF')
AND ((rev_deleted & 4) = 0)
AND (rev_timestamp>='20170301000000')
ORDER BY rev_timestamp LIMIT 51;

All of the above queries use rev_timestamp instead of ipc_rev_timestamp. It currently is very tricky to get the code to use ipc_rev_timestamp conditionally if we're querying for a range, but I can probably make it happen.

For now, going by trivial run times, it seems using rev_timestamp is just as fast. If possible, it would be great if we could run the same tests above but instead using ipc_rev_timestamp, to see if it helps at all. If it doesn't, then we can drop that field from the table altogether and save us the hassle of reworking more code.

For convenience, below are the same queries except using ipc_rev_timestamp:


SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB90FFFFFFFFFFFFFFFFFFFFFFFF')
AND ((rev_deleted & 4) = 0)
ORDER BY ipc_rev_timestamp DESC LIMIT 51;


SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB90FFFFFFFFFFFFFFFFFFFFFFFF')
AND ((rev_deleted & 4) = 0)
AND ipc_rev_timestamp > '20170301000000'
AND ipc_rev_timestamp < '20170401000000'
ORDER BY ipc_rev_timestamp DESC LIMIT 501;


SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
INNER JOIN `change_tag` ON ((ct_rev_id=rev_id))
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB90FFFFFFFFFFFFFFFFFFFFFFFF')
AND (rev_id = page_latest)
AND (rev_parent_id = 0)
AND (rev_minor_edit = 0)
AND ((rev_deleted & 4) = 0)
AND ct_tag = 'Vandal'
AND (ipc_rev_timestamp>='20170301000000')
ORDER BY ipc_rev_timestamp LIMIT 501;


SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN '05AC0000' AND '05ACFFFF')
AND ((rev_deleted & 4) = 0)
AND (ipc_rev_timestamp>='20170301000000')
ORDER BY ipc_rev_timestamp LIMIT 51;

I now have working IP range support at Special:Contributions using the above queries (https://gerrit.wikimedia.org/r/#/c/349457/). While there's still more work to be done (add tests, etc.), the major blocker is getting approval on the new table. It'd be great if we could flesh this out sometime in May, in hopes for a Q1 release. Thanks a million!

you do not need flush status, as long as you run them as the first query of the connection- just connect, run the query and do show status like 'Hand%';.

you do not need flush status, as long as you run them as the first query of the connection- just connect, run the query and do show status like 'Hand%';.

Awesome :) So before I give you the stats, I am realizing another problem – that ipc_rev_timestamp may or may not be needed. My implementation approach has changed since we first started this, but in short, it is non-trivial to make the ReverseChronologicalPager class to conditionally use ipc_rev_timestamp for IP ranges. It is instead using the normal rev_timestamp. I want to know how well things work without ipc_rev_timestamp, so I'm hoping you can copy the existing ip_changes table to a new one with a different schema. Then I can run tests on both and decide whether it's worth the extra effort to make ReverseChronologicalPager use ipc_rev_timestamp column.

I think the necessary queries are as follows:

CREATE TABLE ip_changes2 AS SELECT * FROM ip_changes;
DROP INDEX ipc_rev_timestamp ON ip_changes2;
DROP INDEX ipc_hex_time ON ip_changes2;
ALTER TABLE ip_changes2 DROP COLUMN ipc_rev_timestamp;
CREATE INDEX ipc_hex ON ip_changes2 (ipc_hex);

I have the stats ready to go for the current ip_changes table if you want them, but again, it's using ipc_hex_time index (according to the EXPLAIN) which is on (ipc_hex,ipc_rev_timestamp), where ipc_rev_timestamp is not being use in our queries at all. Surely you want stats based on a finalized schema.

Another quick question – when evaluating handler stats, lower numbers mean better performance, right?

so I'm hoping you can copy the existing

I can do that, but not probably in the next 2 weeks- there is important maintenance scheduled next week. After that me or @Marostegui will help you.

when evaluating handler stats, lower numbers mean better performance, right?

Yes 1 unit = one row read

@jcrespo Are you able to create that new table now, by chance? Here's the SQL that I think will do what I need:

CREATE TABLE ip_changes2 AS SELECT * FROM ip_changes;
DROP INDEX ipc_rev_timestamp ON ip_changes2;
DROP INDEX ipc_hex_time ON ip_changes2;
ALTER TABLE ip_changes2 DROP COLUMN ipc_rev_timestamp;
CREATE INDEX ipc_hex ON ip_changes2 (ipc_hex);

So then I can compare which is faster – with or without a replicated timestamp. Many thanks for your help!

yes, I can do this now.

I've run the following on db2048:

root@db2048.codfw.wmnet[enwiki]> CREATE TABLE ip_changes2 like ip_changes;
Query OK, 0 rows affected (0.12 sec)

root@db2048.codfw.wmnet[enwiki]> ALTER TABLE ip_changes2 DROP INDEX ipc_rev_timestamp, DROP INDEX ipc_hex_time, DROP COLUMN ipc_rev_timestamp, ADD INDEX ipc_hex (ipc_hex);
Query OK, 0 rows affected (0.15 sec)
Records: 0  Duplicates: 0  Warnings: 0

root@db2048.codfw.wmnet[enwiki]> INSERT INTO ip_changes2 SELECT ipc_rev_id, ipc_hex FROM  ip_changes;
Query OK, 859289 rows affected (30.58 sec)
Records: 859289  Duplicates: 0  Warnings: 0
jcrespo updated the task description. (Show Details)May 15 2017, 10:52 AM

Well here are my results, sorry for the super long post! There are two variations for each query: (a) which is on the ip_changes table using the ipc_rev_timestamp column, and (b) on the ip_changes2 using rev_timestamp from the revision table.

[1a] (ip_changes/ipc_rev_timestamp) - typical query for busy IPv6 range
SELECT rev_id,rev_page,rev_text_id,ipc_rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB90FFFFFFFFFFFFFFFFFFFFFFFF')
AND ((rev_deleted & 4) = 0)
ORDER BY ipc_rev_timestamp DESC LIMIT 51;

Run time:
51 rows in set (0.05 sec)

EXPLAIN:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1PRIMARYip_changesrangePRIMARY, ipc_hex_timeipc_hex_time37NULL4614Using where; Using index; Using filesort
1PRIMARYrevisioneq_refPRIMARY, page_timestamp, page_user_timestamp, rev_page_idPRIMARY4enwiki.ip_changes.ipc_rev_id1Using where
1PRIMARYpageeq_refPRIMARYPRIMARY4enwiki.revision.rev_page1
1PRIMARYusereq_refPRIMARYPRIMARY4enwiki.revision.rev_user1Using where
2DEPENDENT SUBQUERYchange_tagrefct_rev_id, change_tag_rev_tagct_rev_id5enwiki.revision.rev_id1Using where; Using index

Handler stats:

Variable_nameValue
Handler_commit1
Handler_read_key149
Handler_read_next2318
Handler_read_rnd_next152
Handler_tmp_write150

[1b] (ip_changes2/rev_timestamp) - typical query for busy IPv6 range
SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes2` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB90FFFFFFFFFFFFFFFFFFFFFFFF')
AND ((rev_deleted & 4) = 0)
ORDER BY rev_timestamp DESC LIMIT 51;

Run time:
51 rows in set (0.11 sec)

EXPLAIN:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1PRIMARYip_changes2rangePRIMARY, ipc_hexipc_hex37NULL2245Using where; Using index; Using temporary; Using filesort
1PRIMARYrevisioneq_refPRIMARY, page_timestamp, page_user_timestamp, rev_page_idPRIMARY4enwiki.ip_changes2.ipc_rev_id1Using where
1PRIMARYpageeq_refPRIMARYPRIMARY4enwiki.revision.rev_page1
1PRIMARYusereq_refPRIMARYPRIMARY4enwiki.revision.rev_user1Using where
2DEPENDENT SUBQUERYchange_tagrefct_rev_id, change_tag_rev_tagct_rev_id5enwiki.revision.rev_id1Using where; Using index

Handler stats:

Variable_nameValue
Handler_commit1
Handler_read_key6077
Handler_read_next5461
Handler_read_rnd51
Handler_read_rnd_next2375
Handler_tmp_write2372

[2a] (ip_changes/ipc_rev_timestamp) - Same query but with more results and date range limitation
SELECT rev_id,rev_page,rev_text_id,ipc_rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB90FFFFFFFFFFFFFFFFFFFFFFFF')
AND ((rev_deleted & 4) = 0)
AND ipc_rev_timestamp > '20170301000000'
AND ipc_rev_timestamp < '20170401000000'
ORDER BY ipc_rev_timestamp DESC LIMIT 501;

Run time:
501 rows in set (0.16 sec)

EXPLAIN:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1PRIMARYip_changesrangePRIMARY, ipc_rev_timestamp, ipc_hex_timeipc_hex_time51NULL4612Using where; Using index; Using filesort
1PRIMARYrevisioneq_refPRIMARY, page_timestamp, page_user_timestamp, rev_page_idPRIMARY4enwiki.ip_changes.ipc_rev_id1Using where
1PRIMARYpageeq_refPRIMARYPRIMARY4enwiki.revision.rev_page1
1PRIMARYusereq_refPRIMARYPRIMARY4enwiki.revision.rev_user1Using where
2DEPENDENT SUBQUERYchange_tagrefct_rev_id, change_tag_rev_tagct_rev_id5enwiki.revision.rev_id1Using where; Using index

Handler stats:

Variable_nameValue
Handler_read_key1439
Handler_read_next2828
Handler_read_rnd_next152
Handler_tmp_write150

[2b] (ip_changes2/rev_timestamp) - Same query but with more results and date range limitation
SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes2` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB90FFFFFFFFFFFFFFFFFFFFFFFF')
AND ((rev_deleted & 4) = 0)
AND rev_timestamp > '20170301000000'
AND rev_timestamp < '20170401000000'
ORDER BY rev_timestamp DESC LIMIT 501;

Run time:
501 rows in set (0.23 sec)

EXPLAIN:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1PRIMARYip_changes2rangePRIMARY, ipc_hexipc_hex37NULL2245Using where; Using index; Using temporary; Using filesort
1PRIMARYrevisioneq_refPRIMARY, rev_timestamp, page_timestamp, page_user_timestamp, rev_page_idPRIMARY4enwiki.ip_changes2.ipc_rev_id1Using where
1PRIMARYusereq_refPRIMARYPRIMARY4enwiki.revision.rev_user1Using where
1PRIMARYpageeq_refPRIMARYPRIMARY4enwiki.revision.rev_page1
2DEPENDENT SUBQUERYchange_tagrefct_rev_id, change_tag_rev_tagct_rev_id5enwiki.revision.rev_id1Using where; Using index

Handler stats:

Variable_nameValue
Handler_commit1
Handler_read_key6077
Handler_read_next5461
Handler_read_rnd501
Handler_read_rnd_next2375
Handler_tmp_write2372

[3a] (ip_changes/ipc_rev_timestamp) - Another basic query but using a busy-ish IPv4 range
SELECT rev_id,rev_page,rev_text_id,ipc_rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN '05AC0000' AND '05ACFFFF')
AND ((rev_deleted & 4) = 0)
AND (ipc_rev_timestamp>='20170301000000')
ORDER BY ipc_rev_timestamp LIMIT 51;

Run time:
51 rows in set (0.04 sec)

EXPLAIN:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1PRIMARYip_changesrangePRIMARY, ipc_rev_timestamp, ipc_hex_timeipc_hex_time51NULL77Using where; Using index; Using filesort
1PRIMARYrevisioneq_refPRIMARY, page_timestamp, page_user_timestamp, rev_page_idPRIMARY4enwiki.ip_changes.ipc_rev_id1Using where
1PRIMARYpageeq_refPRIMARYPRIMARY4enwiki.revision.rev_page1
1PRIMARYusereq_refPRIMARYPRIMARY4enwiki.revision.rev_user1Using where
2DEPENDENT SUBQUERYchange_tagrefct_rev_id, change_tag_rev_tagct_rev_id5enwiki.revision.rev_id1Using where; Using index

Handler stats:

Variable_nameValue
Handler_commit1
Handler_read_key142
Handler_read_next107
Handler_read_rnd_next152
Handler_tmp_write150

[3b] (ip_changes2/rev_timestamp) - Another basic query but using a busy-ish IPv4 range
SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,
       rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
       page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
       (
         SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
       ) AS `ts_tags`
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user))
LEFT JOIN `ip_changes2` ON ((ipc_rev_id = rev_id))
WHERE (ipc_hex BETWEEN '05AC0000' AND '05ACFFFF')
AND ((rev_deleted & 4) = 0)
AND (rev_timestamp>='20170301000000')
ORDER BY rev_timestamp LIMIT 51;

Run time:
51 rows in set (0.05 sec)

EXPLAIN:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1PRIMARYip_changes2rangePRIMARY, ipc_hexipc_hex37NULL76Using where; Using index; Using temporary; Using filesort
1PRIMARYrevisioneq_refPRIMARY, rev_timestamp, page_timestamp, page_user_timestamp, rev_page_idPRIMARY4enwiki.ip_changes2.ipc_rev_id1Using where
1PRIMARYusereq_refPRIMARYPRIMARY4enwiki.revision.rev_user1Using where
1PRIMARYpageeq_refPRIMARYPRIMARY4enwiki.revision.rev_page1
2DEPENDENT SUBQUERYchange_tagrefct_rev_id, change_tag_rev_tagct_rev_id5enwiki.revision.rev_id1Using where; Using index

Handler stats:

Variable_nameValue
Handler_commit1
Handler_read_key185
Handler_read_next109
Handler_read_rnd51
Handler_read_rnd_next216
Handler_tmp_write213

Conclusions

If I'm reading this correctly, it seems pretty clear that replicating the timestamp (using ipc_rev_timestamp) is definitely faster than using rev_timestamp from revision. This means the required changes to existing PHP classes to make this work will be a bit messy... but if we have to do it we will!

So my questions are:

  • Do the stats for 1a, 2a and 3a (which use ipc_rev_timestamp) look good?
  • For ip_changes2 (which does not have ipc_rev_timestamp), would it speed things up at all to add the index ipc_rev_id, ipc_hex? Currently it is just indexing the ipc_hex column by itself.

@jcrespo (sorry for the double ping) Just wanted to say I got the code working to use ipc_rev_timestamp only for ranges, which I thought was going to be a problem. So all I need at this point is approval on the ip_changes table (not ip_changes2). Many thanks!

For the query you are doing, there are 2 main indexes, ipc_hex and ipc_rev_timestamp. When one index gets too broad, it starts using the other (both cannot be used at the same time because they both use ranges). Each one would be preferred when less rows are filtered, while the timestamp is slightly better on equal grounds because it avoids the filesort. However, there is the problem of both being too broad, or choosing the wrong one at a specific time. How do you plan to limit the ranges in both cases to avoid issues? Note that now it tries to do the best every time, but when it gets into production -and data keeps being updated and inserted and purged-, statistics may not be perfect, going to cases like:

root@db2048[enwiki]> SELECT rev_id,rev_page,rev_text_id,ipc_rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,        rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,user_name,
        page_namespace,page_title,page_is_new,page_latest,page_is_redirect,page_len,
        (          SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rev_id=rev_id
        ) AS `ts_tags` 
FROM `revision` 
INNER JOIN `page` ON ((page_id = rev_page)) 
LEFT JOIN `user` ON ((rev_user != 0) AND (user_id = rev_user)) 
LEFT JOIN `ip_changes` FORCE INDEX(ipc_rev_timestamp) ON ((ipc_rev_id = rev_id)) 
WHERE (ipc_hex BETWEEN 'v6-2607FB90000000000000000000000000' AND 'v6-2607FB900000000000000000FFFFFFFF') 
AND ((rev_deleted & 4) = 0) AND ipc_rev_timestamp > '20170301000000' 
AND ipc_rev_timestamp < '20170401000000' 
ORDER BY ipc_rev_timestamp DESC LIMIT 501;                
Empty set (1.93 sec)

Note the FORCE INDEX simulating a suboptimal query plan. Some kind of filtering should be in place to limit either too broad timestamps or too broad ip filters (or one based on the other).

How do you plan to limit the ranges in both cases to avoid issues?

The widest range that can be queried is currently set by $wgRangeContributionsCIDRLimit, which is by default is /16 for IPv4 and /32 for IPv6.

I'm not sure what to do about date ranges. The current Special:Contributions allows you to select a month/year and view all contributions prior to that. My hope is we can offer the same when checking ranges. Note also the upcoming T120733, which will allow you to check any arbitrary date range.

Note the FORCE INDEX simulating a suboptimal query plan. Some kind of filtering should be in place to limit either too broad timestamps or too broad ip filters (or one based on the other).

Right now, as far as I can tell, the code does not generate a query that forces an index, so this shouldn't be a problem, right? In earlier comments my query used FORCE INDEX, but that was back when we were still trying to work off of the CheckUser extension. The new queries are generated by ContribsPager.php and IndexPager.php.

Right now, as far as I can tell, the code does not generate a query that forces an index

No, of course not, I was simulating that manually. What I wanted to show is that under normal circumstances, that is a plan that mysql can chose to execute. I would suggest parsing the options in a way that the combination is surely to get a good plan, by limiting both to avoid a large scan of rows. Note how the problem on the given example query is that the range is very small, so no rows are found, and that means it will keep scanning rows on the very generous other range.

I would be happy if you demonstrated that such a plan cannot happen with the limits you impose. Choosing the wrong index is a common mistake that mysql does. When does the index stop using one index and start using the other? That would probably the worse case scenario.

@jcrespo I tried numerous variations of queries (wide range / short timespan, narrow range / long timespan, vice versa, etc., all that returned actual results), and MySQL seemed to choose the most efficient index each time. I tested this by looking at the "rows" column when running EXPLAIN, and doing an EXPLAIN on the same query but forcing the other index.

You obviously know more about this kind of stuff than me :), can you think of a scenario where you believe it might choose the wrong index? Maybe we need to populate more data into our test table to replicate such a scenario? Like I said there is a configurable hard limit on a queryable range (hopefully will be the same as the blockable ranges), but as for the date range I'm not sure what to restrict it to, if anything. I suppose a hard limit of a year will cover most use cases and ensure we don't have any crazy slow queries? Or even 90 days – that would at least put this on par with what CheckUser is capable of.

There is also Wikimedia\Rdbms\Database::estimateRowCount, maybe we could use it to sort of test the waters before executing the actual query? If the number is too high it can abort and throw the error "date or IP range too wide, please choose a narrower options".

If the schema looks OK, but we still want more data to figure out what limits to impose, maybe we could do this as a two-part process – first deploy the hooks and new table so that it is populated, then run test queries for a while and make sure things look good before deploying the actual interface. How does that sound?

If the schema looks OK

Looks ok to me, this won't be hopefully a high-rate request point. Can I delete the extra tables?

Looks ok to me, this won't be hopefully a high-rate request point. Can I delete the extra tables?

Yay! :D It shouldn't be super high-rate, mostly admins will use it. However it is possible they'll request wide ranges (but again I'll be sure to impose some limitations). I will explore using estimateRowCountas well.

I've got everything I need for development, so if you're OK with the schema you can delete the tables. Cheers!

Thanks, feel free to resolve this and we can continue being in contact (only if necessary) on the implementation bits. I have dropped the tables at db2048.

MusikAnimal closed this task as Resolved.
MusikAnimal moved this task from In Development to Q1 2018-19 on the Community-Tech-Sprint board.
DannyH moved this task from Bug backlog to Archive on the Community-Tech board.Jul 17 2017, 9:54 PM