Page MenuHomePhabricator

Compose New History Queries
Closed, ResolvedPublic2 Estimated Story Points

Description

Acceptance Criteria:

  • Define clear SQL queries returning the required data to support T231597 that can be used by DBAs to evaluate the potential risk, impacts and mitigations for the following filters:
    • reverted
      • Where a revert is defined as a revision labelled with either Undo or Rollback
    • anonymous
    • bot
      • Where a bot is defined as a User having with User Group: Bot
  • Update task with all queries

Event Timeline

WDoranWMF set the point value for this task to 2.Sep 4 2019, 12:24 PM

Notes to self:

  • Don't forget about RevDel hiding users.

In all the cases here, I used enwiki's Barack Obama article for the EXPLAIN output as an example of an article that probably has a decently large edit history. All the tests were run on db1106 (the s1 vslow replica).

All of these queries are about as efficient as they can be: they'll fetch only as many rows as they need, but if there are very few qualifying edits they might wind up having to scan the page's whole history.

Note that in practice the fields and most of the joins should be coming from RevisionStore::getQueryInfo(); the PHP code would just add the WHERE conditions, LIMIT, ORDER BY, and any extra joins like user_groups.

  • botedits: total number of edits by bots
    • Where a bot is defined as a User having with User Group: Bot
SELECT /*!STRAIGHT_JOIN*/ rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_text, comment_data, comment_id, actor_user, actor_name, actor_id
 FROM revision
  JOIN revision_actor_temp ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
  JOIN user_groups ON(actor_user = ug_user AND ug_group = 'bot')
  JOIN revision_comment_temp ON(revcomment_rev = rev_id)
  JOIN comment ON(revcomment_comment_id = comment_id)
 WHERE rev_page = ? AND (rev_deleted & 4) = 0
 ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 100;
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+---------+--------------------------+
| id   | select_type | table                 | type   | possible_keys                                             | key            | key_len | ref                                                | rows    | Extra                    |
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+---------+--------------------------+
|    1 | SIMPLE      | revision              | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | page_timestamp | 4       | const                                              | 1921190 | Using where              |
|    1 | SIMPLE      | revision_actor_temp   | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev   | 4       | enwiki.revision.rev_id                             |       1 | Using where              |
|    1 | SIMPLE      | actor                 | eq_ref | PRIMARY,actor_user                                        | PRIMARY        | 8       | enwiki.revision_actor_temp.revactor_actor          |       1 | Using where              |
|    1 | SIMPLE      | user_groups           | eq_ref | PRIMARY,ug_group                                          | PRIMARY        | 261     | enwiki.actor.actor_user,const                      |       1 | Using where; Using index |
|    1 | SIMPLE      | revision_comment_temp | ref    | PRIMARY,revcomment_rev                                    | PRIMARY        | 4       | enwiki.revision.rev_id                             |       1 | Using index              |
|    1 | SIMPLE      | comment               | eq_ref | PRIMARY                                                   | PRIMARY        | 8       | enwiki.revision_comment_temp.revcomment_comment_id |       1 |                          |
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+---------+--------------------------+

Without the STRAIGHT_JOIN, it wants to do the same sort of thing it did for the botcount query in T231598. Which generally isn't all that bad for the pages I tested with, but does still fetch all rows and filesort. It can't really guess whether it'll be faster to fetch all rows and filesort or to scan in order to find the 100 matches.

  • anonedits: total number of anonymous edits
SELECT rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_text, comment_data, comment_id, actor_user, actor_name, actor_id
 FROM revision_actor_temp
  JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
  JOIN revision_comment_temp ON(revcomment_rev = rev_id)
  JOIN comment ON(revcomment_comment_id = comment_id)
 WHERE rev_page = ? AND actor_user IS NULL AND (rev_deleted & 4) = 0
 ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 100;
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+-------+-------------+
| id   | select_type | table                 | type   | possible_keys                                             | key            | key_len | ref                                                | rows  | Extra       |
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+-------+-------------+
|    1 | SIMPLE      | revision              | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | page_timestamp | 4       | const                                              | 53998 | Using where |
|    1 | SIMPLE      | revision_actor_temp   | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev   | 4       | enwiki.revision.rev_id                             |     1 | Using where |
|    1 | SIMPLE      | actor                 | eq_ref | PRIMARY,actor_user                                        | PRIMARY        | 8       | enwiki.revision_actor_temp.revactor_actor          |     1 | Using where |
|    1 | SIMPLE      | revision_comment_temp | ref    | PRIMARY,revcomment_rev                                    | PRIMARY        | 4       | enwiki.revision.rev_id                             |     1 | Using index |
|    1 | SIMPLE      | comment               | eq_ref | PRIMARY                                                   | PRIMARY        | 8       | enwiki.revision_comment_temp.revcomment_comment_id |     1 |             |
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+-------+-------------+
  • revertededits: total number of reverted edits
    • Where a revert is defined as a revision labelled with either Undo or Rollback
SELECT rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_text, comment_data, comment_id, actor_user, actor_name, actor_id
 FROM revision_actor_temp
  JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
  JOIN revision_comment_temp ON(revcomment_rev = rev_id)
  JOIN comment ON(revcomment_comment_id = comment_id)
 WHERE rev_page = ? AND (rev_deleted & 4) = 0 AND EXISTS(SELECT 1 FROM change_tag WHERE ct_rev_id = rev_id AND ct_tag_id IN (?,?))
 ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 100;
+------+-------------+-----------------------+--------+-----------------------------------------------------------+-----------------------+---------+----------------------------------------------------+-------+------------------------------------------------+
| id   | select_type | table                 | type   | possible_keys                                             | key                   | key_len | ref                                                | rows  | Extra                                          |
+------+-------------+-----------------------+--------+-----------------------------------------------------------+-----------------------+---------+----------------------------------------------------+-------+------------------------------------------------+
|    1 | PRIMARY     | revision              | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | page_timestamp        | 4       | const                                              | 53998 | Using where                                    |
|    1 | PRIMARY     | change_tag            | ref    | change_tag_rev_tag_id,change_tag_tag_id_id                | change_tag_rev_tag_id | 5       | enwiki.revision.rev_id                             |     1 | Using where; Using index; FirstMatch(revision) |
|    1 | PRIMARY     | revision_actor_temp   | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev          | 4       | enwiki.revision.rev_id                             |     1 | Using where                                    |
|    1 | PRIMARY     | actor                 | eq_ref | PRIMARY                                                   | PRIMARY               | 8       | enwiki.revision_actor_temp.revactor_actor          |     1 |                                                |
|    1 | PRIMARY     | revision_comment_temp | ref    | PRIMARY,revcomment_rev                                    | PRIMARY               | 4       | enwiki.revision.rev_id                             |     1 | Using index                                    |
|    1 | PRIMARY     | comment               | eq_ref | PRIMARY                                                   | PRIMARY               | 8       | enwiki.revision_comment_temp.revcomment_comment_id |     1 |                                                |
+------+-------------+-----------------------+--------+-----------------------------------------------------------+-----------------------+---------+----------------------------------------------------+-------+------------------------------------------------+

That turned out surprisingly well, actually. This one could be particularly vulnerable to statistics though, if the mw-undo and mw-rollback tags aren't among the most-used tags it might wind up choosing stranger plans that may or may not work as well.

@Anomie I might be reading the queries wrong, but... is there a reason that you're using a limit of 100 instead of the page size of 20? Or, rather, 21, to fetch an extra item to determine if there's another page to fetch?

The 100 is arbitrary. This task didn't specify a page size.

Cool. I wanted to comment here on some real-world optimization issues for these queries.

  • There's a fixed page size (20) for the API. I'd suggest using the trick of fetching one extra item, to determine if we need to show a link for the next page or not, but that's up to you.
  • The history is only linearly accessible; you can scroll forward and back, but you can't jump to page 17 or to the end.
  • The initial client is the iOS client, showing the history directly to end users. Typically, human beings don't scroll back in histories very far. Probably a drop off of ~10x for each page you go back. So second page traffic is 10% of first page traffic, third page traffic is 1% of first page traffic, and so on.
  • As other clients start to use this API endpoint, this usage behaviour might change. For example, automated clients that want to read the whole history would go all the way to the last page, every time.
  • It's kind of important that the most recent history is pretty fresh (a common use case would be to review recent edits to a page). I think as we go back in time for human users, it becomes less important, with say 5 or 10 pages in being a pretty rare thing to get to.

I have been reviewing @Anomie's queries and I haven't found anything strange there. I have checked those queries (with Obama's rev_page as well) with slower cold hosts with spinning disks and they do take ages to be run (which is kinda expected).
An example is:

SELECT rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_text, comment_data, comment_id, actor_user, actor_name, actor_id  FROM revision_actor_temp   JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)   JOIN actor ON(revactor_actor = actor_id)   JOIN revision_comment_temp ON(revcomment_rev = rev_id)   JOIN comment ON(revcomment_comment_id = comment_id)  WHERE rev_page = ? AND actor_user IS NULL AND (rev_deleted & 4) = 0  ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 100;

100 rows in set (5 min 30.58 sec)

However, that same query on a cold host, but with SSDs:

100 rows in set (20.73 sec)

And once it is warmed up:

100 rows in set (0.42 sec)

The query plan is the same for both hosts.
We should not worry about spinning disks hosts, as we do not have anymore of those serving as slaves in eqiad (and we are getting rid of them on codfw too). Some of the masters are still on spinning disks, but we are also scheduling failovers for them soon to get them on SSDs.

I was also interested in checking those queries on a 10.3 host specially because we might have some changes in the optimizer.
But the query plan is the same Brad got there too.

However, I am seeing different query plans (with less scanned rows) than the ones Brad saw:

explain SELECT /*!STRAIGHT_JOIN*/ rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_text, comment_data, comment_id, actor_user, actor_name, actor_id  FROM revision   JOIN revision_actor_temp ON(revactor_rev = rev_id AND revactor_page = rev_page)   JOIN actor ON(revactor_actor = actor_id)   JOIN user_groups ON(actor_user = ug_user AND ug_group = 'bot')   JOIN revision_comment_temp ON(revcomment_rev = rev_id)   JOIN comment ON(revcomment_comment_id = comment_id)  WHERE rev_page = ?? AND (rev_deleted & 4) = 0  ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 100;
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+-------+--------------------------+
| id   | select_type | table                 | type   | possible_keys                                             | key            | key_len | ref                                                | rows  | Extra                    |
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+-------+--------------------------+
|    1 | SIMPLE      | revision              | ref    | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id    | page_timestamp | 4       | const                                              | 49578 | Using where              |
|    1 | SIMPLE      | revision_actor_temp   | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev   | 4       | enwiki.revision.rev_id                             |     1 | Using where              |
|    1 | SIMPLE      | actor                 | eq_ref | PRIMARY,actor_user                                        | PRIMARY        | 8       | enwiki.revision_actor_temp.revactor_actor          |     1 | Using where              |
|    1 | SIMPLE      | user_groups           | eq_ref | PRIMARY,ug_group                                          | PRIMARY        | 261     | enwiki.actor.actor_user,const                      |     1 | Using where; Using index |
|    1 | SIMPLE      | revision_comment_temp | ref    | PRIMARY,revcomment_rev                                    | PRIMARY        | 4       | enwiki.revision.rev_id                             |     1 | Using index              |
|    1 | SIMPLE      | comment               | eq_ref | PRIMARY                                                   | PRIMARY        | 8       | enwiki.revision_comment_temp.revcomment_comment_id |     1 |                          |
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+-------+--------------------------+
6 rows in set (0.04 sec)


explain SELECT rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_text, comment_data, comment_id, actor_user, actor_name, actor_id  FROM revision_actor_temp   JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)   JOIN actor ON(revactor_actor = actor_id)   JOIN revision_comment_temp ON(revcomment_rev = rev_id)   JOIN comment ON(revcomment_comment_id = comment_id)  WHERE rev_page = ? AND actor_user IS NULL AND (rev_deleted & 4) = 0  ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 100;
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+-------+-------------+
| id   | select_type | table                 | type   | possible_keys                                             | key            | key_len | ref                                                | rows  | Extra       |
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+-------+-------------+
|    1 | SIMPLE      | revision              | ref    | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id    | page_timestamp | 4       | const                                              | 49578 | Using where |
|    1 | SIMPLE      | revision_actor_temp   | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev   | 4       | enwiki.revision.rev_id                             |     1 | Using where |
|    1 | SIMPLE      | actor                 | eq_ref | PRIMARY,actor_user                                        | PRIMARY        | 8       | enwiki.revision_actor_temp.revactor_actor          |     1 | Using where |
|    1 | SIMPLE      | revision_comment_temp | ref    | PRIMARY,revcomment_rev                                    | PRIMARY        | 4       | enwiki.revision.rev_id                             |     1 | Using index |
|    1 | SIMPLE      | comment               | eq_ref | PRIMARY                                                   | PRIMARY        | 8       | enwiki.revision_comment_temp.revcomment_comment_id |     1 |             |
+------+-------------+-----------------------+--------+-----------------------------------------------------------+----------------+---------+----------------------------------------------------+-------+-------------+
5 rows in set (0.03 sec)

However, I am seeing different query plans (with less scanned rows) than the ones Brad saw:

Those both look like the same plans to me, just different numbers of estimated rows. For the first I probably accidentally explained that one against WP:ANI rather than Obama, 1921190 rows is in line with the estimates I was seeing for that page and not at all for Obama. For the second, I suspect it's just statistic differences resulting in a slightly different estimate. There are actually only around 27290 rows in the Obama article at the moment, so both estimates are high.

reverted

Where a revert is defined as a revision labelled with either Undo or Rollback

The query contains AND (rev_deleted & 4) = 0 - does it make sense for determining the reverted revisions? I would understand not wanting to include revisions with suppressed user into anon or bot queries, since that would give out some information about the suppressed user, but reverted seem to be completely unrelated to user suppression.

Good catch. Yes, the AND (rev_deleted & 4) = 0 check should be left out of the "reverted" query.