Page MenuHomePhabricator

Compose Count Queries
Closed, ResolvedPublic2 Estimated Story Points

Description

Acceptance Criteria:

  • Define clear SQL queries returning the required data to support T231590 that can be used by DBAs to evaluate the potential risk, impacts and mitigations :
    • edits: total number of edits
    • editors: total number of unique editors
    • botedits: total number of edits by bots
      • Where a bot is defined as a User having with User Group: Bot
    • anonedits: total number of anonymous edits
    • revertededits: total number of reverted edits
      • Where a revert is defined as a revision labelled with either Undo or Rollback
  • Update task with all queries

Event Timeline

WDoranWMF set the point value for this task to 2.Sep 4 2019, 12:22 PM

Does it have to be multiple queries? It seems like it would be most efficient to do it all in a single table scan.

Notes to self:

  • Investigate multiple simple queries vs one complex query, per Tim.
  • Don't forget about RevDel hiding users.

In all the cases here, I used enwiki's Barack Obama article for the EXPLAIN output as an example of an article that probably has a decently large edit history. All the tests were run on db1106 (the s1 vslow replica).

  • edits: total number of edits

I note that InfoAction already does this query, and caches it for a week.

SELECT COUNT(*) FROM revision WHERE rev_page = ?;
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
| id   | select_type | table    | type | possible_keys                                  | key            | key_len | ref   | rows  | Extra       |
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
|    1 | SIMPLE      | revision | ref  | rev_page_id,page_timestamp,page_user_timestamp | page_timestamp | 4       | const | 53998 | Using index |
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+

This will do an index scan over potentially very many rows, but it seems about as efficient as we can make it.

  • editors: total number of unique editors
SELECT COUNT(DISTINCT revactor_actor)
 FROM revision_actor_temp
  JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
 WHERE rev_page = ? AND (rev_deleted & 4) = 0;
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------------+---------+-----------------------------------------+-------+-------------+
| id   | select_type | table               | type   | possible_keys                                          | key                  | key_len | ref                                     | rows  | Extra       |
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------------+---------+-----------------------------------------+-------+-------------+
|    1 | SIMPLE      | revision_actor_temp | ref    | PRIMARY,revactor_rev,page_actor_timestamp              | page_actor_timestamp | 4       | const                                   | 70686 | Using index |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | PRIMARY              | 4       | enwiki.revision_actor_temp.revactor_rev |     1 | Using where |
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------------+---------+-----------------------------------------+-------+-------------+

An index scan with a join filter is more expensive than the simple edit count query.

The redundant join condition is there to allow MySQL to know that rev_page and revactor_page are the same so it can propagate the condition. Still, we might find we need to force a STRAIGHT_JOIN here if it decides to use a plan other than the one shown here.

The (rev_deleted & 4) = 0 condition is to exclude revisions where the user has been RevDeled, as we don't want to leak information about how many such users there are.

  • botedits: total number of edits by bots
    • Where a bot is defined as a User having with User Group: Bot
SELECT COUNT(DISTINCT revactor_actor)
 FROM revision_actor_temp
  JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
  JOIN user_groups ON(actor_user = ug_user AND ug_group = 'bot')
 WHERE rev_page = ? AND (rev_deleted & 4) = 0;
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+
| id   | select_type | table               | type   | possible_keys                                             | key                  | key_len | ref                                     | rows | Extra                    |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+
|    1 | SIMPLE      | user_groups         | ref    | PRIMARY,ug_group                                          | ug_group             | 257     | const                                   |  300 | Using where; Using index |
|    1 | SIMPLE      | actor               | ref    | PRIMARY,actor_user                                        | actor_user           | 5       | enwiki.user_groups.ug_user              |    1 | Using index              |
|    1 | SIMPLE      | revision_actor_temp | ref    | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | page_actor_timestamp | 12      | const,enwiki.actor.actor_id             |    1 | Using index              |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | PRIMARY              | 4       | enwiki.revision_actor_temp.revactor_rev |    1 | Using where              |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+

MariaDB 10.1.38 seems to be coming up with a good plan here: fetch the actor IDs for all bot users, then count their edits to the page. On a wiki with a lot of bots compared to regular users, that plan might not be usable in which case it becomes significantly slower.

Other than that, the notes are the same as for the previous query.

  • anonedits: total number of anonymous edits
SELECT COUNT(DISTINCT revactor_actor)
 FROM revision_actor_temp
  JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
 WHERE rev_page = ? AND actor_user IS NULL AND (rev_deleted & 4) = 0;
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-------------------------------------------+-------+-------------+
| id   | select_type | table               | type   | possible_keys                                             | key                  | key_len | ref                                       | rows  | Extra       |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-------------------------------------------+-------+-------------+
|    1 | SIMPLE      | revision_actor_temp | ref    | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | page_actor_timestamp | 4       | const                                     | 70686 | Using index |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | PRIMARY              | 4       | enwiki.revision_actor_temp.revactor_rev   |     1 | Using where |
|    1 | SIMPLE      | actor               | eq_ref | PRIMARY,actor_user                                        | PRIMARY              | 8       | enwiki.revision_actor_temp.revactor_actor |     1 | Using where |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-------------------------------------------+-------+-------------+

This one doesn't have any nice optimization like the bots query, so it's just a bit slow.

  • revertededits: total number of reverted edits
    • Where a revert is defined as a revision labelled with either Undo or Rollback
SELECT COUNT(DISTINCT rev_id)
 FROM revision
  JOIN change_tag ON(ct_rev_id = rev_id AND ct_tag_id IN (?,?))
 WHERE rev_page = ?;
+------+-------------+---------------------+--------+--------------------------------------------------------+-----------------------+---------+------------------------+-------+--------------------------+
| id   | select_type | table               | type   | possible_keys                                          | key                   | key_len | ref                    | rows  | Extra                    |
+------+-------------+---------------------+--------+--------------------------------------------------------+-----------------------+---------+------------------------+-------+--------------------------+
|    1 | SIMPLE      | revision            | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | page_timestamp        | 4       | const                  | 53998 | Using index              |
|    1 | SIMPLE      | change_tag          | ref    | change_tag_rev_tag_id,change_tag_tag_id_id             | change_tag_rev_tag_id | 5       | enwiki.revision.rev_id |     1 | Using where; Using index |
+------+-------------+---------------------+--------+--------------------------------------------------------+-----------------------+---------+------------------------+-------+--------------------------+

Another with a bunch of joins. Here I'm not bothering to hope MariaDB optimizes a lookup from change_tag_def, I'm assuming the lookup of the two tag IDs will be done beforehand by PHP.

Investigate multiple simple queries vs one complex query, per Tim.

SELECT
  COUNT(DISTINCT rev_id) AS editcount,
  COUNT(DISTINCT CASE WHEN (rev_deleted & 4) = 0 THEN revactor_actor END) AS actorcount,
  COUNT(DISTINCT CASE WHEN (rev_deleted & 4) = 0 AND ug_group = 'bot' THEN revactor_actor END) AS botcount,
  COUNT(DISTINCT CASE WHEN (rev_deleted & 4) = 0 AND actor_user IS NULL THEN revactor_actor END) AS anoncount,
  COUNT(DISTINCT CASE WHEN ct_tag_id IN (?,?) THEN rev_id END) AS revertedcount
 FROM revision
  JOIN revision_actor_temp ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
  LEFT JOIN user_groups ON(actor_user = ug_user AND ug_group = 'bot')
  LEFT JOIN change_tag ON(ct_rev_id = rev_id AND ct_tag_id IN (?,?))
 WHERE rev_page = ?;
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+-------+--------------------------+
| id   | select_type | table               | type   | possible_keys                                             | key                   | key_len | ref                                       | rows  | Extra                    |
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+-------+--------------------------+
|    1 | SIMPLE      | revision            | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | page_timestamp        | 4       | const                                     | 53998 |                          |
|    1 | SIMPLE      | revision_actor_temp | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev          | 4       | enwiki.revision.rev_id                    |     1 | Using where              |
|    1 | SIMPLE      | actor               | eq_ref | PRIMARY                                                   | PRIMARY               | 8       | enwiki.revision_actor_temp.revactor_actor |     1 |                          |
|    1 | SIMPLE      | user_groups         | eq_ref | PRIMARY,ug_group                                          | PRIMARY               | 261     | enwiki.actor.actor_user,const             |     1 | Using where; Using index |
|    1 | SIMPLE      | change_tag          | ref    | change_tag_rev_tag_id,change_tag_tag_id_id                | change_tag_rev_tag_id | 5       | enwiki.revision.rev_id                    |     1 | Using where; Using index |
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+-------+--------------------------+

There's not much opportunity for optimization here.

At a quick test on Barack Obama, the individual queries reported a total of 0.28 seconds and had Handler_read_key=90164 and Handler_read_next=110816. The combined query reported 0.62 seconds and had Handler_read_key=100969 and Handler_read_next=28074.

I also ran the queries against enwiki's Administrators' noticeboard/Incidents page, which tends to be my go-to example of a really huge page. I had to do the STRAIGHT_JOIN on the editors and anon counts to get the good plans, after which it reported 28.79 seconds (most in the reverted edits query) and had Handler_read_key=3345633 and Handler_read_next=4422879. As for the combined query, after waiting for several minutes I just killed it. To the point that I killed it it had Handler_read_key=3761124 and Handler_read_next=1000129, so probably the slowness was in tracking all the data for the different DISTINCTs.

Then I decided to test a variation, combining only the three user-related queries:

SELECT /*!STRAIGHT_JOIN*/
  COUNT(DISTINCT revactor_actor) AS actorcount,
  COUNT(DISTINCT CASE WHEN ug_group = 'bot' THEN revactor_actor END) AS botcount,
  COUNT(DISTINCT CASE WHEN actor_user IS NULL THEN revactor_actor END) AS anoncount
 FROM revision_actor_temp
  JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
  LEFT JOIN user_groups ON(actor_user = ug_user AND ug_group = 'bot')
 WHERE rev_page = ? AND (rev_deleted & 4) = 0;

The corresponding separate queries there took 5.52 seconds with Handler_read_key=2248060 and Handler_read_next=2205315. This combined query took 5.21 seconds, with Handler_read_key=2191796 and Handler_read_next=1097577.

Then, going back to the original query, I tried using a subquery before counting to try to reduce the amount of DISTINCT data it might have to track.

SELECT COUNT(*) AS editcount, COUNT(DISTINCT actor) AS actorcount, COUNT(DISTINCT bot) AS botcount, COUNT(DISTINCT anon) AS anoncount, COUNT(reverted) AS revertedcount
 FROM (
    SELECT DISTINCT
      rev_id AS edit,
      CASE WHEN (rev_deleted & 4) = 0 THEN revactor_actor END AS actor,
      CASE WHEN (rev_deleted & 4) = 0 AND ug_group = 'bot' THEN revactor_actor END AS bot,
      CASE WHEN (rev_deleted & 4) = 0 AND actor_user IS NULL THEN revactor_actor END AS anon,
      CASE WHEN ct_tag_id IN (?,?) THEN rev_id END AS reverted
     FROM revision
      JOIN revision_actor_temp ON(revactor_rev = rev_id AND revactor_page = rev_page)
      JOIN actor ON(revactor_actor = actor_id)
      LEFT JOIN user_groups ON(actor_user = ug_user AND ug_group = 'bot')
      LEFT JOIN change_tag ON(ct_rev_id = rev_id AND ct_tag_id IN (?,?))
     WHERE rev_page = ?
   ) AS x;
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+---------+--------------------------+
| id   | select_type | table               | type   | possible_keys                                             | key                   | key_len | ref                                       | rows    | Extra                    |
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+---------+--------------------------+
|    1 | PRIMARY     | <derived2>          | ALL    | NULL                                                      | NULL                  | NULL    | NULL                                      | 1920638 |                          |
|    2 | DERIVED     | revision            | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | rev_page_id           | 4       | const                                     | 1920638 | Using temporary          |
|    2 | DERIVED     | revision_actor_temp | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev          | 4       | enwiki.revision.rev_id                    |       1 | Using where              |
|    2 | DERIVED     | actor               | eq_ref | PRIMARY                                                   | PRIMARY               | 8       | enwiki.revision_actor_temp.revactor_actor |       1 |                          |
|    2 | DERIVED     | user_groups         | eq_ref | PRIMARY,ug_group                                          | PRIMARY               | 261     | enwiki.actor.actor_user,const             |       1 | Using where; Using index |
|    2 | DERIVED     | change_tag          | ref    | change_tag_rev_tag_id,change_tag_tag_id_id                | change_tag_rev_tag_id | 5       | enwiki.revision.rev_id                    |       1 | Using where; Using index |
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+---------+--------------------------+

That came out relatively well, taking 20.86 seconds with Handler_read_key=4150667, Handler_read_next=1119996, Handler_read_rnd_next=3242909, and Handler_tmp_write=3242928. It did take 0.60 seconds for Barack Obama still, though.

@Anomie ISTM that we might get some value out of doing separate queries if the results are cached separately? Pseudo-code wise,

if (cache hit for total edits)
  use cache value for total edits
else
  run query for total edits
  stick results in the cache

if (cache hit for anonedits)
   use cache value for anonedits
else
  run query for anonedits
  stick results in the cache

/* and so on... */

I guess that would only matter for counts that are used by other code (like total edits); otherwise the LRU profile for each value would be identical (it's all getting fetched in this same block of code).

Speaking of caches, kind of off-topic for this ticket, but...

At save time, we might be able to manipulate the cache directly, and thus save some hits (rather than just flushing the cache). So, pseudocode something like:

if (cache hit for total edits)
  increment cache value for total edits

if (this is an anon edit and cache hit for anonedits)
  increment cache value for anon edits

if (this is a bot edit and cache hit for botedits)
  increment cache value for bot edits

if (this is a revert edit and cache hit for reverted edits)
  increment cache value for reverted edits

I don't think the count of unique editors can work that way... unless we cache the list of unique editors as well as the count.

if (cache hit for unique editors list)
  if (current editor is not in the cached unique editor list)
    append the current editor to the cached unique editor list
    if (cache hit for unique editors count)
      increment cache value for unique editors count
    else
      set cache value for unique editors count to length of unique editors list

We can probably cut down on actual queries quite a bit with write-time cache manipulation.

Last couple of points, and then I promise I'll stop commenting.

  • The initial scale of the API calls using these queries will be OOM 1000 calls/week. It's niche functionality and only will be used by the iOS client initially. That will probably change gradually as other official clients and third-party developers pick up on the API call. That said, probably only a tiny minority of end users will be interested in this information.
  • I confirmed with the iOS team that the freshness of the edit counts isn't that important, and that a few days to a week of staleness is OK on the counts. (Actual history probably needs to be pretty fresh, though.) The only situation we could think of where that isn't the case is if the total edit count is less than a few history pages long (say, for a new page or a rarely-edited page). In these cases, it would be jarring for the user to see an edit count of 3 when they can see 14 revisions listed right below it. Humans rarely scroll back in history more than a few pages, so it would probably be OK to aggressively cache the edit count when the count is >100 (~5 pages of scrolling) and have a shorter TTL when the count is lower than that.
  • There's another case I can think of, in which a developer wants to get the whole revision history of a page. They might first get the total edit count, then allocate some fixed resources (like memory or files) for those edits, and when they actually scroll through history, they get more revisions than they expected, and they overflow their resources and their computer explodes or whatever. I think if we just document that edit counts are approximate, so Don't Do It That Way, we'll be OK.

Hmm. Your comments made me realize I did counts of unique anon editors and unique bots, not count of edits by anon editors and by bots. Grr.

On the plus side it's an easy enough fix: swap COUNT(DISTINCT revactor_actor) for COUNT(*) in the individual queries, and COUNT(DISTINCT revactor_actor) for COUNT(DISTINCT revactor_rev) or THEN revactor_actor END for THEN revactor_rev END in the combined queries. I'd have to check if that changes the plans or performance characteristics, although my first guess would be probably not and some quick experiments seem to support that it doesn't make very much difference to performance on enwiki's WP:ANI.

I claim full credit for debugging your queries that I neither read carefully nor commented on. You're welcome!

Question- this use case seems very similar to category counts. Wouldn't be more useful to keep this "cached" on counters and increment on edit, or does it need to be real time results? It doesn't have to be database cache if core mediawiki doesn't want to be changed- it can be anywhere else with the post-edit hook. A 1 second query on write wouldn't be a huge problem, specially if asynchronous and/or executed on the slow query servers, a 1 second query on every read may overload the available query resources. The pathological cases here are Wikidata, not enwiki, not sure which wikis this is intended to be on.

Here are corrected versions of the bot and anon queries.

  • botedits: total number of edits by bots
    • Where a bot is defined as a User having with User Group: Bot
SELECT COUNT(*)
 FROM revision_actor_temp
  JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
  JOIN user_groups ON(actor_user = ug_user AND ug_group = 'bot')
 WHERE rev_page = ? AND (rev_deleted & 4) = 0;
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+
| id   | select_type | table               | type   | possible_keys                                             | key                  | key_len | ref                                     | rows | Extra                    |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+
|    1 | SIMPLE      | user_groups         | ref    | PRIMARY,ug_group                                          | ug_group             | 257     | const                                   |  300 | Using where; Using index |
|    1 | SIMPLE      | actor               | ref    | PRIMARY,actor_user                                        | actor_user           | 5       | enwiki.user_groups.ug_user              |    1 | Using index              |
|    1 | SIMPLE      | revision_actor_temp | ref    | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | page_actor_timestamp | 12      | const,enwiki.actor.actor_id             |    1 | Using index              |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | PRIMARY              | 4       | enwiki.revision_actor_temp.revactor_rev |    1 | Using where              |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+

The plan didn't change here.

  • anonedits: total number of anonymous edits
SELECT COUNT(*)
 FROM revision_actor_temp
  JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
 WHERE rev_page = ? AND actor_user IS NULL AND (rev_deleted & 4) = 0;
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-------------------------------------------+-------+-------------+
| id   | select_type | table               | type   | possible_keys                                             | key                  | key_len | ref                                       | rows  | Extra       |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-------------------------------------------+-------+-------------+
|    1 | SIMPLE      | revision_actor_temp | ref    | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | page_actor_timestamp | 4       | const                                     | 70706 | Using index |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | PRIMARY              | 4       | enwiki.revision_actor_temp.revactor_rev   |     1 | Using where |
|    1 | SIMPLE      | actor               | eq_ref | PRIMARY,actor_user                                        | PRIMARY              | 8       | enwiki.revision_actor_temp.revactor_actor |     1 | Using where |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-------------------------------------------+-------+-------------+

Same plan here too. Performance seems the same too.

Investigate multiple simple queries vs one complex query, per Tim.

SELECT
  COUNT(DISTINCT rev_id) AS editcount,
  COUNT(DISTINCT CASE WHEN (rev_deleted & 4) = 0 THEN revactor_actor END) AS actorcount,
  COUNT(DISTINCT CASE WHEN (rev_deleted & 4) = 0 AND ug_group = 'bot' THEN rev_id END) AS botcount,
  COUNT(DISTINCT CASE WHEN (rev_deleted & 4) = 0 AND actor_user IS NULL THEN rev_id END) AS anoncount,
  COUNT(DISTINCT CASE WHEN ct_tag_id IN (?,?) THEN rev_id END) AS revertedcount
 FROM revision
  JOIN revision_actor_temp ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
  LEFT JOIN user_groups ON(actor_user = ug_user AND ug_group = 'bot')
  LEFT JOIN change_tag ON(ct_rev_id = rev_id AND ct_tag_id IN (?,?))
 WHERE rev_page = ?;
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+-------+--------------------------+
| id   | select_type | table               | type   | possible_keys                                             | key                   | key_len | ref                                       | rows  | Extra                    |
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+-------+--------------------------+
|    1 | SIMPLE      | revision            | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | page_timestamp        | 4       | const                                     | 54008 |                          |
|    1 | SIMPLE      | revision_actor_temp | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev          | 4       | enwiki.revision.rev_id                    |     1 | Using where              |
|    1 | SIMPLE      | actor               | eq_ref | PRIMARY                                                   | PRIMARY               | 8       | enwiki.revision_actor_temp.revactor_actor |     1 |                          |
|    1 | SIMPLE      | user_groups         | eq_ref | PRIMARY,ug_group                                          | PRIMARY               | 261     | enwiki.actor.actor_user,const             |     1 | Using where; Using index |
|    1 | SIMPLE      | change_tag          | ref    | change_tag_rev_tag_id,change_tag_tag_id_id                | change_tag_rev_tag_id | 5       | enwiki.revision.rev_id                    |     1 | Using where; Using index |
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+-------+--------------------------+

Same plan here. Same performance on Obama too. Against WP:ANI it actually finished this time; it took 40.80 seconds, with Handler_read_key=4151990 and Handler_read_next=1120445.

SELECT /*!STRAIGHT_JOIN*/
  COUNT(DISTINCT revactor_actor) AS actorcount,
  COUNT(DISTINCT CASE WHEN ug_group = 'bot' THEN rev_id END) AS botcount,
  COUNT(DISTINCT CASE WHEN actor_user IS NULL THEN rev_id END) AS anoncount
 FROM revision_actor_temp
  JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
  LEFT JOIN user_groups ON(actor_user = ug_user AND ug_group = 'bot')
 WHERE rev_page = ? AND (rev_deleted & 4) = 0;

Same performance as before.

SELECT COUNT(*) AS editcount, COUNT(DISTINCT actor) AS actorcount, COUNT(bot) AS botcount, COUNT(anon) AS anoncount, COUNT(reverted) AS revertedcount
 FROM (
    SELECT DISTINCT
      rev_id AS edit,
      CASE WHEN (rev_deleted & 4) = 0 THEN revactor_actor END AS actor,
      CASE WHEN (rev_deleted & 4) = 0 AND ug_group = 'bot' THEN rev_id END AS bot,
      CASE WHEN (rev_deleted & 4) = 0 AND actor_user IS NULL THEN rev_id END AS anon,
      CASE WHEN ct_tag_id IN (?,?) THEN rev_id END AS reverted
     FROM revision
      JOIN revision_actor_temp ON(revactor_rev = rev_id AND revactor_page = rev_page)
      JOIN actor ON(revactor_actor = actor_id)
      LEFT JOIN user_groups ON(actor_user = ug_user AND ug_group = 'bot')
      LEFT JOIN change_tag ON(ct_rev_id = rev_id AND ct_tag_id IN (?,?))
     WHERE rev_page = ?
   ) AS x;
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+---------+--------------------------+
| id   | select_type | table               | type   | possible_keys                                             | key                   | key_len | ref                                       | rows    | Extra                    |
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+---------+--------------------------+
|    1 | PRIMARY     | <derived2>          | ALL    | NULL                                                      | NULL                  | NULL    | NULL                                      | 1986000 |                          |
|    2 | DERIVED     | revision            | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | rev_page_id           | 4       | const                                     | 1986000 | Using temporary          |
|    2 | DERIVED     | revision_actor_temp | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev          | 4       | enwiki.revision.rev_id                    |       1 | Using where              |
|    2 | DERIVED     | actor               | eq_ref | PRIMARY                                                   | PRIMARY               | 8       | enwiki.revision_actor_temp.revactor_actor |       1 |                          |
|    2 | DERIVED     | user_groups         | eq_ref | PRIMARY,ug_group                                          | PRIMARY               | 261     | enwiki.actor.actor_user,const             |       1 | Using where; Using index |
|    2 | DERIVED     | change_tag          | ref    | change_tag_rev_tag_id,change_tag_tag_id_id                | change_tag_rev_tag_id | 5       | enwiki.revision.rev_id                    |       1 | Using where; Using index |
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------------+---------+-------------------------------------------+---------+--------------------------+

Again, same plan. This seems to be somewhat slower on WP:ANI though, it took 23.70 seconds, and 0.68 for Obama.

  • anonedits: total number of anonymous edits
SELECT COUNT(DISTINCT revactor_actor)
 FROM revision_actor_temp
  JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
  JOIN actor ON(revactor_actor = actor_id)
 WHERE rev_page = ? AND actor_user IS NULL AND (rev_deleted & 4) = 0;
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-------------------------------------------+-------+-------------+
| id   | select_type | table               | type   | possible_keys                                             | key                  | key_len | ref                                       | rows  | Extra       |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-------------------------------------------+-------+-------------+
|    1 | SIMPLE      | revision_actor_temp | ref    | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | page_actor_timestamp | 4       | const                                     | 70686 | Using index |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | PRIMARY              | 4       | enwiki.revision_actor_temp.revactor_rev   |     1 | Using where |
|    1 | SIMPLE      | actor               | eq_ref | PRIMARY,actor_user                                        | PRIMARY              | 8       | enwiki.revision_actor_temp.revactor_actor |     1 | Using where |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-------------------------------------------+-------+-------------+

This one doesn't have any nice optimization like the bots query, so it's just a bit slow.

Interesting, 10.3 looks like it chooses page_timestamp instead of page_actor_timestamp - it could be a matter of refreshing the table stats

root@db1114.eqiad.wmnet[enwiki]> explain SELECT COUNT(DISTINCT revactor_actor)  FROM revision_actor_temp   JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)   JOIN actor ON(revactor_actor = actor_id)  WHERE rev_page = ? AND actor_user IS NULL AND (rev_deleted & 4) = 0;
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------+---------+-------------------------------------------+-------+-------------+
| id   | select_type | table               | type   | possible_keys                                             | key            | key_len | ref                                       | rows  | Extra       |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------+---------+-------------------------------------------+-------+-------------+
|    1 | SIMPLE      | revision            | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | page_timestamp | 4       | const                                     | 47000 | Using where |
|    1 | SIMPLE      | revision_actor_temp | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev   | 4       | enwiki.revision.rev_id                    |     1 | Using where |
|    1 | SIMPLE      | actor               | eq_ref | PRIMARY,actor_user                                        | PRIMARY        | 8       | enwiki.revision_actor_temp.revactor_actor |     1 | Using where |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------+---------+-------------------------------------------+-------+-------------+
3 rows in set (0.00 sec)

Note that's the wrong version of the query. But the correct version uses the same plan:

wikiadmin@db1114(enwiki)> explain SELECT COUNT(*)  FROM revision_actor_temp   JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)   JOIN actor ON(revactor_actor = actor_id)  WHERE rev_page = 534366 AND actor_user IS NULL AND (rev_deleted & 4) = 0;
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------+---------+-------------------------------------------+-------+-------------+
| id   | select_type | table               | type   | possible_keys                                             | key            | key_len | ref                                       | rows  | Extra       |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------+---------+-------------------------------------------+-------+-------------+
|    1 | SIMPLE      | revision            | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | page_timestamp | 4       | const                                     | 47146 | Using where |
|    1 | SIMPLE      | revision_actor_temp | eq_ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | revactor_rev   | 4       | enwiki.revision.rev_id                    |     1 | Using where |
|    1 | SIMPLE      | actor               | eq_ref | PRIMARY,actor_user                                        | PRIMARY        | 8       | enwiki.revision_actor_temp.revactor_actor |     1 | Using where |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------+---------+-------------------------------------------+-------+-------------+

For enwiki's Barack Obama article, I find this plan has Handler_read_key = 48099 and Handler_read_next = 27305, while the other plan (achieved by ignoring indexes) has Handler_read_key = 34728 and Handler_read_next = 27305.

Since it came up in discussion on T231590, here's the query for a count of minor edits:

SELECT COUNT(*) FROM revision WHERE rev_page = ? AND rev_minor_edit;
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
| id   | select_type | table    | type | possible_keys                                  | key            | key_len | ref   | rows  | Extra       |
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
|    1 | SIMPLE      | revision | ref  | rev_page_id,page_timestamp,page_user_timestamp | page_timestamp | 4       | const | 54528 | Using where |
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+

Took 0.11 seconds for Barack Obama and 2.85 for ANI.

Hello Anomie,
Thank you for directing me to this page. I've tried a few of these queries with success. However, I am having trouble with a few others. To ease the testing process, I'm using a batch of articles to test. In this case, I've tested with three article ids: 602,14640, 7659.

For example, all edits to a page

SELECT rev_page, COUNT(*) 
FROM revision 
WHERE rev_page IN (602,14640, 7659)
GROUP BY rev_page

This appears to be undercounting on the first page_id and overcounting on the second and third page_id. I will appreciate your feedback.

That's the correct query to do what you said. Why do you think it's undercounting or overcounting?

https://xtools.wmflabs.org/articleinfo/pa.wikipedia.org/ਮਾਰਖ਼ੋਰ
shows rev_page 7659 as having 80 total edits and this query gives a total of 30 total edits.

For rev_page 602 it is the opposite, the article info tool shows a higher edit count than the query outputs.

Is there another method to verify number of edits?

No it doesn't. It shows rev_page 7274 as having 80 edits. https://xtools.wmflabs.org/articleinfo/pa.wikipedia.org/ਗੱਲ-ਬਾਤ:ਮਾਰਖ਼ੋਰ is for rev_page 7659, which shows 30 edits.

eprodromou claimed this task.

This seems done