
Compose query for minor edit count
Closed, Resolved · Public · Estimated Story Points: 1

Description

Per user story T234941: Curator gets minor edit count, we want to add a count of minor edits as one of the available count types on the existing /page/{title}/history/counts endpoint.

Create an SQL query to retrieve this count. Update this task with the query. Incorporation of the query into the existing endpoint will be performed under a separate task.

This may be somewhat related to T235657: Compose query for minor edit history

Requirements:

  • task updated with query

Event Timeline

WDoranWMF set the point value for this task to 1.Oct 16 2019, 7:41 PM
BPirkle claimed this task.Oct 21 2019, 9:37 PM
BPirkle moved this task from Ready to Doing on the Core Platform Team Workboards (Green) board.
BPirkle added a comment.EditedOct 22 2019, 2:59 PM
SELECT  COUNT(*)  FROM `revision`
WHERE rev_page = ? 
AND (rev_minor_edit != 0)  
LIMIT 1;

For rev_page 534366 (Barack_Obama):

+----------+
| COUNT(*) |
+----------+
|     7219 |
+----------+
1 row in set (0.32 sec)
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
| id   | select_type | table    | type | possible_keys                                  | key            | key_len | ref   | rows  | Extra       |
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
|    1 | SIMPLE      | revision | ref  | page_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 4       | const | 50372 | Using where |
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+

By contrast, the "edits" query for the overall edit count, which has a similar plan but an Extra of "Using index" rather than "Using where", reports execution times of at most 0.01 seconds, and frequently 0.00. Here's that query and its "Barack Obama" EXPLAIN, for comparison:

SELECT COUNT(*) 
FROM  `revision` 
WHERE rev_page = ? 
LIMIT 1;
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
| id   | select_type | table    | type | possible_keys                                  | key            | key_len | ref   | rows  | Extra       |
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
|    1 | SIMPLE      | revision | ref  | page_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 4       | const | 50390 | Using index |
+------+-------------+----------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+

The most similar code I found in the Action API was this bit in ApiQueryUserContribs.php:

			$this->addWhereIf( 'rev_minor_edit = 0', isset( $show['!minor'] ) );
			$this->addWhereIf( 'rev_minor_edit != 0', isset( $show['minor'] ) );

(https://gerrit.wikimedia.org/g/mediawiki/core/+/533346b24ffbd93a50f81b0534a6dd5e2f4f861f/includes/api/ApiQueryUserContribs.php#377)

However, this is part of a much larger query with very different characteristics.

WDoranWMF added subscribers: Marostegui, jcrespo, WDoranWMF.

@jcrespo @Marostegui Would it be possible to get DBA review of the above query, and input on whether it's acceptable?

jcrespo reassigned this task from BPirkle to WDoranWMF.Oct 28 2019, 9:04 AM
jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board.

Could you give us the full list of query possibilities that would be affected by that extra option? Please provide us with the details of the performance analysis you have done so we better understand your rationale and can advise you better, for example about the decision not to calculate it as a counter updated on edit.

BPirkle added a subscriber: Anomie.Oct 28 2019, 4:12 PM

Some context, just to be sure everyone is on the same page: the proposed query is part of the "GET history counts" core REST API endpoint, as originally implemented under T231590: Implement GET Edit Count, using queries discussed in T231598: Compose Count Queries. This endpoint returns various counts, each of which is implemented using its own query. The handler class for this API endpoint is here: https://codesearch.wmflabs.org/search/?q=class%20PageHistoryCountHandler&i=nope&files=&repos=

This endpoint is initially intended for use by the iOS app, but of course once something is publicly available there are no guarantees about how it will be used.

To my knowledge, the only performance analysis that has been performed, other than what was recorded above, was in T231598: Compose Count Queries where @Anomie considered a very similar query (and which I did not notice before posting the one above).

Regarding alternative implementation possibilities, the only constraint I'm aware of is that as part of mediawiki core, whatever we do should also work on third-party wikis and should not be specific to our production environment.

Here are representative queries for all count types for this endpoint (as logged on my local test wiki). As such, they are a bit raw, and include magic numbers that are actually defines in the code. Note that these are largely the same as previously discussed under T231598: Compose Count Queries.

SELECT  COUNT(*)  
FROM `revision_actor_temp` 
JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) 
JOIN `actor` ON ((revactor_actor = actor_id))   
WHERE rev_page = '138' AND (actor_user IS NULL) AND ((rev_deleted & 12) != 12)  LIMIT 1
SELECT  COUNT(*)  
FROM `revision_actor_temp` 
JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) 
JOIN `actor` ON ((revactor_actor = actor_id))   
WHERE rev_page = '138' 
  AND (EXISTS(SELECT  1  FROM `user_groups`    WHERE (actor.actor_user = ug_user) AND ug_group = 'bot' AND (ug_expiry IS NULL OR ug_expiry >= '20191028155235')  )) 
  AND ((rev_deleted & 12) != 12)  
LIMIT 1
SELECT  COUNT(DISTINCT revactor_actor)  
FROM `revision_actor_temp` 
JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page))   
WHERE rev_page = '138' AND ((rev_deleted & 12) != 12)  
LIMIT 1
SELECT  COUNT(*)  
FROM `revision`    
WHERE rev_page = '138'  
LIMIT 1
SELECT  COUNT(DISTINCT rev_id)  
FROM `revision` 
JOIN `change_tag` ON ((ct_rev_id = rev_id) AND ct_tag_id = '2')   
WHERE rev_page = '138'  
LIMIT 1

The endpoint provides only one count type per call; there is a "type" parameter that indicates which type the caller is interested in. There was some discussion about whether it would be preferable to provide all count types in one call, but it was determined that separate calls were better even for the number of count types we currently have, and that a combined call would become less scalable as count types were added.

@jcrespo , is that sufficient information to continue the discussion? If not, what else can we provide at this stage?

jcrespo claimed this task.Oct 28 2019, 4:22 PM
jcrespo moved this task from Blocked external/Not db team to Next on the DBA board.

That should be enough, although you may want to coordinate in addition with Anomie; he is on your team and may give some helpful pointers based on his experience with the API.

Anomie thought it looked okay. To quote him from IRC:

My opinions: T235572#5595565 looks as good as we're going to get for that one. No need for an explicit LIMIT 1 though. T235657#5599503 also looks right to me.

I'm not sure where the line between "ask the DBAs" and "use our own good judgment" is, so it seemed safest to ask your team as well. Although I'm less likely to get a t-shirt that way... ;-)

WDoranWMF added a comment.EditedOct 29 2019, 5:41 PM

@BPirkle I think on this basis it's ok for us to move forward. Sorry, I had misread the state.

WDoranWMF added a comment.EditedOct 30 2019, 2:30 PM

@jcrespo Totally aware you guys are very busy right now. Do you think we can proceed based on what we have, or are there concerns the DBAs will need more time to look into?

Thanks, that should be the #1 thing I need; in some cases I don't even need a ticket, just to be added to the review :-D

The last thing I need to know from you, which will not be on the review (although it could be in the commit message!), is the expected frequency of calls. I know those things normally depend on "user behavior", but you may know things like:

  • only called on update/not user dependent/cached
  • Number of calls to similar API endpoints
  • Worst case scenario
jcrespo added a comment.EditedOct 30 2019, 2:53 PM

Based on testing I did, this query would bring down our production servers unless run "offline" on a cron plus our vslow servers, and not exposed publicly, as it takes 16 seconds and does a full scan over millions of records because of the lack of a suitable index. This was run on one of our largest enwiki API servers:

root@db1080.eqiad.wmnet[enwiki]> EXPLAIN SELECT COUNT(*) FROM revision WHERE rev_page=1952670 and rev_minor_edit != 0\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: rev_page_id
      key_len: 4
          ref: const
         rows: 3020694
        Extra: Using where
1 row in set (0.00 sec)

root@db1080.eqiad.wmnet[enwiki]> SELECT COUNT(*) FROM revision WHERE rev_page=1952670 and rev_minor_edit != 0;
+----------+
| COUNT(*) |
+----------+
|   572001 |
+----------+
1 row in set (16.94 sec)

It is possible that an additional index on (rev_page, rev_minor_edit) may help (not sure unless we do a proof of concept).

pinging @eprodromou for metrics on the usage questions below:

  • only called on update/not user dependent/cached
  • Number of calls to similar API endpoints
  • Worst case scenario

@BPirkle @jcrespo Are there limits or optimisations we can employ to improve this?

jcrespo added a comment.EditedOct 30 2019, 3:00 PM

@BPirkle @jcrespo Are there limits or optimisations we can employ to improve this?

Sorry, I edited my last comment with a suggestion below it and you may not have seen it:

It is possible that an additional index on (rev_page, rev_minor_edit) may help (not sure unless we do a proof of concept).

Whether that works or not, one typical strategy (e.g. used for revision counts and categories) is to cache the count in the database, updating it on edit.
We can also limit the count (maybe), say with LIMIT 501, and give an exact count for a small number of revisions and "500+" for the rest, for faster results.

Note "counting the number of minor edits" is something that mediawiki never required, so it is not optimized for that (yet)- no indexes. I wonder how important it is- you can get it instantly, but it will require some extra work.

For queries that really don't need real time, we also have a list of jobs that run every week to compute some statistics that are cached on the db (e.g. pages with the most revisions, most linked pages), etc.

If we go the overengineering way, we can also store it in another analytics engine that counts faster (OLAP vs OLTP).
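
A minimal sketch of the "cache the count in the database, update it on edit" strategy, assuming a hypothetical page_counts table (no such table exists in MediaWiki core); the increment would still have to be wired into the save path:

use Wikimedia\Rdbms\IDatabase;

/**
 * Sketch only: bump a per-page minor-edit counter whenever a minor
 * edit is saved, so the read side becomes a primary-key lookup
 * instead of a scan over revision rows.
 */
function incrementMinorEditCount( IDatabase $dbw, int $pageId ) {
	$dbw->upsert(
		'page_counts',                                        // hypothetical table
		[ [ 'pc_page' => $pageId, 'pc_minor_edits' => 1 ] ],  // insert if missing
		[ 'pc_page' ],                                        // unique key
		[ 'pc_minor_edits = pc_minor_edits + 1' ],            // otherwise increment
		__METHOD__
	);
}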

jcrespo reassigned this task from jcrespo to eprodromou.Oct 30 2019, 3:04 PM
jcrespo moved this task from In progress to Blocked external/Not db team on the DBA board.

Moving temporarily to other tasks; let me know how to proceed.

@jcrespo thanks, we'll review and loop back to you.

If you limit yourself to namespace 0, it *may* be possible with some rate limiting (but I don't like that much from a user perspective):

root@db1080.eqiad.wmnet[enwiki]> SELECT COUNT(*) FROM revision WHERE rev_page=2337812 and rev_minor_edit = 1;       
+----------+
| COUNT(*) |
+----------+
|     5293 |
+----------+
1 row in set (1.68 sec)

root@db1080.eqiad.wmnet[enwiki]> EXPLAIN SELECT COUNT(*) FROM revision WHERE rev_page=2337812 and rev_minor_edit = 1;
+------+-------------+----------+------+------------------------------------------------+-------------+---------+-------
| id   | select_type | table    | type | possible_keys                                  | key         | key_len | ref   
+------+-------------+----------+------+------------------------------------------------+-------------+---------+-------
|    1 | SIMPLE      | revision | ref  | page_timestamp,page_user_timestamp,rev_page_id | rev_page_id | 4       | const 
+------+-------------+----------+------+------------------------------------------------+-------------+---------+-------
1 row in set (0.00 sec)

root@db1080.eqiad.wmnet[enwiki]> EXPLAIN SELECT COUNT(*) FROM revision WHERE rev_page=2337812 and rev_minor_edit = 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: rev_page_id
      key_len: 4
          ref: const
         rows: 93930
        Extra: Using where
1 row in set (0.00 sec)

Based on testing I did, this query would bring down our production servers unless run "offline" on a cron plus our vslow servers, and not exposed publicly, as it takes 16 seconds and does a full scan over millions of records because of the lack of a suitable index.

For my own future reference, is the difference between this and the similar queries from T231598 in the fact that this one can't be done with "Using index"?

pinging @eprodromou for metrics on the usage questions below:

  • only called on update/not user dependent/cached
  • Number of calls to similar API endpoints
  • Worst case scenario

So, this is going to be used in the iOS app when a user reviews the history of a page. It can be user-independent; it doesn't have to be exact.

It can definitely be cached, but I don't believe the code we have now uses the object cache.

Our expectation is that this functionality will be used O(1000) times per week across all sites.

I guess the worst-case scenario is that we don't add object caching, and we re-architect the front end and use this endpoint to show the same data in the Web interface, in which case there would be a lot of traffic to it.

Based on testing I did, this query would bring down our production servers unless run "offline" on a cron plus our vslow servers, and not exposed publicly, as it takes 16 seconds and does a full scan over millions of records because of the lack of a suitable index.

For my own future reference, is the difference between this and the similar queries from T231598 in the fact that this one can't be done with "Using index"?

It may help, but it really depends. The thought process I follow is that the strategy (ALL, index, ref, const) gives you the O(?) (linear, quadratic, log n), but the number of rows will be more important in some other cases (the constant). E.g. log n + 1000000 > n^2 + 1 for smaller n. Testing on production (or equivalent) will give us the real deal plus hardware "modifiers" (where the constant matters):

root@db1080.eqiad.wmnet[enwiki]> flush status; SELECT COUNT(*) FROM revision WHERE rev_page=2337812 and rev_minor_edit = 1; show status like 'Hand%';
Query OK, 0 rows affected (0.00 sec)

+----------+
| COUNT(*) |
+----------+
|     5293 |
+----------+
1 row in set (0.47 sec)

+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 51539 |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
26 rows in set (0.00 sec)

@jcrespo What is the level of effort to test out an additional index on (rev_page, rev_minor_edit)?

@jcrespo What is the level of effort to test out an additional index on (rev_page, rev_minor_edit)?

Depool it, add an index, test it, remove the index, pool it back; maybe on two hosts, s1 and s8. Faster if a test cluster is available. A day of work from beginning to end, though not a full day in hours.

I'm going to look around in our codebase for an example of doing counts with limits. But as a quick sanity check, this returns virtually immediately:

SELECT COUNT(*) FROM (SELECT 1 FROM revision WHERE rev_page=1952670 LIMIT 500) AS c;
+----------+
| COUNT(*) |
+----------+
|      500 |
+----------+
1 row in set (0.00 sec)

I'm going to look around in our codebase for an example of doing counts with limits. But as a quick sanity check, this returns virtually immediately:

SELECT COUNT(*) FROM (SELECT 1 FROM revision WHERE rev_page=1952670 LIMIT 500) AS c;
+----------+
| COUNT(*) |
+----------+
|      500 |
+----------+
1 row in set (0.00 sec)

Yep, that was exactly what I had in mind structurally, although it still requires the minor edit filter. You can mimic the usage of EXPLAIN and the Handler counters as seen above for better profiling.

@jcrespo @BPirkle and I spoke with @eprodromou. If the limit will work and not break the DBs for now, we think we'll move forward with that option, but we'll split off a separate ticket to dig into using caching or weekly-run queries. @eprodromou has said that if the data is weekly, that is likely to be ok.

I'm going to look around in our codebase for an example of doing counts with limits.

Look for uses of IDatabase->selectRowCount(). Two caveats to be aware of:

  • Obviously, whether there are 500 or 5,000,000 rows, it'll return "500". @eprodromou would have to determine whether a capped count would serve.
  • The actual number of rows touched will still probably be over 500, depending on how often the rev_minor_edit != 0 filter actually matches. In extreme cases, it might wind up touching all of the rows anyway because there are so few passing the filter.

There's also IDatabase->estimateRowCount(), but that's often significantly off on the count. It uses the "rows" column in the EXPLAIN output, and you can see how accurate that was in some of the examples above.
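
For concreteness, a capped count via IDatabase::selectRowCount() would look roughly like this sketch (the function name is made up; the 501 limit mirrors the discussion above, and as noted the rows actually examined can still exceed it):

use Wikimedia\Rdbms\IDatabase;

/**
 * Sketch: count minor edits for a page, with the outer count capped
 * via a LIMITed subquery (which is what selectRowCount() generates).
 */
function getCappedMinorEditCount( IDatabase $dbr, int $pageId ) {
	return $dbr->selectRowCount(
		'revision',
		'1',
		[
			'rev_page' => $pageId,
			'rev_minor_edit != 0',
		],
		__METHOD__,
		[ 'LIMIT' => 501 ]
	);
}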

@eprodromou would have to determine whether a capped count would serve.

It will work for now. I think we put a structure in T234941 that should let us raise the cap later without requiring changes on the client side.

@BPirkle let's say the cap is 500. You should set your LIMIT to 501. If you get back 501, the output should be {"count": 500, "limit": true} meaning "more than 500". If you get back 500, the output should be {"count": 500, "limit": false} meaning "exactly 500". Make sense?

Ideally I'd like to see us collect these history counts into a database table that gets updated at write time or in the job queue.

I hope we're going to apply the limit to all the count types, right?

Prospective change uploaded to https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/545590/

Here's the actual query executed by that code, per logging on my local:

SELECT  COUNT(*) AS `rowcount`  
FROM (
  SELECT  1  FROM `revision`    
  WHERE rev_page = ? 
    AND (rev_minor_edit != 0)  
    LIMIT 501  
) `tmp_count`

Execution times for various pages (obtained by executing the query manually):

+----------+----------+
| rev_page | time     |
+----------+----------+
|   534366 | 0.15 sec |
|  1952670 | 0.39 sec |
|  2337812 | 0.46 sec |
+----------+----------+

Those aren't exactly crazy fast. And as @Anomie points out they're dependent on the actual page data so results may not scale strictly based on a page's revision count, or even number of minor edits. But at least for these three large example pages, they're a whole lot better than the unlimited query.

EXPLAIN for rev_page = 534366:

+------+-------------+------------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
| id   | select_type | table      | type | possible_keys                                  | key            | key_len | ref   | rows  | Extra       |
+------+-------------+------------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
|    1 | PRIMARY     | <derived2> | ALL  | NULL                                           | NULL           | NULL    | NULL  |   501 |             |
|    2 | DERIVED     | revision   | ref  | rev_page_id,page_timestamp,page_user_timestamp | page_timestamp | 4       | const | 50254 | Using where |
+------+-------------+------------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+

EXPLAIN for rev_page = 1952670:

+------+-------------+------------+------+------------------------------------------------+----------------+---------+-------+---------+-------------+
| id   | select_type | table      | type | possible_keys                                  | key            | key_len | ref   | rows    | Extra       |
+------+-------------+------------+------+------------------------------------------------+----------------+---------+-------+---------+-------------+
|    1 | PRIMARY     | <derived2> | ALL  | NULL                                           | NULL           | NULL    | NULL  |     501 |             |
|    2 | DERIVED     | revision   | ref  | rev_page_id,page_timestamp,page_user_timestamp | page_timestamp | 4       | const | 2692818 | Using where |
+------+-------------+------------+------+------------------------------------------------+----------------+---------+-------+---------+-------------+

EXPLAIN for rev_page = 2337812:

+------+-------------+------------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
| id   | select_type | table      | type | possible_keys                                  | key            | key_len | ref   | rows  | Extra       |
+------+-------------+------------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+
|    1 | PRIMARY     | <derived2> | ALL  | NULL                                           | NULL           | NULL    | NULL  |   501 |             |
|    2 | DERIVED     | revision   | ref  | rev_page_id,page_timestamp,page_user_timestamp | page_timestamp | 4       | const | 97788 | Using where |
+------+-------------+------------+------+------------------------------------------------+----------------+---------+-------+-------+-------------+

I'm unable to emulate the nifty @jcrespo technique of flushing status, executing the query, then using status data, because I can't execute the flush:

FLUSH STATUS;
ERROR 1227 (42000): Access denied; you need (at least one of) the RELOAD privilege(s) for this operation

@BPirkle let's say the cap is 500. You should set your LIMIT to 501. If you get back 501, the output should be {"count": 500, "limit": true} meaning "more than 500". If you get back 500, the output should be {"count": 500, "limit": false} meaning "exactly 500". Make sense?

I think that makes sense. In that example, the returned count value is the maximum possible value, whether or not the limit actually affected the result. I assumed that I should always include the limit parameter, regardless of whether the count value was the maximum possible. So if the cap is 500 and there is only one minor edit, I'm returning {"count": 1, "limit": false}.

Let me know if I assumed incorrectly.

I hope we're going to apply the limit to all the count types, right?

That's an @eprodromou question. I do not know how the counts are used for other functionality that this endpoint supports, so I do not know if limiting those queries (which were approved without a limit) would be a bad thing.

I have no objection to limiting the rest, and it would be easy to code. But I haven't done it yet.

So if the cap is 500 and there is only one minor edit, I'm returning {"count": 1, "limit": false}.

Yes, that's right!
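
Putting the agreed semantics together, a minimal sketch (hypothetical helper, not the actual handler code):

/**
 * Sketch: interpret a row count obtained with LIMIT cap+1.
 * With a cap of 500 the query uses LIMIT 501; 501 rows back means
 * "more than 500", while 500 or fewer is the exact count.
 */
function formatCappedCount( int $rowCount, int $cap = 500 ) {
	if ( $rowCount > $cap ) {
		return [ 'count' => $cap, 'limit' => true ];
	}
	return [ 'count' => $rowCount, 'limit' => false ];
}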

0.46 sec is still bad for a public API endpoint, and causes outages like T234450 when run multiple times by external users. Please do not make decisions based only on time; the same query run twice will be served mostly from memory and will not be representative of the random parameters it may get.

If you cannot run flush, you can just reconnect to flush the session to 0. We could create a privileged stored procedure for convenience if you create a task.

To add more context: Please don't blame us DBAs! 0:-) On another app, 1000 runs a day of 0.46 seconds would be more than acceptable. However, because we are an open organization and it makes no sense to have "private/secret" APIs, your job is harder: we have to bullet-proof every new API introduced, because 99.999% of people will use it reasonably, but the one user who abuses it will ruin it for all the others.

What is an acceptable time? There is no hard documentation, but here is a guideline:

Every db server at the moment can run 64 fully parallel queries (while it can handle thousands of open connections, so we optimize for maximum throughput, not for minimum latency). A 0.5 second query means that each instance can handle 128 queries per second of that kind, multiplied by 2 instances per API (3 for enwiki), let's say 384. Currently we get 26778 queries per second on the 3 API hosts, so the average query time should be in the range of 0.0001 seconds. E.g. if you order the queries on those servers by aggregated latency, the top 20 results have an avg latency between 1.70 ms and 65.29 µs, with the occasional outlier in max latency.

Index + limit may work in your case (?)

@jcrespo thank you for the help! I'll sync with Bill when he's on.

To add more context: Please don't blame us DBAs! 0:-)

Not at all, and thank you very much for the detailed explanations! This is as deep as I've gone down this particular type of rabbit hole at WMF, and I'm learning a lot about how to think about db usage within our environment.

FWIW - yes I did see that subsequent executions were very fast (0.00-0.05 sec) due to being cached.

For the benefit of those of us not intimately familiar with our db indexes, here's what they currently look like on the revision table:

show index from revision;

Table    | Non_unique | Key_name            | Seq_in_index | Column_name   | Collation | Cardinality | Sub_part | Packed | Index_type
revision | 0          | PRIMARY             | 1            | rev_id        | A         | 780088345   | NULL     | NULL   | BTREE
revision | 1          | rev_page_id         | 1            | rev_page      | A         | 97511043    | NULL     | NULL   | BTREE
revision | 1          | rev_page_id         | 2            | rev_id        | A         | 780088345   | NULL     | NULL   | BTREE
revision | 1          | page_timestamp      | 1            | rev_page      | A         | 111441192   | NULL     | NULL   | BTREE
revision | 1          | page_timestamp      | 2            | rev_timestamp | A         | 780088345   | NULL     | NULL   | BTREE
revision | 1          | user_timestamp      | 1            | rev_user      | A         | 45887549    | NULL     | NULL   | BTREE
revision | 1          | user_timestamp      | 2            | rev_timestamp | A         | 780088345   | NULL     | NULL   | BTREE
revision | 1          | usertext_timestamp  | 1            | rev_user_text | A         | 97511043    | NULL     | NULL   | BTREE
revision | 1          | usertext_timestamp  | 2            | rev_timestamp | A         | 780088345   | NULL     | NULL   | BTREE
revision | 1          | page_user_timestamp | 1            | rev_page      | A         | 97511043    | NULL     | NULL   | BTREE
revision | 1          | page_user_timestamp | 2            | rev_user      | A         | 780088345   | NULL     | NULL   | BTREE
revision | 1          | page_user_timestamp | 3            | rev_timestamp | A         | 780088345   | NULL     | NULL   | BTREE
revision | 1          | rev_timestamp       | 1            | rev_timestamp | A         | 780088345   | NULL     | NULL   | BTREE

Fiddling a bit more on my local (which has very few pages, so absolutely not representative, but still):

EXPLAIN SELECT COUNT(*) FROM `revision` WHERE rev_page = 138 AND (rev_minor_edit != 0) LIMIT 1;
id | select_type | table    | type | possible_keys                                   | key         | key_len | ref   | rows | Extra
1  | SIMPLE      | revision | ref  | rev_page_id,page_timestamp,page_user_timestamp | rev_page_id | 4       | const | 13   | Using where
CREATE INDEX page_minor_edit ON revision(rev_page,rev_minor_edit) USING BTREE;
EXPLAIN SELECT COUNT(*) FROM `revision` WHERE rev_page = 138 AND (rev_minor_edit != 0) LIMIT 1;
id | select_type | table    | type  | possible_keys                                                   | key             | key_len | ref  | rows | Extra
1  | SIMPLE      | revision | range | rev_page_id,page_timestamp,page_user_timestamp,page_minor_edit | page_minor_edit | 5       | NULL | 3    | Using where; Using index

The index dropped "rows" substantially (even on my bitty table), "type" changed from "ref" to "range", and we got a "Using index" in Extra. While indexes on boolean fields are often not valuable, this is an index on the combination of rev_page and rev_minor_edit, not just on rev_minor_edit. It is therefore more helpful for this particular query.

Disclaimer: I'm no DBA, I just wanted to document my personal experiment.

I've captured a task T237043 for doing these counts the "right way" by keeping them in the database and incrementing on edit. @BPirkle @WDoranWMF I think we should try to implement this while API volume is still low; I've added it to our backlog and I hope we can get to it in an upcoming sprint.

eprodromou removed eprodromou as the assignee of this task.Oct 31 2019, 10:46 PM

To clarify status: it appears all the primary people involved (myself, @eprodromou , @WDoranWMF , and @jcrespo ) agree to proceed with the "index+limit" approach, using the code that is already in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/545590/

@jcrespo , do you need anything else from us to move forward with the index? To meet our commitments, we need to include this gerrit change in next week's train, before the deployment break (https://wikitech.wikimedia.org/wiki/Deployments#Upcoming).

@jcrespo Is there a possibility to look at this today? We were hoping to make it for the train on Tuesday.

eprodromou added a comment.EditedNov 1 2019, 3:36 PM

I'd like to suggest a couple of other mitigations to this problem.

  1. We simply refuse to execute the query for pages with a "long" history. Before running the "minor edit count" query, we run an "edit count" query to determine how long the history is. If it's more than a fixed length (1000 revisions?), we return a 500 error.
  2. We add a global governor to prevent this API call from being used globally more than N times per M seconds. The DDoS problem goes down, since any attempt to overuse the API call will just render it impossible to use for everyone.

Our current estimate of regular use is on the order of ~1000 calls per week, so if we give another order of magnitude wiggle room, we can limit ourselves comfortably to 10000 calls to this API method per week (168 hours), or about 60 calls per hour. Pseudocode to do that follows:

const PERIOD = 3600 // seconds per hour
const LIMIT = 60 // allowed calls during period

function getMinorRevisionCount(pageID) {
  int thisSecond = getTimeInSeconds()
  int thisPeriod = roundDown(thisSecond/PERIOD) // rounding usually happens automatically with integer division
  string key = concatenate("get_minor_revision_count_calls_", thisPeriod)
  int callsSoFar = objectCache::get(key)
  
  if (callsSoFar is not null and callsSoFar > LIMIT) {
     throw an exception
  }

  // atomic increment, hopefully defaults to zero

  objectCache::increment(key) 

   run the query

   return the results
}
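
A rough PHP rendering of that governor using MediaWiki's cluster-wide object cache might look like the sketch below; the key name, limits, and exception are illustrative, and unlike the pseudocode it increments before checking:

/**
 * Sketch: global governor for the minor-edit-count query, mirroring
 * the pseudocode above. Key name and limits are illustrative.
 */
function checkMinorEditCountGovernor() {
	$period = 3600; // seconds per hour
	$limit = 60;    // allowed calls during the period

	$cache = ObjectCache::getLocalClusterInstance();
	$bucket = (int)( time() / $period );                          // current hour bucket
	$key = $cache->makeKey( 'minor-edit-count-calls', $bucket );

	$cache->add( $key, 0, 2 * $period ); // initialize the bucket if missing (no-op otherwise)
	$calls = $cache->incr( $key );       // atomic increment, returns the new value or false

	if ( $calls !== false && $calls > $limit ) {
		throw new RuntimeException( 'Minor edit count calls are rate limited, try again later' );
	}
}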

@jcrespo My bad, I didn't realise Friday was a public holiday! Just to sync on next steps since you'll be on before any of us:

  1. We want to move forward with validating the index approach as our main priority
  2. Depending on your findings we can re-evaluate approaches.

Thanks again for all the support/help!

We want to move forward with validating the index approach as our main priority

Yes, I think I will be able to do this this week. Please note that while the technical steps are not too long, it takes some time to depool a host, apply the schema change, pool it back carefully into production load, check for query plan anomalies, depool it, revert, and repool.

I will report back when I am in the middle of it, so you can run additional tests if you want.

do you need anything else from us to move forward with the index? To meet our commitments, we need to include this gerrit change in next week's train, before the deployment break

Please note that deploying a full production schema change to all wikis can take 1-2 weeks for a "safe" change like this (adding an index or an extra column) from the moment it starts, so you can set your expectations accordingly. Please also note that there were people who planned upcoming changes with us at the beginning of the quarter, although we are happy to handle emergent tasks like this one to the best of our possibilities. More at: https://wikitech.wikimedia.org/wiki/Schema_changes#Additional_notes_on_scheduling

jcrespo claimed this task.Nov 4 2019, 9:34 AM
jcrespo moved this task from Blocked external/Not db team to Next on the DBA board.

@jcrespo ok thanks for the update.

If we were to take an approach where we used the limited version of the query, along with rate limiting of the API - similar to what is described in T235572#5626938, would that alleviate the concerns with the limit approach?

Option 1 would work, although I understand it is a big limitation. A variant would be to run the query with a limit, but I am not sure that functionality is implemented in MediaWiki. I would be a bit skeptical about Option 2 because nobody has been able to implement proper read API concurrency control in the last 5 years I have been asking for it (there is already poolcounter for edit concurrency), but probably for a good reason: T64615 T149421. It is literally your team, maybe with Performance, who can advise you better on that topic; my scope ends at the database layer.

My basic understanding of the pseudocode provided is that, if that is client code, you are rate-limiting on the client side (but the problem is not how the API will be used by what I believe is the target use case, but how it will be abused by 3rd parties); or, if that is server-side code, it is a "DoS as a service" endpoint: hit it 60 times to deny service for an hour to all clients (or, a third variant, you are tracking IP usage, which would go into privacy territory) :-/ Please loop in the Performance team for those solutions, I am just a (temporary) DBA. I would also loop in the memcache/appserver SRE owners for anything involving memcache.

Regarding rate limiting, @Anomie pointed me at User::pingLimiter when rate limiting was previously mentioned. I have thus far only glanced at that code, and am not stating it is a viable solution to the issue being discussed in this task. But I bring it up just because it may be relevant.
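
For reference, using that mechanism would look roughly like the sketch below; the 'minoreditcount' action name and the limits are hypothetical and would need a matching $wgRateLimits entry plus a decision on where the check belongs:

// Hypothetical configuration: at most 60 calls per hour per IP for this action.
$wgRateLimits['minoreditcount'] = [
	'ip' => [ 60, 3600 ],
];

// In the handler, before running the expensive query:
$user = RequestContext::getMain()->getUser();
if ( $user->pingLimiter( 'minoreditcount' ) ) {
	// Over the limit; the REST handler would turn this into an HTTP error.
	throw new RuntimeException( 'Rate limit exceeded for minor edit counts' );
}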

if that is server-side code, it is a "DoS as a service" endpoint: hit it 60 times to deny service for an hour to all clients

This is server-side code, just to keep this particular endpoint from bringing down the whole site. I think we'll live with the risk that someone would DOS this endpoint over the next couple of weeks while we wait to get a more permanent solution done.

This is server-side code, just to keep this particular endpoint from bringing down the whole site. I think we'll live with the risk that someone would DOS this endpoint over the next couple of weeks while we wait to get a more permanent solution done.

The entire REST API still sits behind the $wgEnableRestAPI feature flag, right? So there's no way anyone could actually bring down production via this API, as it is not enabled. If we merge the proposed gerrit change with the query that depends on the index, I think what we're really saying is

  1. We will not enable the REST API on production until the index is in place
  2. If for whatever reason the index doesn't work out, we'll refactor that code

I have no objection to merging the code (and therefore making the endpoint available on beta) if we're agreed on those points.

  1. We will not enable the REST API on production until the index is in place

That's not an option; we need to enable the API on production to support the iOS team's beta, preferably this week. Considering that the train stops running tomorrow for two weeks, it's important that we get it out ASAP.

I'd prefer if we could figure out a way to release this endpoint with some protection to keep it from being a problem for the site. If we can't, then let's simply return an error for now and fix it when we have an index in a few weeks. That way, our clients won't have to change their code.

Option 1 seems to have the fewest question marks and should be straightforward to implement:

We simply refuse to execute the query for pages with a "long" history. Before running the "minor edit count" query, we run an "edit count" query to determine how long the history is. If it's more than a fixed length (1000 revisions?), we return a 500 error.

I'll code that up.
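
A minimal sketch of that guard, with a hypothetical helper and the 1000-revision figure floated above (the gerrit change mentioned below is the authoritative implementation):

use Wikimedia\Rdbms\IDatabase;

/**
 * Sketch: refuse the minor-edit count for pages with a "long" history.
 * The overall edit count is cheap because it is index-only.
 */
function assertHistoryShortEnough( IDatabase $dbr, int $pageId, int $maxHistory = 1000 ) {
	$totalEdits = $dbr->selectRowCount(
		'revision',
		'1',
		[ 'rev_page' => $pageId ],
		__METHOD__,
		[ 'LIMIT' => $maxHistory + 1 ]
	);

	if ( $totalEdits > $maxHistory ) {
		// The REST handler would translate this into the agreed 500 response.
		throw new RuntimeException( 'Page history too long for a minor edit count' );
	}
}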

If we can't, then let's simply return an error for now and fix it when we have an index in a few weeks.

If we do absolutely nothing, it will return 400 currently. Is it ok?

If we do absolutely nothing, it will return 400 currently. Is it ok?

No, I'd like a 500 error.

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/545590/ was merged, with the "check edit count before executing minor edit count" safety net, returning 500 if the edit count limit is exceeded, as discussed.

I don't think it's appropriate to add an index for this. Indexes make INSERT queries slower, and increase the size of the database, which I think would result in a net performance reduction considering the low traffic planned for this endpoint.

From early discussions, I thought the responses from these statistics endpoints would be cached, but I don't see any caching in the merged code. How about we add object caching, and limit the concurrency of cache refreshes with PoolCounter? Would that be acceptable to @jcrespo?

nobody has been able to implement proper read API concurrency control in the last 5 years I have been asking for it (there is already poolcounter for edit concurrency), but probably for a good reason: T64615 T149421. It is literally your team, maybe with Performance, who can advise you better on that topic; my scope ends at the database layer.

I'm not sure why PoolCounter is still lightly used. I have always promoted it in these kinds of discussions. With the introduction of PoolCounterWorkViaCallback, we have made it dead simple to implement in calling code. It is not just used for editing, it is used in some very high-traffic situations such as limiting concurrency of search queries. It's fast and scalable.

We should probably make PoolCounter into a proper open source product, I really think it's a good piece of technology.
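
To make that concrete, here is a rough sketch combining WANObjectCache::getWithSetCallback() with PoolCounterWorkViaCallback; the cache key, TTL, and pool type are illustrative (the pool would need a matching $wgPoolCounterConf entry), and $pageId is assumed to be in scope:

use MediaWiki\MediaWikiServices;

$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
$count = $cache->getWithSetCallback(
	$cache->makeKey( 'page-minor-edit-count', $pageId ),
	WANObjectCache::TTL_HOUR,
	function () use ( $pageId ) {
		// Limit how many cache-miss refreshes may hit the database at once.
		$work = new PoolCounterWorkViaCallback(
			'PageHistoryCount',                 // hypothetical pool type
			'page-minor-edit-count:' . $pageId,
			[
				'doWork' => function () use ( $pageId ) {
					$dbr = wfGetDB( DB_REPLICA );
					return $dbr->selectRowCount(
						'revision',
						'1',
						[ 'rev_page' => $pageId, 'rev_minor_edit != 0' ],
						__METHOD__,
						[ 'LIMIT' => 501 ]
					);
				},
				'error' => function () {
					return false; // returning false skips caching on error
				},
			]
		);
		return $work->execute();
	}
);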

I can only support anything that tstarling says; so should I abort the index addition testing and shift to counting on edit (please advise)?

jcrespo reassigned this task from jcrespo to BPirkle.Nov 5 2019, 8:31 AM
CDanis added a subscriber: CDanis.Nov 5 2019, 1:59 PM
eprodromou closed this task as Resolved.Nov 5 2019, 3:53 PM

@jcrespo so, that sounds about right. Since we've implemented the "don't run for long edit histories" check, I'm going to close this ticket, and open a new one with @tstarling 's improvements.

I started T237430 to do the follow-on work per Tim. We also have T237043 for keeping the counts in a separate table, updated on edit. @tstarling your input there would be helpful.