Page MenuHomePhabricator

Using both rvuser and rvcontinue with prop=revisions causes database error on pages with a lot of revisions
Closed, ResolvedPublic

Description

Continuing prop=revisions queries doesn't seem to work on pages with a lot of revisions like WP:AIV when rvuser is used.

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Wikipedia:Administrator%20intervention%20against%20vandalism&rvprop=timestamp|comment&rvuser=Mr.Z-man&rvlimit=10
works just fine to return 10 edits and
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Wikipedia:Administrator%20intervention%20against%20vandalism&rvprop=timestamp|comment&rvuser=Mr.Z-man&rvlimit=1&continue=
works to return 1. But when we try to continue the query:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Wikipedia:Administrator%20intervention%20against%20vandalism&rvprop=timestamp|comment&rvuser=Mr.Z-man&rvlimit=1&continue=||&rvcontinue=644715742
after a minute or so of waiting we get:

{
    "servedby": "mw1229",
    "error": {
        "code": "internal_api_error_DBQueryError",
        "info": "[be045702] Database query error"
    }
}

It works fine without rvuser and seems to work fine on pages with fewer revisions, and interestingly also works when rvlimit is larger (on that page it seems to work with rvlimit >3)

Event Timeline

Mr.Z-man raised the priority of this task from to Needs Triage.
Mr.Z-man updated the task description. (Show Details)
Mr.Z-man added a project: MediaWiki-Action-API.
Mr.Z-man subscribed.
Anomie added subscribers: Anomie, Springle.

The logged exception for that query is:

2015-01-30 05:55:26 mw1229 enwiki: [be045702] /w/api.php?action=query&prop=revisions&titles=Wikipedia:Administrator%20intervention%20against%20vandalism&rvprop=timestamp|comment&rvuser=Mr.Z-man&rvlimit=1&continue=||&rvcontinue=644715742   DBQueryError from line 1259 of /srv/mediawiki/php-1.25wmf14/includes/db/Database.php: A database error has occurred. Did you forget to run maintenance/update.php after upgrading?  See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
Query: SELECT  rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len  FROM `revision` INNER JOIN `page` ON ((page_id = rev_page))  WHERE (rev_id<='644715742') AND rev_page = '1952670' AND rev_user_text = 'Mr.Z-man' AND ((rev_deleted & 4) != 4)  ORDER BY rev_id DESC LIMIT 2  
Function: ApiQueryRevisions::run
Error: 2013 Lost connection to MySQL server during query (10.64.48.21)

#0 /srv/mediawiki/php-1.25wmf14/includes/db/Database.php(1197): DatabaseBase->reportQueryError()
#1 /srv/mediawiki/php-1.25wmf14/includes/db/Database.php(1693): DatabaseBase->query()
#2 /srv/mediawiki/php-1.25wmf14/includes/api/ApiQueryBase.php(379): DatabaseBase->select()
#3 /srv/mediawiki/php-1.25wmf14/includes/api/ApiQueryRevisions.php(307): ApiQueryBase->select()
#4 /srv/mediawiki/php-1.25wmf14/includes/api/ApiQueryRevisionsBase.php(43): ApiQueryRevisions->run()
#5 /srv/mediawiki/php-1.25wmf14/includes/api/ApiQuery.php(297): ApiQueryRevisionsBase->execute()
#6 /srv/mediawiki/php-1.25wmf14/includes/api/ApiMain.php(1023): ApiQuery->execute()
#7 /srv/mediawiki/php-1.25wmf14/includes/api/ApiMain.php(393): ApiMain->executeAction()
#8 /srv/mediawiki/php-1.25wmf14/includes/api/ApiMain.php(364): ApiMain->executeActionWithErrorHandling()
#9 /srv/mediawiki/php-1.25wmf14/api.php(85): ApiMain->execute()
#10 /srv/mediawiki/w/api.php(3): include()
#11 {main}

The error there sounds to me like the query was taking too long and got killed.

Explaining the query looks like it's using the primary key (on enwiki that's (rev_page,rev_id)) and scanning for matching rev_user_text. Increasing the limit gets it to use usertext_timestamp instead and then filesorting, which works better for this particular query. But I suspect it might break similarly when used on *users* with lots of revisions, particularly if the wiki has just rev_id as the primary key as specified in tables.sql. @Springle, any suggestions?

Aklapper triaged this task as Medium priority.Feb 2 2015, 9:55 PM

Including rev_timestamp when ordering would help in this case:

enwiki> explain
SELECT  rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len  
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page))
WHERE (rev_id<='644715742') AND rev_page = '1952670' AND rev_user_text = 'Mr.Z-man' AND ((rev_deleted & 4) != 4) 
ORDER BY rev_timestamp DESC, rev_id DESC
LIMIT 2\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
        Extra:
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: PRIMARY,rev_id,page_timestamp,usertext_timestamp,page_user_timestamp
          key: usertext_timestamp
      key_len: 257
          ref: const
         rows: 41628
        Extra: Using where
2 rows in set (0.25 sec)

Alternatively, convert the user text to a user id with a non-correlated subquery, to use page_user_timestamp:

enwiki> explain
SELECT  rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len
FROM `revision`
INNER JOIN `page` ON ((page_id = rev_page)) 
WHERE (rev_id<='644715742') AND rev_page = '1952670' AND rev_user = (
   select user_id from user where user_name = 'Mr.Z-man'
) AND ((rev_deleted & 4) != 4)
ORDER BY rev_id DESC LIMIT 2\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: page
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
        Extra: Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: revision
         type: ref
possible_keys: PRIMARY,rev_id,page_timestamp,user_timestamp,page_user_timestamp
          key: page_user_timestamp
      key_len: 8
          ref: const,const
         rows: 134
        Extra: Using index condition; Using where
*************************** 3. row ***************************
           id: 2
  select_type: SUBQUERY
        table: user
         type: const
possible_keys: user_name
          key: user_name
      key_len: 257
          ref: const
         rows: 1
        Extra: Using index
3 rows in set (0.25 sec)

Maybe that breaks with username changes or reuse somehow and would need fancier subquery. It also incurs an, albeit small, filesort for some reason I don't follow yet...

Maybe that breaks with username changes or reuse somehow and would need fancier subquery.

IP users all have user_id = 0. I could use different queries for the two cases, though: rev_user for logged-in and rev_user_text for not.

These are the queries I'm looking at now:

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len FROM revision INNER JOIN page ON ((page_id = rev_page))
WHERE rev_page = '1952670' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len FROM revision INNER JOIN page ON ((page_id = rev_page))
WHERE rev_page = '1952670' AND rev_timestamp <= '20150131020902' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len FROM revision INNER JOIN page ON ((page_id = rev_page))
WHERE rev_page = '1952670' AND rev_timestamp <= '20150131020902' AND rev_id <= 644929755 ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len FROM revision INNER JOIN page ON ((page_id = rev_page))
WHERE (rev_timestamp < '20150126174311' OR (rev_timestamp = '20150126174311' AND rev_id <= 644278379)) AND rev_page = '1952670' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;

I'm running into some weird issue here, these all want to use the rev_timestamp index despite page_timestamp (with the implicitly-appended rev_id) looking like a much better choice. Increasing the limit to 501 makes it decide to use PRIMARY (which on enwiki is (rev_page,rev_id)), again rather than page_timestamp. It will choose page_timestamp if I remove ", rev_id DESC" from the ORDER BY, but the API needs this for correct behavior. This makes me think of T72558.

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len FROM revision INNER JOIN page ON ((page_id = rev_page))
WHERE rev_page = '1952670' AND rev_user = '646348' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;

This one works good, except it insists on filesorting unless "rev_id DESC" is removed which is exactly like T72558.

All the various other WHERE clauses from the first block could be added in here too, and it seems to continue using the correct page_user_timestamp index with the T72558-like issue.

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len FROM revision INNER JOIN page ON ((page_id = rev_page))
WHERE rev_page = '1952670' AND rev_user_text = '190.167.61.153' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;

This one doesn't insist on filesorting. Which is correct, but why not this one when all the rest are wrong?

And it continues working with all the other WHERE clauses.

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len FROM revision INNER JOIN page ON ((page_id = rev_page))
WHERE rev_page = '1952670' AND rev_user != '646348' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;

SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len FROM revision INNER JOIN page ON ((page_id = rev_page))
WHERE rev_page = '1952670' AND rev_user_text != '190.167.61.153' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;

This one should be like the first one above, with a "Using where" filtering on rev_user. And it does, to the point of wanting to use the rev_timestamp index instead of page_timestamp.

I tried all these on the enwiki logpager slave db1055 which has an altered revision primary key on (rev_id, rev_user) in order to partition by user. In all cases the plans switched back using sane index choices without filesorts.

I'll pull an enwiki slave and alter it to the standard rev_id primary key to see how that affects things.

Interesting, db1057 enwiki with revision primary key (rev_id) shows sane plans for all these queries, apparently simply because PRIMARY is removed from contention. May be time to bite the bullet and bring old wiki revision tables into line with the current schema.

As for the why... a couple theories to test. Including rev_page in the primary key may have made rev_timestamp index too attractive to the optimizer when it is rarely any better than a full table scan on enwiki. Alternatively had the enwiki revision PK been in reverse order (rev_id, rev_page) the filesorts may not have been needed. /guess

If you'd like to test other queries, connect to db1057 directly. It isn't repooled yet and is fine to hammer.

gerritbot subscribed.

Change 188843 had a related patch set uploaded (by Anomie):
API: Improve queries for prop=revisions in enum mode

https://gerrit.wikimedia.org/r/188843

Patch-For-Review

db1057 is repooled and happy enough. Going to incrementally correct the revision primary keys for all old wikis on the other database boxes.

Change 190768 had a related patch set uploaded (by Springle):
depool db1065 T88084

https://gerrit.wikimedia.org/r/190768

Patch-For-Review

Both enwiki API slaves have had their revision primary keys altered and the query plans look sane. Giving it a while to settle down then will review tendril query logs.

A related note: MySQL 5.7 has a query rewriting module but MariaDB 10 does not. I've ported a subset of the functionality into our MariaDB 10.0.16 tree and applied a few rules[1] to the enwiki API slaves that apply hints for edge cases when the optimizer chooses the wrong plan. It seems to be working well to catch some outlying API queries, yet still allowing the optimizer some freedom where adding new FORCE INDEX or STRAIGHT_JOIN clauses would not. If it works for a week I'll document and puppetize.

[1] Eg:

ApiQueryRevisions::run query with ORDER BY rev_id where page_timestamp and page_user_timestamp should have lower weight than rev_page_id.

ApiQueryRevisions::run query with ORDER BY rev_timestamp where choosing an index scan on rev_timestamp is little better than a full table scan, and range/ref should be weighted much higher.

AFComputedVariable::compute with ORDER BY rev_timestamp where page_user_timestamp should be weighted lower than page_timestamp despite ICP (index condition pushdown).

Nemo_bis added subscribers: Nemo_bis, PiRSquared17.
Nemo_bis added a subscriber: XZise.

Change 188843 merged by jenkins-bot:
API: Improve queries for prop=revisions in enum mode

https://gerrit.wikimedia.org/r/188843

Anomie claimed this task.

We've got improved queries from the API coming with 1.26wmf4, and enhancements in the database as well. I'm going to mark this resolved, but feel free to reopen if there's still something to do here.

jcrespo reopened this task as Open.EditedAug 7 2015, 1:21 PM
jcrespo subscribed.

Reopen because I am getting an increasing number of long running queries of this type (top API/db error). I suspect it is the same problem, but for other wikis different than enwiki (dewiki)*. Feel free to prove me wrong (I have not yet done a full research).

  • Update: I am wrong, it is also happening for enwiki.
Krinkle subscribed.

(Classifying as not an issue with the RDBMS library or the current schema. If and when such issue is identified, feel free to move around.)

@Krinkle It took me some time to understand your comment and classification. Wouldn't be nice to have a specific tag for #query-performance or #wikimedia-query-performance or #slow-database-queries for discovery reasons (e.g. finding duplicates and previous examples)? What do you think?

We usually group tasks by component (or "feature"), and another dimension for teams. Overly wide components tracking tasks that nobody would triage or manage tend to die-off, cause confusion, or cause tasks to be only tagged there and subsequently forgotten/missed.

We do have a few yellow/generic tasks such as JavaScript, I18n and Performance-Team. The last one can be used here instead if that helps discovery. I've not seen tags used for discovery in practice (typically just general text search instead), but maybe that's just me. I'll tag it with that for now.

I think your classification was already right, I was proposing to add on top a new yellow one, as normally work on those require a combination of performance, DBAs and either core or other team on the product side, depending on the code module. The reasoning for that is that it would help avoid duplicate reports, in the same spirit as Wikimedia-production-error. It was just a suggestion, I understand if you consider it may not be useful.

Umherirrender subscribed.

Together with the fixes for T204669 the performance of the above queries feels good.

Feel free to reopen or create a new task if there are other timeouts using prop=revisions