
API: Contributions: Database query error
Closed, Resolved · Public

Description

Querying contributions over the API seems to have become quite hard on the databases at Wikimedia Commons.

(internal_api_error_DBQueryError): [952da8bcb8ea627656e95b2e] Database query error

I have a script that checks eligibility and requests 76 edits before a given date. The request looks like this:
https://commons.wikimedia.org/w/api.php?action=query&list=usercontribs&ucuser=Rillke&ucstart=2016-01-01T00%3A00%3A00Z&uclimit=76&ucprop=

action: query,
list: usercontribs,
ucuser: <user name>,
ucstart: 2016-01-01T00:00:00Z,
uclimit: 76,
ucprop: <empty because unimportant>

My tool's error report page (users have the option to retry or to report the issue) suggests that several users are running into this issue. When users choose to retry, it usually works, although it is still slow (I assume some kind of caching in the DB).

If you can't reproduce, take another user name.

The response time is also quite high, which surprises me, as I would assume there is an index on the timestamp.

Is there any way I could improve the API query so my users get a better experience?

Event Timeline

Anomie added subscribers: jcrespo, Anomie.
2016-03-28 16:32:38 [952da8bcb8ea627656e95b2e] mw1233 commonswiki 1.27.0-wmf.18 exception ERROR: [952da8bcb8ea627656e95b2e] /w/api.php?format=json&action=query&list=usercontribs&ucuser=Rillke&ucstart=2016-01-01T00%3A00%3A00Z&uclimit=76&ucprop=&requestid=commonswiki&_=1459182577085   DBQueryError from line 934 of /srv/mediawiki/php-1.27.0-wmf.18/includes/db/Database.php: A connection error occured.
Query: SELECT  rev_id,rev_timestamp,page_namespace,page_title,rev_user,rev_user_text,rev_deleted  FROM `page`,`revision` FORCE INDEX (usertext_timestamp)   WHERE (page_id=rev_page) AND ((rev_deleted & 12) != 12) AND rev_user_text = 'Rillke' AND (rev_timestamp<='20160101000000')  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 77
Function: ApiQueryContributions::execute
Error: 2013 Lost connection to MySQL server during query (10.64.16.8)
 {"exception_id":"952da8bcb8ea627656e95b2e"}
[Exception DBQueryError] (/srv/mediawiki/php-1.27.0-wmf.18/includes/db/Database.php:934) A connection error occured.
Query: SELECT  rev_id,rev_timestamp,page_namespace,page_title,rev_user,rev_user_text,rev_deleted  FROM `page`,`revision` FORCE INDEX (usertext_timestamp)   WHERE (page_id=rev_page) AND ((rev_deleted & 12) != 12) AND rev_user_text = 'Rillke' AND (rev_timestamp<='20160101000000')  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 77
Function: ApiQueryContributions::execute
Error: 2013 Lost connection to MySQL server during query (10.64.16.8)

  #0 /srv/mediawiki/php-1.27.0-wmf.18/includes/db/Database.php(884): DatabaseBase->reportQueryError(string, integer, string, string, boolean)
  #1 /srv/mediawiki/php-1.27.0-wmf.18/includes/db/Database.php(1233): DatabaseBase->query(string, string)
  #2 /srv/mediawiki/php-1.27.0-wmf.18/includes/api/ApiQueryBase.php(376): DatabaseBase->select(array, array, array, string, array, array)
  #3 /srv/mediawiki/php-1.27.0-wmf.18/includes/api/ApiQueryUserContributions.php(89): ApiQueryBase->select(string)
  #4 /srv/mediawiki/php-1.27.0-wmf.18/includes/api/ApiQuery.php(250): ApiQueryContributions->execute()
  #5 /srv/mediawiki/php-1.27.0-wmf.18/includes/api/ApiMain.php(1359): ApiQuery->execute()
  #6 /srv/mediawiki/php-1.27.0-wmf.18/includes/api/ApiMain.php(462): ApiMain->executeAction()
  #7 /srv/mediawiki/php-1.27.0-wmf.18/includes/api/ApiMain.php(433): ApiMain->executeActionWithErrorHandling()
  #8 /srv/mediawiki/php-1.27.0-wmf.18/api.php(83): ApiMain->execute()
  #9 /srv/mediawiki/w/api.php(3): include(string)
  #10 {main}

Slow query got killed.

I note the query insists on filesorting on db1019, probably thanks to the table partitioning on WMF's cluster breaking the clustering assumption. Interestingly, I see on db1051 that enwiki's revision table has rev_id explicitly appended to various revision-table indexes, which makes me think we had some other bug where this clustering was a problem and fixed it by making the indexes more explicit (but apparently we didn't do it everywhere). But I can't find a relevant bug in some quick searching. @jcrespo, do you remember anything like that?
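
For illustration, the index difference being described would look roughly like this (a sketch based on MediaWiki's stock tables.sql, not definitions dumped from the cluster):

-- Stock MediaWiki definition:
CREATE INDEX usertext_timestamp ON revision (rev_user_text, rev_timestamp);
-- What enwiki appears to have, with rev_id appended: for a given
-- rev_user_text, ORDER BY rev_timestamp DESC, rev_id DESC can then be
-- read straight off the index, avoiding the filesort:
CREATE INDEX usertext_timestamp ON revision (rev_user_text, rev_timestamp, rev_id);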

Another potential improvement would be for ApiQueryUserContributions to use rev_user rather than rev_user_text when possible (i.e. all usernames being queried exist in the database). I'll make a patch for that, although I'm not sure it'll have much effect without the index clustering fix enwiki has since the filesort probably also means it's fetching all matching rows before applying the limit.
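
A minimal sketch of that idea (the user ID is hypothetical and the user_timestamp index name is taken from the stock MediaWiki schema; treat the specifics as illustrative, not as the actual patch):

-- Current: filter on the string column, via usertext_timestamp
SELECT rev_id, rev_timestamp FROM revision
WHERE rev_user_text = 'Rillke'
ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 77;
-- Proposed: when the name resolves to a registered user, filter on the
-- integer column instead, which can use the (rev_user, rev_timestamp) index
SELECT rev_id, rev_timestamp FROM revision
WHERE rev_user = 12345  -- hypothetical user ID for 'Rillke'
ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 77;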

Change 280047 had a related patch set uploaded (by Anomie):
API: Use rev_user when possible for list=usercontribs

https://gerrit.wikimedia.org/r/280047

How can the similar query on which Special:Contributions is based be so fast compared to the API query? (I might look into this myself when I find the time.)

I see Special:Contributions doesn't do a strict ordering, i.e. if someone made two edits in the same second there's no guarantee as to what order it'll display those two edits in, and if that crosses a paging boundary it might get confused when you click "next" (see T26782). That means it won't run into the filesorting issue that the API query runs into. And it already uses rev_user when possible.
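
Concretely, the two orderings differ only in the tiebreaker (paraphrased, not the exact production SQL):

-- API (list=usercontribs): strict total order; the rev_id tiebreaker is
-- not in the (rev_user_text, rev_timestamp) index, hence the filesort
ORDER BY rev_timestamp DESC, rev_id DESC
-- Special:Contributions: no tiebreaker; the index alone satisfies the
-- sort, but edits within the same second come back in arbitrary order
ORDER BY rev_timestamp DESC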

Poyekhali triaged this task as Medium priority. Apr 13 2016, 4:13 AM
Poyekhali subscribed.

I see Special:Contributions doesn't do a strict ordering, i.e. if someone made two edits in the same second there's no guarantee as to what order it'll display those two edits in, and if that crosses a paging boundary it might get confused when you click "next" (see T26782). That means it won't run into the filesorting issue that the API query runs into. And it already uses rev_user when possible.

Good catch! We should try to order by a set of indexed columns, or a column + PK, when possible, but I do not know how feasible that is.

Tonight there was a spike in wfLogDBError due to some overload that also caused lag of up to 40 seconds on one of the s7 slaves (db1034).

The offending queries executed around May 2nd, 22:00-22:30 UTC were ~50 of this type:

SELECT /* ApiQueryContributions::execute [...OMISSIS...] */ rev_id, rev_timestamp, page_namespace, page_title, rev_user, rev_user_text, rev_deleted FROM `page`, `revision` FORCE INDEX (usertext_timestamp) WHERE (page_id=rev_page) AND ((rev_deleted & 4) != 4) AND rev_user_text = '[...OMISSIS...]' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2

An example EXPLAIN for one of these:

+------+-------------+----------+--------+--------------------+--------------------+---------+--------------------------+---------+-----------------------------+
| id   | select_type | table    | type   | possible_keys      | key                | key_len | ref                      | rows    | Extra                       |
+------+-------------+----------+--------+--------------------+--------------------+---------+--------------------------+---------+-----------------------------+
|    1 | SIMPLE      | revision | ref    | usertext_timestamp | usertext_timestamp | 257     | const                    | 2767162 | Using where; Using filesort |
|    1 | SIMPLE      | page     | eq_ref | PRIMARY            | PRIMARY            | 4       | cawiki.revision.rev_page |       1 |                             |
+------+-------------+----------+--------+--------------------+--------------------+---------+--------------------------+---------+-----------------------------+

With execution times up to 222 seconds.

In general, the use of FORCE INDEX is discouraged: under some circumstances it can make MySQL use the wrong plan for the current data and indexes, and it may become misaligned with future database improvements.

In this particular case most of them were executed on cawiki.
Logs from the HHVM slow query log are available here:
https://logstash.wikimedia.org/#dashboard/temp/AVR2VaXQCsPTNesWb9Vf

In general, the use of FORCE INDEX is discouraged: under some circumstances it can make MySQL use the wrong plan for the current data and indexes, and it may become misaligned with future database improvements.

That's probably a red herring in this case: usertext_timestamp is the most sensible index for this query, and it's what gets picked in the absence of any forcing anyway.

The thing that concerns me in that explain is the filesort, which comes down to what I already said in T131065#2155924.

That's probably a red herring in this case: usertext_timestamp is the most sensible index for this query, and it's what gets picked in the absence of any forcing anyway.

Absolutely, I reported it just as a general bad practice, not as the possible cause here.

For potential index differences (see T132416) I've looked at cawiki only (s7), because it was the wiki on which most of the queries in the above-mentioned event were executed. The differences are summarized here: T132416#2262197

I also agree with your comment on the filesort: there are rev_user_text values (usually belonging to bots) with a lot of rows, for which the number of rows examined is quite large, hence the drop in performance when the filesort is used.

As expected, I was able to get an explain showing only "Using where" by removing the FORCE INDEX and the ORDER BY rev_id DESC.
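
With both removed, the query looks like this (reconstructed from the example above, same placeholders):

SELECT /* ApiQueryContributions::execute [...OMISSIS...] */ rev_id, rev_timestamp, page_namespace, page_title, rev_user, rev_user_text, rev_deleted FROM `page`, `revision` WHERE (page_id=rev_page) AND ((rev_deleted & 4) != 4) AND rev_user_text = '[...OMISSIS...]' ORDER BY rev_timestamp DESC LIMIT 2;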

Then, as a quick/ugly test, I tried a subselect (still without force index) that actually does the trick in this case. To make the test realistic I used a rev_user_text value belonging to one of the bots that have a lot of rows:

SELECT * FROM (SELECT /* ApiQueryContributions::execute [...OMISSIS...] */ rev_id, rev_timestamp, page_namespace, page_title, rev_user, rev_user_text, rev_deleted FROM `page`, `revision` WHERE (page_id=rev_page) AND ((rev_deleted & 4) != 4) AND rev_user_text = '[...OMISSIS...]' ORDER BY rev_timestamp DESC LIMIT 20) AS tmp ORDER BY tmp.rev_timestamp DESC, tmp.rev_id DESC;
...
20 rows in set (0.00 sec)
# For comparison, with the current query:
20 rows in set (1 min 40.77 sec)

Here the filesort is applied only to the already timestamp-sorted and limited results, hence the speed:

+------+-------------+------------+--------+-----------------------------------+--------------------+---------+--------------------------+---------+----------------+
| id   | select_type | table      | type   | possible_keys                     | key                | key_len | ref                      | rows    | Extra          |
+------+-------------+------------+--------+-----------------------------------+--------------------+---------+--------------------------+---------+----------------+
|    1 | PRIMARY     | <derived2> | ALL    | NULL                              | NULL               | NULL    | NULL                     |      20 | Using filesort |
|    2 | DERIVED     | revision   | ref    | page_timestamp,usertext_timestamp | usertext_timestamp | 257     | const                    | 2766470 | Using where    |
|    2 | DERIVED     | page       | eq_ref | PRIMARY                           | PRIMARY            | 4       | cawiki.revision.rev_page |       1 |                |
+------+-------------+------------+--------+-----------------------------------+--------------------+---------+--------------------------+---------+----------------+

Would just sorting by rev_id not be an option?
Could it happen that a row with rev_id = N has a rev_timestamp greater than that of a row with rev_id > N? (In other words, don't both increase monotonically together?)

Then, as a quick/ugly test, I tried a subselect (still without force index) that actually does the trick in this case. To make the test realistic I used a rev_user_text value belonging to one of the bots that have a lot of rows:

SELECT * FROM (SELECT /* ApiQueryContributions::execute [...OMISSIS...] */ rev_id, rev_timestamp, page_namespace, page_title, rev_user, rev_user_text, rev_deleted FROM `page`, `revision` WHERE (page_id=rev_page) AND ((rev_deleted & 4) != 4) AND rev_user_text = '[...OMISSIS...]' ORDER BY rev_timestamp DESC LIMIT 20) AS tmp ORDER BY tmp.rev_timestamp DESC, tmp.rev_id DESC;
...
20 rows in set (0.00 sec)
# For comparison, with the current query:
20 rows in set (1 min 40.77 sec)

Here the filesort is applied only to the already timestamp-sorted and limited results, hence the speed:

It also won't work. Let's reduce the limit from 20 to 2 for simplicity, and consider a user with only these three rows in revision:

+----------------+--------+
| rev_timestamp  | rev_id |
+----------------+--------+
| 20160203123456 |      1 |
| 20160203123456 |      2 |
| 20160203123456 |      3 |
+----------------+--------+

Since all the timestamps are the same and rev_id is unconstrained by the ORDER BY, your subselect is free to return the rows with rev_id 1 and 2 or 1 and 3, which gives the following result after the outer select:

+----------------+--------+      +----------------+--------+
| rev_timestamp  | rev_id |      | rev_timestamp  | rev_id |
+----------------+--------+      +----------------+--------+
| 20160203123456 |      2 |  OR  | 20160203123456 |      3 |
| 20160203123456 |      1 |      | 20160203123456 |      1 |
+----------------+--------+      +----------------+--------+

Unfortunately, that's not the correct result here. The correct result is, of course,

+----------------+--------+
| rev_timestamp  | rev_id |
+----------------+--------+
| 20160203123456 |      3 |
| 20160203123456 |      2 |
+----------------+--------+

Would just sorting by rev_id not be an option?

No, that would reopen T91883: prop=revisions sorts by rev_id, not by rev_timestamp.

Could it happen that a row with rev_id = N has a rev_timestamp greater than that of a row with rev_id > N? (In other words, don't both increase monotonically together?)

Yes. It happens easily with imported edits.
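
To illustrate (values invented): an edit imported from another wiki keeps its original timestamp but is assigned a brand-new, higher rev_id:

+----------------+-----------+
| rev_timestamp  | rev_id    |
+----------------+-----------+
| 20160503120000 | 400000000 |  <- normal edit
| 20050101120000 | 400000001 |  <- imported edit: older timestamp, newer rev_id
+----------------+-----------+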

It also won't work. Let's reduce the limit from 20 to 2 for simplicity, and consider a user with only these three rows in revision:

Since all the timestamps are the same and rev_id is unconstrained by the ORDER BY, your subselect is free to return the rows with rev_id 1 and 2 or 1 and 3, which gives the following result after the outer select:

Unfortunately, that's not the correct result here. The correct result is, of course,

True, but if there is a reasonable limit on the number of revisions that can be generated in the same second with the same rev_user_text for other reasons (API/edit limits, etc.), then the inner select could always use a fixed safe limit (let's say 100), and the real limit would then be applied in the outer select.
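
A sketch of that idea, reusing the earlier example with a requested limit of 2 (the inner limit of 100 is the assumed safety margin, not a tested value, and the correctness caveat above still applies if more than ~100 rows share a timestamp at the boundary):

SELECT * FROM (
    SELECT rev_id, rev_timestamp, page_namespace, page_title, rev_user, rev_user_text, rev_deleted
    FROM `page`, `revision`
    WHERE (page_id=rev_page) AND ((rev_deleted & 4) != 4) AND rev_user_text = '[...OMISSIS...]'
    ORDER BY rev_timestamp DESC
    LIMIT 100  -- fixed "safe" inner limit; satisfied from the index, no filesort
) AS tmp
ORDER BY tmp.rev_timestamp DESC, tmp.rev_id DESC  -- filesort over at most 100 rows
LIMIT 2;  -- the real requested limit, applied after the strict sort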

As I said, this was just a quick/ugly test that could be used as a temporary workaround at most, not a final solution.
A final solution should probably involve revisiting the indexes and the queries to ensure that MySQL can do the sort using indexed values, and making a decision about the differing primary keys/indexes on the revision table across the wikis and the fleet.

Change 280047 merged by jenkins-bot:
API: Use rev_user when possible for list=usercontribs

https://gerrit.wikimedia.org/r/280047

Marostegui assigned this task to Anomie.
Marostegui subscribed.

Change 280047 merged by jenkins-bot:
API: Use rev_user when possible for list=usercontribs

https://gerrit.wikimedia.org/r/280047

This probably fixed the issue.
There have been no reports of this query failing in the last 90 days, and the explain without the FORCE INDEX looks pretty healthy.
There has also been a lot of work done to unify the revision table across the cluster.

Feel free to reopen if you consider this is not fixed yet.