Page MenuHomePhabricator

ApiQueryRecentChanges::run is spiking, nuking API servers
Closed, ResolvedPublic

Description

Since yesterday, this cause are multi-second (up to 160 seconds) queries like this:

SELECT /* ApiQueryRecentChanges::run <user> */ rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid, rc_user, rc_user_text, ts_tags, page_is_redirect FROM `recentchanges` FORCE INDEX (rc_timestamp) LEFT JOIN `tag_summary` ON ((rc_id=ts_rc_id)) LEFT JOIN `page` ON ((rc_namespace=page_namespace) AND (rc_title=page_title)) WHERE rc_namespace = '0' AND rc_type = '1' AND (rc_patrolled = 0) AND (page_is_redirect = 0 OR page_is_redirect IS NULL) ORDER BY rc_timestamp DESC, rc_id DESC LIMIT 51 /* 4d1857ea652abc7849668dc00f8055bb db1060 zhwiki 2s */

Either something has changed on the code generating them, or a js user script has changed, executing them in thousands.

Event Timeline

jcrespo created this task.Jul 12 2016, 2:47 PM
Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptJul 12 2016, 2:47 PM
jcrespo triaged this task as Unbreak Now! priority.Jul 12 2016, 2:49 PM

I've setup a watchdog that kills API queries running for over 10 seconds.

Other queries that are misbehaving:

SELECT /* ApiQueryRecentChanges::run XXXX */ rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid, page_is_redirect FROM `recentchanges` FORCE INDEX (rc_timestamp) LEFT JOIN `page` ON ((rc_namespace=page_namespace) AND (rc_title=page_title)) WHERE rc_namespace = '0' AND rc_type = '1' AND (rc_patrolled = 0) AND (page_is_redirect = 0 OR page_is_redirect IS NULL) ORDER BY rc_timestamp DESC, rc_id DESC LIMIT 11
Restricted Application added subscribers: Luke081515, TerraCodes. · View Herald TranscriptJul 12 2016, 2:49 PM

Most queries seem to come from a few zh-users:

  • 写字板
  • Stang
  • A2093064
  • Assanges
  • Galaxyharrylion
  • Amazingloong

(there may be some false positives here, but if we find something in common among those users, we may find the issue: probably some kind of too aggressive user's .js code that was updated recently). Can someone help contacting those users?

Anomie added a subscriber: Anomie.EditedJul 12 2016, 3:32 PM

In general, this reminds me of T101502 and similar bugs: the module has a bunch of filters, and if rows passing the combination of filters is relatively rare it can result in lots of rows being scanned before enough are found to meet the limit.

OTOH, in this specific case we may be making things worse by forcing a particular index. This query could probably better be served by the new_name_timestamp index, and I see db1060 has an even better one available (name_type_patrolled_timestamp) that isn't in maintenance/tables.sql.

Edit: new_name_timestamp wouldn't help since we're not including rc_new in the query.

Change 298495 had a related patch set uploaded (by Anomie):
API: Remove index forcing in ApiQueryRecentChanges

https://gerrit.wikimedia.org/r/298495

jcrespo added a comment.EditedJul 12 2016, 3:40 PM

@Anomie yes, however, knowing that happens I think we should be prepared- have some kind of limitation on the query time when the same query is very fequent (which right now I have badly done monitoring the DB; and in extreme cases, being able to easily block certain combination of parameters, and just return an error immediately (I cannot do that at db side).

These are for longer term, let me test your patch first.

The explain is very clean:

MariaDB PRODUCTION s2 localhost zhwiki > EXPLAIN SELECT /* ApiQueryRecentChanges::run <user> */ rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid, rc_user, rc_user_text, ts_tags, page_is_redirect FROM `recentchanges` FORCE INDEX (rc_timestamp) LEFT JOIN `tag_summary` ON ((rc_id=ts_rc_id)) LEFT JOIN `page` ON ((rc_namespace=page_namespace) AND (rc_title=page_title)) WHERE rc_namespace = '0' AND rc_type = '1' AND (rc_patrolled = 0) AND (page_is_redirect = 0 OR page_is_redirect IS NULL) ORDER BY rc_timestamp DESC, rc_id DESC LIMIT 51\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: recentchanges
         type: index
possible_keys: NULL
          key: rc_timestamp
      key_len: 16
          ref: NULL
         rows: 51
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: tag_summary
         type: ref
possible_keys: ts_rc_id,tag_summary_rc_id
          key: ts_rc_id
      key_len: 5
          ref: zhwiki.recentchanges.rc_id
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: name_title
          key: name_title
      key_len: 261
          ref: zhwiki.recentchanges.rc_namespace,zhwiki.recentchanges.rc_title
         rows: 1
        Extra: Using where
3 rows in set (0.00 sec)

But completely wrong, it takes 3 second to execute:

+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 1       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 0       |
| Handler_icp_attempts       | 0       |
| Handler_icp_match          | 0       |
| Handler_mrr_init           | 0       |
| Handler_mrr_key_refills    | 0       |
| Handler_mrr_rowid_refills  | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 0       |
| Handler_read_key           | 1550    |
| Handler_read_last          | 1       |
| Handler_read_next          | 72      |
| Handler_read_prev          | 1390074 |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_deleted   | 0       |
| Handler_read_rnd_next      | 0       |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_tmp_update         | 0       |
| Handler_tmp_write          | 0       |
| Handler_update             | 0       |
| Handler_write              | 0       |
+----------------------------+---------+
25 rows in set (0.00 sec)

There is a huge difference between 1390074 and 51 rows.

I tested extensively the queries, and I agree with your conclusions, but I have some extra:

  • I have fixed a bug that I think it wasn't part of the root cause, but it could add varaibles to the equation and it was problematic- the optimizer column on mysql.column stats was using 3-byte UTF-8, which failed to update the table statistics.
  • name_type_patrolled_timestamp would definitely help- let me create it on db1054, too (out of band)
  • Let's deploy your proposed change- if it doesn't break other queries too much, we will formalize name_type_patrolled_timestamp everywhere

Mentioned in SAL [2016-07-12T17:37:22Z] <jynus> out of band ALTER TABLE recentchanges ADD KEY name_type_patrolled_timestamp on db1054 T140108

jcrespo lowered the priority of this task from Unbreak Now! to High.Jul 12 2016, 5:40 PM

High because I think it will happen again, but I am now prepared.

Anomie moved this task from Unsorted to Needs Review on the MediaWiki-API board.Jul 12 2016, 9:13 PM

Change 298495 merged by jenkins-bot:
API: Remove index forcing in ApiQueryRecentChanges

https://gerrit.wikimedia.org/r/298495

Change 298813 had a related patch set uploaded (by Legoktm):
API: Remove index forcing in ApiQueryRecentChanges

https://gerrit.wikimedia.org/r/298813

Change 298813 merged by jenkins-bot:
API: Remove index forcing in ApiQueryRecentChanges

https://gerrit.wikimedia.org/r/298813

Mentioned in SAL [2016-07-13T19:14:11Z] <legoktm@tin> Synchronized php-1.28.0-wmf.8/includes/api/ApiQueryRecentChanges.php: API: Remove index forcing in ApiQueryRecentChanges - T140108 (duration: 00m 26s)

No errors last night (was something deployed or did something changed on zhwiki?).

No errors last night (was something deployed or did something changed on zhwiki?).

I deployed the patch to all active MediaWiki branches at 2016-07-13T19:14:11Z.

The difference can be seen at: https://logstash.wikimedia.org/#dashboard/temp/AVXou9fFw3dCNxx2AiAK

Two pending things:

  • Check that the change doesn't break other unlrelated things
  • Add the index properly to mediawiki and deploy it (it is online and only takes second per wiki)
jcrespo lowered the priority of this task from High to Normal.Jul 14 2016, 9:30 AM

I would add a third:

  • document properly all FORCE/USE/IGNORE INDEXES and STRAIGHT JOINs. It is ok to add them, but they are still a hack and cause maintenance hell. Try to minimize those by centralizing all FORCE INDEX on the code with proper links to the issues and explanation why they were added.

Change 298995 had a related patch set uploaded (by Anomie):
Add rc_name_type_patrolled_timestamp index

https://gerrit.wikimedia.org/r/298995

Mentioned in SAL [2016-07-28T15:07:06Z] <jynus> adding new index (schema change) to recentchanges T140108

jcrespo moved this task from Triage to In progress on the DBA board.Jul 28 2016, 3:18 PM

I am deploying this now to all WMF servers.

Mentioned in SAL [2016-07-29T07:30:18Z] <jynus> schema change continues for s2, s1, s4 and s5 T140108

This is almost done, I have to do a second pass to redo the tables that failed to apply the change:

dbstore1002.eqiad.wmnet: svwiki
db1026.eqiad.wmnet: wikidatawiki
all s3 ones

And delete the previously undocumented ones.

Mentioned in SAL [2016-08-01T09:25:22Z] <jynus> dropping index name_type_patrolled_timestamp on zhwiki on db1060 and db1054 T140108

jcrespo moved this task from In progress to Done on the DBA board.Aug 1 2016, 9:29 AM

This change has been deployed to all wikis/all servers. I am considering this resolved in terms of WMF server deployment. Only keeping this open in case it is needed to track mediawiki commits. Otherwise, resolve.

Change 298995 merged by jenkins-bot:
Add rc_name_type_patrolled_timestamp index

https://gerrit.wikimedia.org/r/298995

Anomie closed this task as Resolved.Aug 15 2016, 8:52 PM
Anomie claimed this task.

Changes are deployed and merged. Let's close this.