
Reclone db1114 (s1 api) from another API host
Closed, Resolved · Public

Description

db1114 (s1 API replica) has a high rate of query errors, and it is the host that produces a wrong query plan for ApiQueryRevisions::run; even though it was upgraded, rebooted and its tables analyzed, nothing could be found.
Let's reclone it from another s1 API host once eqiad is passive.

Event Timeline

Marostegui triaged this task as Normal priority. Sep 5 2018, 1:07 PM
Marostegui created this task.
Restricted Application added a subscriber: Aklapper. Sep 5 2018, 1:07 PM
Marostegui moved this task from Triage to Next on the DBA board. Sep 5 2018, 1:07 PM
Banyek claimed this task. Sep 17 2018, 9:41 AM

Change 460867 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] mariadb: disable notifications for db1114

https://gerrit.wikimedia.org/r/460867

Change 460867 merged by Banyek:
[operations/puppet@production] mariadb: disable notifications for db1114

https://gerrit.wikimedia.org/r/460867

Change 460869 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/mediawiki-config@master] maridadb: depool db1114

https://gerrit.wikimedia.org/r/460869

Change 460869 merged by jenkins-bot:
[operations/mediawiki-config@master] maridadb: depool db1114

https://gerrit.wikimedia.org/r/460869

Mentioned in SAL (#wikimedia-operations) [2018-09-17T12:58:29Z] <banyek@deploy1001> Synchronized wmf-config/db-eqiad.php: T203565: depooling db1114 (duration: 00m 49s)

Change 460898 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] mariadb: Wipe srv partition on db1114

https://gerrit.wikimedia.org/r/460898

Change 460898 merged by Banyek:
[operations/puppet@production] mariadb: Wipe srv partition on db1114

https://gerrit.wikimedia.org/r/460898

Script wmf-auto-reimage was launched by banyek on neodymium.eqiad.wmnet for hosts:

['db1114.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201809171348_banyek_6985.log.

Change 460906 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] MariaDB: mute notifications of db1119

https://gerrit.wikimedia.org/r/460906

Completed auto-reimage of hosts:

['db1114.eqiad.wmnet']

and were ALL successful.

Change 460907 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/mediawiki-config@master] Labsdb: depool db1119

https://gerrit.wikimedia.org/r/460907

Change 460906 merged by Banyek:
[operations/puppet@production] MariaDB: mute notifications of db1119

https://gerrit.wikimedia.org/r/460906

Change 460907 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: depool db1119

https://gerrit.wikimedia.org/r/460907

Mentioned in SAL (#wikimedia-operations) [2018-09-17T14:27:19Z] <banyek@deploy1001> Synchronized wmf-config/db-eqiad.php: T203565: depooling db1119 (duration: 00m 49s)

jcrespo moved this task from Next to In progress on the DBA board. Sep 18 2018, 9:14 AM

Change 461124 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] MariaDB: Enable notifications db1114 & db1119

https://gerrit.wikimedia.org/r/461124

Change 461124 merged by Banyek:
[operations/puppet@production] MariaDB: Enable notifications db1114 & db1119

https://gerrit.wikimedia.org/r/461124

Change 461128 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/mediawiki-config@master] db-equiad: repool db1114 and db1119

https://gerrit.wikimedia.org/r/461128

Change 461128 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad: repool db1114 and db1119

https://gerrit.wikimedia.org/r/461128

Mentioned in SAL (#wikimedia-operations) [2018-09-18T13:57:48Z] <banyek@deploy1001> Synchronized wmf-config/db-eqiad.php: T203565: Repool db1114 and db1119 (duration: 00m 49s)

The recloning finished, but I am not closing the ticket yet: it still needs to be checked whether the query plan is good again.

Change 461360 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/mediawiki-config@master] db-eqiad.php: Set API servers weight

https://gerrit.wikimedia.org/r/461360

Change 461360 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set API servers weight

https://gerrit.wikimedia.org/r/461360

Mentioned in SAL (#wikimedia-operations) [2018-09-19T10:07:31Z] <banyek@deploy1001> Synchronized wmf-config/db-eqiad.php: T203565: Weight adjust for S1 API hosts (duration: 00m 57s)

Banyek closed this task as Resolved. Sep 19 2018, 10:27 AM

The root cause of the different query plans was the different index order in the schemas on the hosts.

SHOW CREATE TABLE enwiki.revision:
[...]
KEY `rev_timestamp` (`rev_timestamp`),
KEY `page_timestamp` (`rev_page`,`rev_timestamp`),
KEY `user_timestamp` (`rev_user`,`rev_timestamp`),
KEY `usertext_timestamp` (`rev_user_text`,`rev_timestamp`),
KEY `page_user_timestamp` (`rev_page`,`rev_user`,`rev_timestamp`),
KEY `rev_page_id` (`rev_page`,`rev_id`)

In this case the optimizer decides to use page_timestamp.

KEY `rev_page_id` (`rev_page`,`rev_id`),
KEY `page_timestamp` (`rev_page`,`rev_timestamp`),
KEY `rev_timestamp` (`rev_timestamp`),
KEY `user_timestamp` (`rev_user`,`rev_timestamp`),
KEY `usertext_timestamp` (`rev_user_text`,`rev_timestamp`),
KEY `page_user_timestamp` (`rev_page`,`rev_user`,`rev_timestamp`)

In this case the optimizer decides to use rev_timestamp.
When the correct index was forced, the query had the same execution plan on both hosts (and returned the same results).
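
For illustration, this is roughly the shape of the check; the actual SQL generated by ApiQueryRevisions::run is more complex, and the page id and timestamp below are made-up values:

```
-- Sketch only: not the real MediaWiki query, just a simplified shape to compare plans.
-- With USE INDEX (page_timestamp) both schema variants pick the same index.
EXPLAIN
SELECT rev_id, rev_timestamp
FROM enwiki.revision USE INDEX (page_timestamp)
WHERE rev_page = 12345
  AND rev_timestamp <= '20180917000000'
ORDER BY rev_timestamp DESC
LIMIT 51;
```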

Good catch.
Can you open a task to get that checked and fixed across all the hosts where this difference exists?

It is not entirely clear to me what the differences are. Which hosts did you compare?

A quick recap:
a, The original issue (the different API hosts have different query plans) is solved.
b, A new issue emerged: the API hosts have really bad performance with the query from ApiQueryRevisions::run.
After checking the revision table in enwiki across all the hosts in s1, it seems the indexes are not defined in the same order across the cluster.
Several hosts have this order (on these hosts the query performs well):

SHOW CREATE TABLE enwiki.revision:
[...]
KEY `rev_timestamp` (`rev_timestamp`),
KEY `page_timestamp` (`rev_page`,`rev_timestamp`),
KEY `user_timestamp` (`rev_user`,`rev_timestamp`),
KEY `usertext_timestamp` (`rev_user_text`,`rev_timestamp`),
KEY `page_user_timestamp` (`rev_page`,`rev_user`,`rev_timestamp`),
KEY `rev_page_id` (`rev_page`,`rev_id`)

In this case the optimizer will use page_timestamp:

+------+-------------+---------------------+--------+--------------------------------------------------------------+----------------+---------+-----------------------------------------------+-------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------------------+--------+--------------------------------------------------------------+----------------+---------+-----------------------------------------------+-------+-----------------------+
| 1 | SIMPLE | revision | range | rev_timestamp,page_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 20 | NULL | 42160 | Using index condition |
| 1 | SIMPLE | temp_rev_comment | ref | PRIMARY,revcomment_rev | PRIMARY | 4 | enwiki.revision.rev_id | 1 | Using index |
| 1 | SIMPLE | comment_rev_comment | eq_ref | PRIMARY | PRIMARY | 8 | enwiki.temp_rev_comment.revcomment_comment_id | 1 | Using where |
| 1 | SIMPLE | page | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.revision.rev_page | 1 | |
+------+-------------+---------------------+--------+--------------------------------------------------------------+----------------+---------+-----------------------------------------------+-------+-----------------------+

In other cases the order of the indexes is:

SHOW CREATE TABLE enwiki.revision:
[...]
KEY `rev_page_id` (`rev_page`,`rev_id`),
KEY `page_timestamp` (`rev_page`,`rev_timestamp`),
KEY `rev_timestamp` (`rev_timestamp`),
KEY `user_timestamp` (`rev_user`,`rev_timestamp`),
KEY `usertext_timestamp` (`rev_user_text`,`rev_timestamp`),
KEY `page_user_timestamp` (`rev_page`,`rev_user`,`rev_timestamp`)

In this case the optimizer chooses rev_timestamp:

+------+-------------+---------------------+--------+--------------------------------------------------------------+---------------+---------+-----------------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------------------+--------+--------------------------------------------------------------+---------------+---------+-----------------------------------------------+-------+-------------+
| 1 | SIMPLE | revision | index | rev_page_id,page_timestamp,rev_timestamp,page_user_timestamp | rev_timestamp | 16 | NULL | 31581 | Using where |
| 1 | SIMPLE | temp_rev_comment | ref | PRIMARY,revcomment_rev | PRIMARY | 4 | enwiki.revision.rev_id | 1 | Using index |
| 1 | SIMPLE | comment_rev_comment | eq_ref | PRIMARY | PRIMARY | 8 | enwiki.temp_rev_comment.revcomment_comment_id | 1 | Using where |
| 1 | SIMPLE | page | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.revision.rev_page | 1 | |
+------+-------------+---------------------+--------+--------------------------------------------------------------+---------------+---------+-----------------------------------------------+-------+-------------+

There are a few ways to solve this:
a, we could drop the indexes and recreate them in the right order - which is a terribly bad idea, because who knows what else that would affect
b, we could reclone all the hosts from one which has a good index order - this is simply a bad idea, because there are too many machines to do this
c, we can enforce page_timestamp (USE INDEX (page_timestamp) or IGNORE INDEX (rev_timestamp)) - which is probably the best way to solve this; see the sketch below.
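
A minimal sketch of what option c means at the query level; the column list and literal values are illustrative, not what MediaWiki actually generates, and the hint would ultimately have to be added in the ApiQueryRevisions code:

```
-- Sketch only: IGNORE INDEX takes rev_timestamp out of consideration, so the
-- optimizer falls back to page_timestamp regardless of index definition order.
SELECT rev_id, rev_timestamp
FROM enwiki.revision IGNORE INDEX (rev_timestamp)
WHERE rev_page = 12345
ORDER BY rev_timestamp DESC
LIMIT 51;
```

Either hint form (USE INDEX or IGNORE INDEX) should make the plan consistent across hosts regardless of which index order a replica ended up with.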

Thanks for the summary!

I agree that option c is probably the best way to go.
a) Would bring lots of uncertainty: it might fix this query, but what could happen to other queries is unknown.
b) That is a no-go indeed. We'd need to first pick which is the correct order, and then start recloning all hosts across all sections.

I created a bug in the MediaWiki-Database board regarding this: T204926

Banyek reopened this task as Open. Sep 20 2018, 1:50 PM

I am re-opening this ticket until the query plans are fixed.

Do all the API hosts have the same query plan now? Or do we have to tweak https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/461360/ ?

All of them have the same (wrong) query plan.

Banyek moved this task from Backlog to On hold on the User-Banyek board. Sep 20 2018, 1:56 PM
Banyek moved this task from On hold to Blocked on the User-Banyek board.

@Banyek why is this still open if we have T204926? The scope of this ticket (recloning the host) is done.

Banyek added a comment (edited). Sep 25 2018, 1:35 PM

@Marostegui I am ok with closing this

Banyek closed this task as Resolved. Sep 25 2018, 1:35 PM

I am fine too - just asking if there was some reason behind keeping it open that I might have missed :-)