The codfw API slaves (db2062, db2069) overloaded immediately after the codfw switch today. @jcrespo @Marostegui responded by pooling db2055, but that is likely still not enough.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
db-codfw.php: Depool db2069 | operations/mediawiki-config | master | +2 -2
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | akosiaris | T199073 Perform a datacenter switchover (2018-19 Q1)
Resolved | | None | T163495 Mediawiki revision-related queries are failing with high rate for enwiki on codfw
Resolved | | Marostegui | T163351 codfw API slaves overloaded during the 2017-04-19 codfw switch
Resolved | | None | T163413 Pool new server db2071
Event Timeline
Update: we have found that the query plan isn't the same for all the queries: https://phabricator.wikimedia.org/P5293
We believe this is caused by the ALTERs being done on the revision table to get the new indexes and PK, and it might need an ANALYZE TABLE so the queries start using the correct plan and avoid the filesort.
The ANALYZE can take around 3 days for enwiki, so Jaime is trying a hack: exporting the index stats from one host in eqiad (which already shows the correct plan) to one of the hosts in codfw.
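The task does not record the exact commands used for this hack, but with persistent InnoDB statistics the optimizer's estimates live in the user-editable mysql.innodb_table_stats and mysql.innodb_index_stats tables, so a minimal sketch of the export could look like the following (the stat values below are placeholders, not real numbers from these hosts):

```sql
-- On the eqiad host with the good plan: dump the stats rows for the table
-- (e.g. with mysqldump or SELECT ... INTO OUTFILE).
SELECT * FROM mysql.innodb_index_stats
WHERE database_name = 'enwiki' AND table_name = 'revision';

-- On the codfw host: overwrite its rows with the exported ones,
-- one REPLACE per exported row (placeholder values shown).
REPLACE INTO mysql.innodb_index_stats
    (database_name, table_name, index_name, last_update,
     stat_name, stat_value, sample_size, stat_description)
VALUES
    ('enwiki', 'revision', 'page_timestamp', NOW(),
     'n_diff_pfx01', 123456789, 20, 'rev_page');

-- Tell InnoDB to load the modified statistics for the table.
USE enwiki;
FLUSH TABLE revision;
```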
We will post the results of that.
If that does not work, we might need to change the query to FORCE the correct index.
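A sketch of what that hint could look like, assuming the good plan is the one that reads the page_timestamp index (as the later EXPLAINs in this task show), which matches the ORDER BY and so avoids the filesort:

```sql
SELECT /* AFComputedVariable::{closure} */ rev_user_text
FROM revision FORCE INDEX (page_timestamp)  -- pin the index matching the ORDER BY
WHERE rev_page = '1743794'
ORDER BY rev_timestamp DESC
LIMIT 100;
```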
Change 348970 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2069
The first hack hasn't worked as expected. We are thinking about just depooling the slave and running the normal ANALYZE TABLE on it.
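For reference, the "normal" path is just the statement below on the depooled slave; recomputing the statistics this way is the step estimated above at around 3 days for enwiki:

```sql
-- Recompute the persistent index statistics for the revision table.
ANALYZE TABLE enwiki.revision;
```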
This was an old friend^. We should either add an index hint, send it to RC, or do something soon, as it is now failing too often. See subtask.
db2062 and db2069 are in the same state as yesterday (but with far fewer IOPS than after the initial peak), so that is good.
The IOPS they are doing are pretty much the same as db1072 was doing when eqiad was active:
Change 348970 abandoned by Marostegui:
db-codfw.php: Depool db2069
Reason:
Not needed for now
db2071 has now been serving traffic for around 1 hour:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db2071&var-network=eth0&from=1492688519355&to=1492692119355
https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2071&from=1492681333602&to=1492692133603
It is looking good, although it is still doing filesort, which is strange, as Jaime recloned it from db1080, which was not doing filesort either. However, after bringing both servers up, they both now filesort.
At least, filesorting on db2071 isn't as expensive as on the other servers without SSDs (as can be seen on the disk utilization graph):
```
root@db2071.codfw.wmnet[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure} */ rev_user_text FROM `revision` WHERE rev_page = '1743794' ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 137288
        Extra: Using where; Using filesort
1 row in set (0.04 sec)
```
Maybe with this extra server we can now depool one of the other "old" ones and let it run ANALYZE over the weekend.
It would also be interesting to test ANALYZE again on db1080, see if it stops filesorting, and then restart MySQL to see if the filesort comes back. A sketch of that test follows below.
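A sketch of that test, assuming it is run from the mysql client on a depooled db1080:

```sql
-- 1. Recompute statistics and check the plan.
ANALYZE TABLE enwiki.revision;
EXPLAIN SELECT rev_user_text FROM enwiki.revision
WHERE rev_page = '1743794'
ORDER BY rev_timestamp DESC LIMIT 100\G
-- Expect: key = page_timestamp and no "Using filesort" in Extra.

-- 2. Restart mysqld (outside the client), then re-run the same EXPLAIN
--    and check whether "Using filesort" has come back.
```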
Just for the record, I finished an ALTER TABLE on the revision table on db1065, and it is not filesorting:
```
root@db1065[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure} */ rev_user_text FROM `revision` WHERE rev_page = '1743794' ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: rev_page_id,page_timestamp,page_user_timestamp
          key: page_timestamp
      key_len: 4
          ref: const
         rows: 149164
        Extra: Using where
1 row in set (0.04 sec)
```
I will restart MySQL tomorrow on that host to see if the same query filesorts after the restart (and whether we can reproduce the issue seen with db1080).
Mentioned in SAL (#wikimedia-operations) [2017-04-21T06:21:03Z] <marostegui> Restart MySQL on db1065 for maintenance - T163351
After restarting db1065, the server is filesorting:
```
root@db1065[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure} */ rev_user_text FROM `revision` WHERE rev_page = '1743794' ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: rev_page_id,page_timestamp,page_user_timestamp
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 136954
        Extra: Using where; Using filesort
1 row in set (0.10 sec)
```
So the behaviour seen in T163351#3197434 looks related to the server restart (which is terrible).
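One thing worth checking under that hypothesis (our note, not from the task): how InnoDB statistics are configured on these hosts, since with non-persistent stats the cardinality estimates are re-derived from a small random sample when a table is first opened after a restart, which can flip the chosen plan:

```sql
-- Show the statistics-related settings. innodb_stats_persistent = OFF, or a
-- small *_sample_pages value, would be consistent with a plan that changes
-- across restarts; innodb_stats_auto_recalc controls recalculation after
-- large row changes.
SHOW GLOBAL VARIABLES LIKE 'innodb_stats%';
```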
Technically the overload is not true anymore, but only because we threw hardware at it. enwiki is no longer blocked, it "only" has bad performance and a high failure rate, with probably large save times. We are taking care of those issues at: T163495