
codfw API slaves overloaded during the 2017-04-19 codfw switch
Closed, Resolved · Public

Description

The codfw API slaves (db2062, db2069) became overloaded immediately after the codfw switch today. @jcrespo and @Marostegui responded by pooling db2055, but that is likely still not enough.

Details

Related Gerrit Patches:
operations/mediawiki-config@master: db-codfw.php: Depool db2069

Event Timeline

faidon created this task. Apr 19 2017, 4:46 PM
Restricted Application added a subscriber: Aklapper. Apr 19 2017, 4:46 PM
Marostegui moved this task from Triage to In progress on the DBA board. Apr 19 2017, 4:52 PM

Update: we have found that the query plan isn't the same for all the queries: https://phabricator.wikimedia.org/P5293
We believe this is caused by the ALTERs being done on the revision table to get the new indexes and PK, and it might need an ANALYZE TABLE so the queries start using the correct plan and avoid the filesort.
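The maintenance step described above would look roughly like this (a sketch; the database and table names are taken from the queries quoted later in this task):

```sql
-- Rebuild the optimizer statistics for the revision table so the
-- query planner can pick the right index again. On a table the size
-- of enwiki.revision this was estimated at around 3 days.
ANALYZE TABLE enwiki.revision;
```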

The analyze can take around 3 days for enwiki, so Jaime is trying a hack to export the index stats from one host in eqiad (which already shows the correct plan) to one of the hosts in codfw.
We will post the results of that.
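The task does not say exactly how the stats export was attempted; one plausible mechanism (an assumption on my part) is the persistent statistics tables that MySQL/MariaDB keep in the mysql schema, whose rows can be copied between hosts and reloaded without a full ANALYZE:

```sql
-- On the eqiad donor host (illustrative; the exact procedure is an
-- assumption, not taken from this task):
--   mysqldump mysql innodb_table_stats innodb_index_stats \
--     --where="database_name='enwiki' AND table_name='revision'" > stats.sql
-- Then, on the codfw host, load stats.sql into the mysql schema and
-- ask InnoDB to re-read the statistics for the table:
FLUSH TABLE enwiki.revision;
```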

If not, we might need to change the query to FORCE the correct index.
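Forcing the index would look something like this (a sketch based on the EXPLAIN output quoted later in this task; page_timestamp is the index the good plan uses):

```sql
SELECT /* AFComputedVariable::{closure} */ rev_user_text
FROM revision FORCE INDEX (page_timestamp)
WHERE rev_page = '1743794'
ORDER BY rev_timestamp DESC
LIMIT 100;
```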

Anomie added a subscriber: Anomie. Apr 19 2017, 5:34 PM

Change 348970 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2069

https://gerrit.wikimedia.org/r/348970

The first hack hasn't worked as expected. We are now thinking about just depooling the slave and running a normal ANALYZE TABLE on it.

This was an old friend^ we should either add an index hint, send it to RC, or do something else soon; it is now failing too often. See subtask.

db2062 and db2069 are in the same state as yesterday (but with far fewer IOPS than after the initial peak), so that is good.
Their IOPS are pretty much the same as db1072's when eqiad was active:

db1072:
https://grafana.wikimedia.org/dashboard/file/server-board.json?orgId=1&refresh=1m&from=now-2d&to=now&var-server=db1072&var-network=eth0&panelId=18&fullscreen

db2062:
https://grafana.wikimedia.org/dashboard/file/server-board.json?orgId=1&refresh=1m&from=1492616426274&to=1492667920821&var-server=db2062&var-network=eth0&panelId=18&fullscreen

Change 348970 abandoned by Marostegui:
db-codfw.php: Depool db2069

Reason:
Not needed for now

https://gerrit.wikimedia.org/r/348970

db2071 has now been serving traffic for around 1 hour:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db2071&var-network=eth0&from=1492688519355&to=1492692119355
https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2071&from=1492681333602&to=1492692133603

It is looking good, although it is still doing a filesort, which is strange, as Jaime recloned it from db1080, which was not doing a filesort either. However, after bringing both servers up, they now both filesort.
At least, filesorting on db2071 is not as expensive as on the other servers without SSDs (as can be seen on the disk utilization graph).

root@db2071.codfw.wmnet[enwiki]>  EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 137288
        Extra: Using where; Using filesort
1 row in set (0.04 sec)

Maybe with this extra server we can now depool one of the other "old" ones and let it run ANALYZE TABLE over the weekend, and so on.
It would also be interesting to run ANALYZE on db1080 again to see if it stops filesorting, and then restart MySQL to see if the filesort comes back.

Just for the record, I finished an ALTER TABLE on the revision table on db1065, and it is not filesorting.

root@db1065[enwiki]>  EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: rev_page_id,page_timestamp,page_user_timestamp
          key: page_timestamp
      key_len: 4
          ref: const
         rows: 149164
        Extra: Using where
1 row in set (0.04 sec)

I will restart MySQL tomorrow on that host to see if the same query does a filesort after the restart (and whether we can reproduce the issue seen on db1080).

Mentioned in SAL (#wikimedia-operations) [2017-04-21T06:21:03Z] <marostegui> Restart MySQL on db1065 for maintenance - T163351

After restarting db1065, the server is filesorting:

root@db1065[enwiki]>  EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: rev_page_id,page_timestamp,page_user_timestamp
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 136954
        Extra: Using where; Using filesort
1 row in set (0.10 sec)

So the behaviour seen in T163351#3197434 looks related to the server restart (which is terrible).
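If the plan flips after every restart, one thing worth checking (my suggestion, not from this task) is whether persistent InnoDB statistics are enabled, so stats survive restarts, and how many pages are sampled when they are recomputed:

```sql
-- Persistent stats are stored in mysql.innodb_*_stats and survive
-- restarts; if they are disabled, statistics are recomputed from
-- random index dives, which can change the chosen index.
SHOW GLOBAL VARIABLES LIKE 'innodb_stats_persistent';
SHOW GLOBAL VARIABLES LIKE 'innodb_stats_persistent_sample_pages';
```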

jcrespo closed this task as Resolved.Apr 21 2017, 7:54 AM
jcrespo assigned this task to Marostegui.

Technically the overload is not true anymore, but only because we threw hardware at it; enwiki is no longer blocked, it "only" has bad performance and a high failure rate, with probably large save times. We are taking care of those issues at: T163495