
codfw API slaves overloaded during the 2017-04-19 codfw switch
Closed, Resolved · Public

Description

The codfw API slaves (db2062, db2069) became overloaded immediately after the codfw switch today. @jcrespo and @Marostegui responded by pooling db2055, but that is likely still not enough.

Details

Related Gerrit Patches:
operations/mediawiki-config@master: db-codfw.php: Depool db2069

Event Timeline

faidon created this task. Apr 19 2017, 4:46 PM
Restricted Application added a subscriber: Aklapper. Apr 19 2017, 4:46 PM
Marostegui moved this task from Triage to In progress on the DBA board. Apr 19 2017, 4:52 PM

Update: we have found that the query plan isn't the same for all the queries: https://phabricator.wikimedia.org/P5293
We believe this is caused by the ALTERs being done on the revision table to get the new indexes and PK, and it might need an ANALYZE TABLE so the queries start using the correct plan and avoid the filesort.
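The maintenance step described above would look roughly like this (a sketch; the database and table names are taken from the queries quoted later in this task):

```sql
-- Rebuild the optimizer statistics for the revision table so the
-- query planner can pick the right index again. On a table the size
-- of enwiki.revision this was estimated at around 3 days.
ANALYZE TABLE enwiki.revision;
```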

The analyze can take around 3 days for enwiki, so Jaime is trying a hack to export the index stats from one host in eqiad (which already shows the correct plan) to one of the hosts in codfw.
We will post the results of that.
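The task does not say exactly how the stats export was attempted; one plausible mechanism (an assumption on my part) is the persistent statistics tables that MySQL/MariaDB keep in the mysql schema, whose rows can be copied between hosts and reloaded without a full ANALYZE:

```sql
-- On the eqiad donor host (illustrative; the exact procedure is an
-- assumption, not taken from this task):
--   mysqldump mysql innodb_table_stats innodb_index_stats \
--     --where="database_name='enwiki' AND table_name='revision'" > stats.sql
-- Then, on the codfw host, load stats.sql into the mysql schema and
-- ask InnoDB to re-read the statistics for the table:
FLUSH TABLE enwiki.revision;
```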

If not, we might need to change the query to FORCE the correct index.
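Forcing the index would look something like this (a sketch based on the EXPLAIN output quoted later in this task; page_timestamp is the index the good plan uses):

```sql
SELECT /* AFComputedVariable::{closure} */ rev_user_text
FROM revision FORCE INDEX (page_timestamp)
WHERE rev_page = '1743794'
ORDER BY rev_timestamp DESC
LIMIT 100;
```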

Anomie added a subscriber: Anomie. Apr 19 2017, 5:34 PM

Change 348970 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2069

https://gerrit.wikimedia.org/r/348970

The first hack hasn't worked as expected. We are now thinking about just depooling the slave and running a normal ANALYZE TABLE on it.

This was an old friend^ we should either add an index hint, send it to RC, or do something else soon; it is now failing too often. See subtask.

db2062 and db2069 are in the same state as yesterday (but with far fewer IOPS than after the initial peak), so that is good.
Their IOPS are pretty much the same as db1072's when eqiad was active:

db1072:
https://grafana.wikimedia.org/dashboard/file/server-board.json?orgId=1&refresh=1m&from=now-2d&to=now&var-server=db1072&var-network=eth0&panelId=18&fullscreen

db2062:
https://grafana.wikimedia.org/dashboard/file/server-board.json?orgId=1&refresh=1m&from=1492616426274&to=1492667920821&var-server=db2062&var-network=eth0&panelId=18&fullscreen

Change 348970 abandoned by Marostegui:
db-codfw.php: Depool db2069

Reason:
Not needed for now

https://gerrit.wikimedia.org/r/348970

db2071 has now been serving traffic for around 1 hour:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db2071&var-network=eth0&from=1492688519355&to=1492692119355
https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2071&from=1492681333602&to=1492692133603

It is looking good, although it is still doing a filesort, which is strange, as Jaime recloned it from db1080, which was not doing a filesort either. However, after bringing both servers up, they now both filesort.
At least, filesorting on db2071 is not as expensive as on the other servers without SSDs (as can be seen on the disk utilization graph).

root@db2071.codfw.wmnet[enwiki]>  EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: page_timestamp,page_user_timestamp,rev_page_id
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 137288
        Extra: Using where; Using filesort
1 row in set (0.04 sec)

Maybe with this extra server we can now depool one of the other "old" ones and let it run ANALYZE TABLE over the weekend, and so on.
It would also be interesting to run ANALYZE on db1080 again to see if it stops filesorting, and then restart MySQL to see if the filesort comes back.

Just for the record, I finished an ALTER TABLE on the revision table on db1065, and it is not filesorting.

root@db1065[enwiki]>  EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: rev_page_id,page_timestamp,page_user_timestamp
          key: page_timestamp
      key_len: 4
          ref: const
         rows: 149164
        Extra: Using where
1 row in set (0.04 sec)

I will restart MySQL tomorrow on that host to see if the same query does a filesort after the restart (and whether we can reproduce the issue seen on db1080).

Mentioned in SAL (#wikimedia-operations) [2017-04-21T06:21:03Z] <marostegui> Restart MySQL on db1065 for maintenance - T163351

After restarting db1065, the server is filesorting:

root@db1065[enwiki]>  EXPLAIN SELECT /* AFComputedVariable::{closure}  */  rev_user_text  FROM `revision`    WHERE rev_page = '1743794'   ORDER BY rev_timestamp DESC LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: ref
possible_keys: rev_page_id,page_timestamp,page_user_timestamp
          key: page_user_timestamp
      key_len: 4
          ref: const
         rows: 136954
        Extra: Using where; Using filesort
1 row in set (0.10 sec)

So the behaviour seen in T163351#3197434 looks related to the server restart (which is terrible).
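If the plan flips after every restart, one thing worth checking (my suggestion, not from this task) is whether persistent InnoDB statistics are enabled, so stats survive restarts, and how many pages are sampled when they are recomputed:

```sql
-- Persistent stats are stored in mysql.innodb_*_stats and survive
-- restarts; if they are disabled, statistics are recomputed from
-- random index dives, which can change the chosen index.
SHOW GLOBAL VARIABLES LIKE 'innodb_stats_persistent';
SHOW GLOBAL VARIABLES LIKE 'innodb_stats_persistent_sample_pages';
```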

jcrespo closed this task as Resolved.Apr 21 2017, 7:54 AM
jcrespo assigned this task to Marostegui.

Technically the overload is not true anymore, but only because we threw hardware at it; enwiki is no longer blocked, it "only" has bad performance and a high failure rate, with probably large save times. We are taking care of those issues at: T163495