
dbstore2001: s5 thread isn't able to catch up with the master
Closed, Resolved · Public

Description

Something is going on with dbstore2001's s5 replication thread, as it has been delayed for almost a week now:

root@dbstore2001:~# mysql --skip-ssl -e "show all slaves status\G" | egrep "Connection|Seconds"
              Connection_name: s1
        Seconds_Behind_Master: 134781
              Connection_name: s2
        Seconds_Behind_Master: NULL
              Connection_name: s3
        Seconds_Behind_Master: 86438
              Connection_name: s4
        Seconds_Behind_Master: NULL
              Connection_name: s5
        Seconds_Behind_Master: 559490
              Connection_name: s6
        Seconds_Behind_Master: 86411
              Connection_name: s7
        Seconds_Behind_Master: 86429
              Connection_name: x1
        Seconds_Behind_Master: NULL

The following appears in dmesg; it is a bit old and might not be related, but it is worth mentioning:

[Wed Jun  7 13:10:31 2017] hpsa 0000:02:00.0: scsi 0:0:0:0 Aborting command ffff880ff7451840Tag:0x00000000:00000180 CDBLen: 6 CDB: 0x1201... SN: 0x0  BEING SENT
[Wed Jun  7 13:10:31 2017] hpsa 0000:02:00.0: scsi 0:3:0:0: Aborting command RAID              HP       P420i            controller SSDSmartPathCap- En- Exp=1
[Wed Jun  7 13:10:31 2017] hpsa 0000:02:00.0: CDB 12018000600000000000000000000000 was aborted with status 0x0
[Wed Jun  7 13:10:31 2017] hpsa 0000:02:00.0: invalid command: LUN:0000000000000000 CDB:00000000800100000000000000000000
[Wed Jun  7 13:10:31 2017] hpsa 0000:02:00.0: probably means device no longer present
[Wed Jun  7 13:10:31 2017] hpsa 0000:02:00.0: scsi 0:0:0:0 Aborting command ffff880ff7451840Tag:0x00000000:00000180 CDBLen: 6 CDB: 0x1201... SN: 0x0  SENT, FAILED
[Wed Jun  7 13:10:31 2017] hpsa 0000:02:00.0: scsi 0:3:0:0: FAILED to abort command RAID              HP       P420i            controller SSDSmartPathCap- En- Exp=1

Event Timeline

Marostegui moved this task from Triage to In progress on the DBA board.

Something started on the 12th between 12:20 and 13:00:
https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2001&from=1496727827117&to=1497937427117

There is nothing really relevant in SAL:

13:24 aharoni: running mwscript updateCollation.php --wiki=bawiki
13:23 hashar: terbium: for T139110 mwscript updateCollation.php --wiki=fawiktionary --previous-collation=uca-fa
13:22 hashar: terbium: for T139110 mwscript updateCollation.php --wiki=fawikisource --previous-collation=uca-fa
13:21 hashar: terbium: for T139110 mwscript updateCollation.php --wiki=fawiki --previous-collation=uca-fa
13:17 moritzm: upgrading cp1008 to openssl 1.1.0f
13:13 hashar@tin: Synchronized wmf-config/InitialiseSettings.php: Set collation for Bashkir wikis to uppercase-ba - T162823 (duration: 00m 41s)
13:10 hashar@tin: Synchronized wmf-config/InitialiseSettings.php: update some logos 6974b9ab4..76939d15f (duration: 00m 41s)
13:08 hashar@tin: Synchronized static/images/project-logos: (no justification provided) (duration: 00m 43s)
12:01 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Depool db1089 for maintenance - T166935 (duration: 00m 41s)
12:01 moritzm: upgrading mw1266-mw1275 to HHVM 3.18
11:09 joal@tin: Finished deploy [analytics/refinery@d9c3419]: Regular weekly deploy of refinery (mostly unique_devices patches) (duration: 06m 18s)

Nothing in the MySQL logs around that day and time.

Mentioned in SAL (#wikimedia-operations) [2017-06-20T07:22:12Z] <marostegui> Stop MySQL dbstore2001 for maintenance - T168354

There is clearly something wrong with this host.
In T165033#3257985 we saw that it normally takes around 3 hours for MySQL to come up after the socket is created; so far it has been almost 8 hours and the server still isn't available:

root@dbstore2001:~# cat /srv/sqldata/dbstore2001.err | grep "Server socket created" | tail -n1
170620  7:39:09 [Note] Server socket created on IP: '::'.
root@dbstore2001:~# date
Tue Jun 20 15:20:24 UTC 2017

I am going to leave it like that to see how long it actually takes. But this server needs to be migrated to multi-instance.

And it is up:

170620 16:04:28 [Note] /opt/wmf-mariadb10/bin/mysqld: ready for connections

So it took about 8:30h from socket creation until the server was fully available.
I have started the slaves now.
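
For reference, the startup window can be measured straight from the error log by pulling out both timestamps (a minimal sketch, using the same log path as above):

# show the last "Server socket created" and "ready for connections" entries side by side
grep -E "Server socket created|ready for connections" /srv/sqldata/dbstore2001.err | tail -n 2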

I have stopped all the slaves but s5, and I will leave it running alone until tomorrow morning to see if it is able to reduce its lag when it has the whole server to itself.
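
In MariaDB multi-source terms, that roughly amounts to the following (a sketch based on the connection names shown at the top of the task, not the exact statements run):

-- stop every replication connection, then bring back only s5
STOP ALL SLAVES;
START SLAVE 's5';
-- the next morning, resume the remaining connections
START ALL SLAVES;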

During the night and until now, s5, running by itself, not only didn't catch up but fell further behind.
There is something wrong with this host (or at least with its data). For now, and until we get to work on T168409, I am going to reimport s5 here.
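
For illustration, a reimport along these lines could look like the sketch below; the source host, tool and paths are assumptions, not a record of what was actually run:

# dump the s5 wikis (dewiki, wikidatawiki) from a depooled s5 replica...
mydumper --host=db2066.codfw.wmnet --regex '^(dewiki|wikidatawiki)\.' \
         --threads=4 --outputdir=/srv/tmp/s5-dump
# ...and load them into dbstore2001
myloader --host=localhost --directory=/srv/tmp/s5-dump --threads=4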

Change 360607 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2066

https://gerrit.wikimedia.org/r/360607

Change 360607 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2066

https://gerrit.wikimedia.org/r/360607

Mentioned in SAL (#wikimedia-operations) [2017-06-21T06:18:09Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2066 - T168354 (duration: 00m 43s)

Mentioned in SAL (#wikimedia-operations) [2017-06-21T06:19:15Z] <marostegui> Stop replication and puppet on db2066 for maintenance - T168354

Mentioned in SAL (#wikimedia-operations) [2017-06-21T07:37:11Z] <marostegui> Stop and reset slave s5 on dbstore2001 - T168354
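
That stop/reset step corresponds to something like the following (a sketch; it discards the s5 connection so it can be recreated cleanly after the reimport):

STOP SLAVE 's5';
RESET SLAVE 's5' ALL;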

Marostegui renamed this task from "dbstore2001 s5 thread is 6 days delayed" to "dbstore2001: s5 thread isn't able to catch up with the master". Jun 21 2017, 8:36 AM

Mentioned in SAL (#wikimedia-operations) [2017-06-23T06:08:26Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool db2066 - T168354 (duration: 00m 46s)

s5 is now replicating again on dbstore2001 after a few days spent reimporting it.
Right now, all the shards are a few days delayed due to the increased load on dbstore2001 while importing dewiki and wikidata.

We will see how the server copes with the load and how long it takes for all the shards to catch up (if they ever do).

Mentioned in SAL (#wikimedia-operations) [2017-06-27T07:16:04Z] <marostegui> Temporarily disable event scheduler on dbstore2001 - T168354

Mentioned in SAL (#wikimedia-operations) [2017-06-27T08:14:15Z] <marostegui> Re-enable event scheduler on dbstore2001 - T168354
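
For the record, toggling the event scheduler is just the standard global switch:

SET GLOBAL event_scheduler = OFF;  -- before the maintenance window
SET GLOBAL event_scheduler = ON;   -- once done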

Mentioned in SAL (#wikimedia-operations) [2017-06-28T05:55:08Z] <marostegui> Temporarily disable event scheduler on dbstore2001 - https://phabricator.wikimedia.org/T168354

Mentioned in SAL (#wikimedia-operations) [2017-06-28T06:34:56Z] <marostegui> Stop Replication in sync on db2033 and dbstore2001 (x1) - T168354
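
One way to stop two replicas of the same master at the same coordinates (a sketch of the general technique, not necessarily the exact procedure used here):

-- on db2033: stop and note the executed coordinates
STOP SLAVE;
SHOW SLAVE STATUS\G  -- note Relay_Master_Log_File / Exec_Master_Log_Pos

-- on dbstore2001: let the x1 connection run up to that same position, where it will stop
STOP SLAVE 'x1';
START SLAVE 'x1' UNTIL MASTER_LOG_FILE='<file noted above>', MASTER_LOG_POS=<position noted above>;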

Mentioned in SAL (#wikimedia-operations) [2017-06-28T07:33:35Z] <marostegui> Re-enable event scheduler on dbstore2001 - T168354

I have restored the original replication filters for x1 after reimporting the missing tables from x1 into dewiki and wikidatawiki.
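
In a MariaDB multi-source setup, such filters can be scoped to a single connection by prefixing them with the connection name in the configuration; a purely hypothetical sketch (the real filter list is not shown in this task):

# my.cnf-style sketch; the table patterns are illustrative only
x1.replicate_wild_do_table = dewiki.%
x1.replicate_wild_do_table = wikidatawiki.%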

An update:
The shards keep catching up (they were really delayed by the maintenance, which lasted pretty much a week).
x1, s3 and s6 have already caught up.
The rest (including s5, which was the one that triggered this maintenance) are catching up slowly but nicely.

Mentioned in SAL (#wikimedia-operations) [2017-07-03T08:29:59Z] <marostegui> Compress dewiki on dbstore2001 - T168354
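
Compressing here means converting dewiki's InnoDB tables to compressed row format, table by table; a minimal sketch (table name and key block size are illustrative):

-- requires innodb_file_per_table; repeat per table
ALTER TABLE dewiki.revision ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;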

So, the scope of this task (s5) was fixed a couple of weeks ago. However, there is a performance problem with this host, which will (hopefully) only be solved by finishing T168409.

All the shards have the expected 24h delay except s1, which is delayed by around a week. Why? Basically, whenever there are heavy IO operations on this host (i.e. importing s5, compressing s5...), s1 gets really delayed until the operation finishes, and then it maintains that delay but is not able to catch up. Again, it doesn't get delayed further, but it doesn't catch up either; it just keeps that lag.

It happens only with s1 (which is the most active shard). So this host has a problem, and we believe it might be some sort of InnoDB corruption dating back to the crash it had months ago.
I will close this as resolved, as s5 has been up to date for some weeks now, but s1 is still (and will unfortunately remain) delayed.
The only solution, as mentioned before, is to migrate this host to multi-instance and re-import s1 as part of T168409.