Slot Number: 8
Media Error Count: 417
Drive has flagged a S.M.A.R.T alert : Yes
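Counters like the ones above come from the controller's physical-drive listing. A minimal parsing sketch (the sample output is inlined so it runs without the controller; on a real host you would pipe `megacli -PDList -aall` in instead):

```shell
# Sample controller output, inlined for demonstration.
sample='Slot Number: 8
Media Error Count: 417
Drive has flagged a S.M.A.R.T alert : Yes'

# Report any slot with a flagged S.M.A.R.T alert and its media error count.
printf '%s\n' "$sample" | awk '
  /Slot Number/       { slot = $NF }
  /Media Error Count/ { errs = $NF }
  /S.M.A.R.T alert/   { if ($NF == "Yes") printf "slot %s: %s media errors, SMART alert\n", slot, errs }'
```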
It might be easier just to replace the disk even if this host will go away at some point.
@Papaul do you have spare disks?
And the disk finally failed (note how slot 8 no longer appears below) and it was automatically detected on T170503:
root@db2019:~# megacli -PDList -aall | grep Slot
Slot Number: 0
Slot Number: 1
Slot Number: 2
Slot Number: 3
Slot Number: 4
Slot Number: 5
Slot Number: 6
Slot Number: 7
Slot Number: 9
Slot Number: 10
Slot Number: 11
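The dropped drive can be spotted by diffing the expected slot range against what the controller reports. A small sketch (the reported list is hard-coded from the output above; in practice it would be extracted from the megacli listing):

```shell
# Slots the controller still reports (slot 8 has vanished).
reported='0 1 2 3 4 5 6 7 9 10 11'

# Flag any slot in 0..11 that is no longer present.
for slot in $(seq 0 11); do
  case " $reported " in
    *" $slot "*) ;;                    # still present
    *) echo "slot $slot missing" ;;    # dropped from the listing
  esac
done
```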
The disk was replaced and the raid is back to optimal (T170503#3436419), let's see if it has any effect on this issue in the next few days.
This has clearly changed its pattern since the bad disk was replaced and it is looking better already:
https://grafana.wikimedia.org/render/dashboard-solo/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2019&from=1498890025720&to=1500328389967&panelId=6&width=1000&height=500&tz=UTC%2B02%3A00
As part of T170662 we will probably switch over the s4 master on codfw, so this will get resolved.
I was thinking about either db2051 or db2065 to replace the current master.
db2051 is probably the right one, but we need to spread that hw batch, as you suggested, first.
Change 369626 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2051
Change 369626 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2051
Mentioned in SAL (#wikimedia-operations) [2017-08-02T11:19:37Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2051 - T170351 (duration: 00m 46s)
Change 369633 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2051.yaml: Update its socket location
Mentioned in SAL (#wikimedia-operations) [2017-08-02T11:33:43Z] <marostegui> Stop MySQL on db2051 for maintenance - T170351
Change 369633 merged by Marostegui:
[operations/puppet@production] db2051.yaml: Update its socket location
Mentioned in SAL (#wikimedia-operations) [2017-08-02T14:12:08Z] <marostegui> Stop MySQL on db2051 in order to get it ready to move to another rack - T170351
db2051 has been moved and is now replicating again from its new location.
Thanks @Papaul!
Change 369842 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Restore db2051 original values
Change 369842 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Restore db2051 original values
Mentioned in SAL (#wikimedia-operations) [2017-08-03T05:57:15Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool db2051 - T170351 (duration: 00m 54s)
Change 369877 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s4.hosts: db2051 is now s4 codfw master
Change 369879 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Promote db2051 to master
Change 369880 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db2051 as the new s4 codfw master
Mentioned in SAL (#wikimedia-operations) [2017-08-03T12:15:53Z] <marostegui> Restart MySQL on db2051 - T170351
Mentioned in SAL (#wikimedia-operations) [2017-08-03T13:01:29Z] <marostegui> Disable gtid on s4 codfw slaves to get ready for the topology change - T170351
Mentioned in SAL (#wikimedia-operations) [2017-08-03T13:14:12Z] <marostegui> Start topology change for s4 in codfw, slaves will be moved under db2051 - T170351
Mentioned in SAL (#wikimedia-operations) [2017-08-03T13:26:49Z] <marostegui> Starting the actual s4 codfw failover db2019 -> db2051 - T170351
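The topology change logged above boils down to stopping replication on each s4 codfw replica, disabling GTID, and repointing it at db2051. A dry-run sketch that only prints the statements it would run (the replica names and binlog coordinates here are placeholders for illustration, not the real ones used):

```shell
# Dry run: emit the per-replica repointing statements (MariaDB syntax).
new_master='db2051.codfw.wmnet'
binlog='db2051-bin.000001'   # placeholder binlog file
pos=4                        # placeholder binlog position
for replica in db2037 db2044 db2058; do   # illustrative replica names
  printf -- "-- on %s:\n" "$replica"
  printf "STOP SLAVE;\n"
  printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s, MASTER_USE_GTID=no;\n" \
    "$new_master" "$binlog" "$pos"
  printf "START SLAVE;\n"
done
```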
Change 369880 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db2051 as the new s4 codfw master
Change 369879 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Promote db2051 to master
Mentioned in SAL (#wikimedia-operations) [2017-08-03T13:35:18Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Promote db2051 as s4 codfw master - T170351 (duration: 00m 46s)
Change 369877 merged by jenkins-bot:
[operations/software@master] s4.hosts: db2051 is now s4 codfw master
Mentioned in SAL (#wikimedia-operations) [2017-08-03T13:44:36Z] <marostegui> Enable gtid back on codfw s4 slaves - T170351
db2019 has been failed over to db2051.
db2051 is now the master.
We will see if replication improves (it already did a bit when we replaced db2019's faulty disk): https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2019&from=1498890025720&to=1501797189967&panelId=6&fullscreen
I have taken some notes during the failover that I will copy to wikitech for future reference (merely as a checklist).
dbstore2001 still hangs from db2019 and will remain like that until it is rebuilt soon as part of T168409.