db2019 has performance issues, replace disk or switchover s4 master elsewhere
Closed, Resolved · Public

Details

Related Gerrit Patches:
[operations/software@master] s4.hosts: db2051 is now s4 codfw master
[operations/mediawiki-config@master] db-codfw.php: Promote db2051 to master
[operations/puppet@production] mariadb: Promote db2051 as the new s4 codfw master
[operations/mediawiki-config@master] db-eqiad.php: Restore db2051 original values
[operations/mediawiki-config@master] db-codfw.php: Depool db2051
[operations/puppet@production] db2051.yaml: Update its socket location

Event Timeline

jcrespo created this task. Jul 11 2017, 9:54 PM
Restricted Application added a subscriber: Aklapper. Jul 11 2017, 9:54 PM

It might be easier just to replace the disk even if this host will go away at some point.
@Papaul do you have spare disks?

Marostegui moved this task from Triage to Next on the DBA board. Jul 12 2017, 7:37 AM

And the disk finally failed (note that slot 8 no longer appears in the list below). The failure was automatically detected in T170503:

root@db2019:~# megacli -PDList -aall  | grep Slot
Slot Number: 0
Slot Number: 1
Slot Number: 2
Slot Number: 3
Slot Number: 4
Slot Number: 5
Slot Number: 6
Slot Number: 7
Slot Number: 9
Slot Number: 10
Slot Number: 11
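
The gap in the listing above (slot 8) can also be spotted programmatically. A minimal sketch, assuming slots are numbered contiguously from 0 and that the output contains `Slot Number: N` lines as shown; the function name `missing_slots` and its `expected` parameter are hypothetical and not part of any tooling mentioned in this task:

```python
import re


def missing_slots(megacli_output, expected=None):
    """Return slot numbers absent from `megacli -PDList` style output.

    Assumes slots are numbered contiguously from 0. If `expected` (the
    total number of slots) is not given, the highest slot seen is used
    as the upper bound, so a missing last slot would go unnoticed.
    """
    seen = {
        int(n)
        for n in re.findall(
            r"^Slot Number:\s*(\d+)\s*$", megacli_output, flags=re.MULTILINE
        )
    }
    if not seen and expected is None:
        return []
    upper = expected if expected is not None else max(seen) + 1
    return [slot for slot in range(upper) if slot not in seen]


# Slot numbers as listed in the db2019 output above.
output = "\n".join(
    f"Slot Number: {n}" for n in [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11]
)
print(missing_slots(output))  # [8]
```

Passing `expected` explicitly is the safer choice when the enclosure size is known, since the inferred upper bound cannot reveal a failure of the highest-numbered slot.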

The disk was replaced and the RAID is back to optimal (T170503#3436419). Let's see if it has any effect on this issue in the next few days.

As part of T170662 we will probably switch over the s4 master in codfw, so this will get resolved.
I was thinking about either db2051 or db2065 to replace the current master.

db2051 is probably the right one, but first we need to spread that hardware batch, as you suggested.

Change 369626 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2051

https://gerrit.wikimedia.org/r/369626

Change 369626 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2051

https://gerrit.wikimedia.org/r/369626

Mentioned in SAL (#wikimedia-operations) [2017-08-02T11:19:37Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2051 - T170351 (duration: 00m 46s)

Change 369633 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2051.yaml: Update its socket location

https://gerrit.wikimedia.org/r/369633

Mentioned in SAL (#wikimedia-operations) [2017-08-02T11:33:43Z] <marostegui> Stop MySQL on db2051 for maintenance - T170351

Change 369633 merged by Marostegui:
[operations/puppet@production] db2051.yaml: Update its socket location

https://gerrit.wikimedia.org/r/369633

Mentioned in SAL (#wikimedia-operations) [2017-08-02T14:12:08Z] <marostegui> Stop MySQL on db2051 in order to get it ready to move to another rack - T170351

db2051 has been moved, and it is now replicating again from its new location.
Thanks @Papaul!

Change 369842 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Restore db2051 original values

https://gerrit.wikimedia.org/r/369842

Change 369842 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Restore db2051 original values

https://gerrit.wikimedia.org/r/369842

Mentioned in SAL (#wikimedia-operations) [2017-08-03T05:57:15Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool db2051 - T170351 (duration: 00m 54s)

Change 369877 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s4.hosts: db2051 is now s4 codfw master

https://gerrit.wikimedia.org/r/369877

Change 369879 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Promote db2051 to master

https://gerrit.wikimedia.org/r/369879

Change 369880 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db2051 as the new s4 codfw master

https://gerrit.wikimedia.org/r/369880

Mentioned in SAL (#wikimedia-operations) [2017-08-03T12:15:53Z] <marostegui> Restart MySQL on db2051 - T170351

Mentioned in SAL (#wikimedia-operations) [2017-08-03T13:01:29Z] <marostegui> Disable gtid on s4 codfw slaves to get ready for the topology change - T170351

Mentioned in SAL (#wikimedia-operations) [2017-08-03T13:14:12Z] <marostegui> Start topology change for s4 in codfw, slaves will be moved under db2051 - T170351

Mentioned in SAL (#wikimedia-operations) [2017-08-03T13:26:49Z] <marostegui> Starting the actual s4 codfw failover db2019 -> db2051 - T170351

Change 369880 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db2051 as the new s4 codfw master

https://gerrit.wikimedia.org/r/369880

Change 369879 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Promote db2051 to master

https://gerrit.wikimedia.org/r/369879

Mentioned in SAL (#wikimedia-operations) [2017-08-03T13:35:18Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Promote db2051 as s4 codfw master - T170351 (duration: 00m 46s)
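
For the checklist notes, the time between two SAL entries can be computed from their timestamps. A minimal sketch, assuming the `[YYYY-MM-DDTHH:MM:SSZ]` format seen in the entries above; the helper name `sal_time` is hypothetical. The result is only the gap between the two log entries, not a measurement of write unavailability:

```python
import re
from datetime import datetime

# Matches the bracketed UTC timestamp at the start of a SAL entry.
SAL_TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\]")


def sal_time(entry):
    """Extract the timestamp of a SAL entry as a naive UTC datetime."""
    match = SAL_TIMESTAMP.search(entry)
    if match is None:
        raise ValueError("no SAL timestamp found")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")


start = sal_time(
    "[2017-08-03T13:26:49Z] <marostegui> Starting the actual s4 codfw "
    "failover db2019 -> db2051 - T170351"
)
end = sal_time(
    "[2017-08-03T13:35:18Z] <marostegui@tin> Synchronized "
    "wmf-config/db-codfw.php: Promote db2051 as s4 codfw master - T170351 "
    "(duration: 00m 46s)"
)
elapsed = end - start
print(elapsed)  # 0:08:29
```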

Change 369877 merged by jenkins-bot:
[operations/software@master] s4.hosts: db2051 is now s4 codfw master

https://gerrit.wikimedia.org/r/369877

Mentioned in SAL (#wikimedia-operations) [2017-08-03T13:44:36Z] <marostegui> Enable gtid back on codfw s4 slaves - T170351
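
The task does not record the exact statements used to disable and re-enable GTID on the replicas. As a hedged sketch only, the following generates the standard MariaDB statements for toggling GTID-based replication on a replica; the function name `gtid_statements` is hypothetical, and the choice of `slave_pos` (rather than `current_pos`) when re-enabling is an assumption:

```python
def gtid_statements(enable):
    """Return MariaDB statements to toggle GTID replication on a replica.

    Assumption: `slave_pos` is used when re-enabling; the task does not
    say which GTID mode the s4 codfw replicas were configured with.
    """
    mode = "slave_pos" if enable else "no"
    return [
        "STOP SLAVE;",
        f"CHANGE MASTER TO MASTER_USE_GTID = {mode};",
        "START SLAVE;",
    ]


# Before the topology change: fall back to binlog file/position replication.
print("\n".join(gtid_statements(enable=False)))

# After the failover: switch the replicas back to GTID.
print("\n".join(gtid_statements(enable=True)))
```

Replication has to be stopped around `CHANGE MASTER TO`, which is why each toggle is wrapped in `STOP SLAVE` and `START SLAVE`.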

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Next to In progress on the DBA board.

db2019 has been failed over to db2051.
db2051 is now the master.

We will see whether replication improves (it already did a bit when we replaced db2019's faulty disk): https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2019&from=1498890025720&to=1501797189967&panelId=6&fullscreen

I have taken some notes during the failover that I will copy to wikitech for future reference (merely as a checklist).

dbstore2001 still hangs from db2019 and will remain like that, as it is going to be rebuilt soon as part of T168409.

Marostegui closed this task as Resolved. Aug 4 2017, 8:22 AM

Resolving this; the host will be decommissioned once we have finished with T162593.