
BBU issues on codfw
Closed, Resolved (Public)

Description

This is a tracking task for all the BBU issues we have on codfw.
All these hosts are out of warranty, so we will probably have to replace them with new hardware rather than buy new BBUs.

  • Handed over for decommission T220002: dbstore2002, temporary backup host (T205257)
  • Handed over for decommission T225090: db2042, m3 master (T209261)
  • Handed over for decommission T224079: db2040, old s7 master (replaced after its BBU failed: T214264#4897163)
  • Handed over for decommission T220070: db2033, x1 slave (T184888)

If needed:

  • Force write-back:
hpssacli controller all show detail | grep "Drive Write Cache"   # show the current Drive Write Cache setting
hpssacli ctrl slot=0 modify dwc=enable                           # enable drive write cache on the slot 0 controller
hpssacli controller all show detail | grep "Drive Write Cache"   # confirm the new setting
  • Go back to the default policy:
hpssacli ctrl slot=0 modify dwc=disable
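
Before forcing a policy change, it can help to confirm the controller, cache and battery state first. A minimal sketch, assuming the controller sits in slot 0 as in the commands above:

hpssacli controller all show status                           # controller, cache and battery status at a glance
hpssacli ctrl slot=0 show detail | grep -iE "battery|cache"   # BBU and cache details for slot 0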

Event Timeline

Marostegui triaged this task as Medium priority.Jan 20 2019, 2:57 PM
Marostegui moved this task from Triage to Meta/Epic on the DBA board.
Marostegui added subscribers: Banyek, Stashbot, Papaul.
Marostegui added a subscriber: Volans.
Marostegui removed subscribers: Volans, Papaul, Banyek, Marostegui.

Mentioned in SAL (#wikimedia-operations) [2019-01-20T15:13:17Z] <marostegui> Force WriteBack on db2040 - T214264

db2040 was really delayed. I checked that the actor migration script finished yesterday, so the lag is probably not caused by that but by the failed BBU and the cache policy reverting to WriteThru.
I have forced WriteBack there to let it catch up.
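
For reference, a quick way to confirm both the replication delay and the current cache policy (same tools as used elsewhere in this task; db2040 taken as the affected host):

mysql.py -h db2040 -e "show slave status\G" | grep Seconds_Behind_Master   # replication delay in seconds
hpssacli controller all show detail | grep "Drive Write Cache"             # current write cache policy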

db2040 caught up, but once I switched back to WriteThru it was no longer able to keep up with replication.
So I have forced WriteBack again; let's promote db2047 to master and make db2040 a replica.

jcrespo moved this task from Meta/Epic to In progress on the DBA board.

Change 485617 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Promote db2047 to master on configuration management

https://gerrit.wikimedia.org/r/485617

Mentioned in SAL (#wikimedia-operations) [2019-01-21T17:44:33Z] <jynus> stop replication on db2040 for master switch T214264

Change 485617 merged by Jcrespo:
[operations/puppet@production] mariadb: Promote db2047 to master on configuration management

https://gerrit.wikimedia.org/r/485617

Mentioned in SAL (#wikimedia-operations) [2019-01-21T17:51:33Z] <jynus> stop and apply puppet changes to db2047 T214264

Change 485696 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db2040, promote db2047 to master of s7 section

https://gerrit.wikimedia.org/r/485696

Change 485696 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db2040, promote db2047 to master of s7 section

https://gerrit.wikimedia.org/r/485696

Change 485698 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db2040 after maintenance

https://gerrit.wikimedia.org/r/485698

Change 485701 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Demote db2040 from being an s7 master to just a replica

https://gerrit.wikimedia.org/r/485701

Change 485701 merged by Jcrespo:
[operations/puppet@production] mariadb: Demote db2040 from being an s7 master to just a replica

https://gerrit.wikimedia.org/r/485701

Mentioned in SAL (#wikimedia-operations) [2019-01-21T19:23:23Z] <jynus> mysql.py -h db1115 zarcillo -e "UPDATE masters SET instance = 'db2047' WHERE section = 's7' and dc = 'codfw'" T214264
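
To double-check the zarcillo metadata after that update, a query like the following should now return db2047 (a sketch based on the same table and columns as the UPDATE above):

mysql.py -h db1115 zarcillo -e "SELECT instance FROM masters WHERE section = 's7' AND dc = 'codfw'"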

root@cumin2001:~$ ./software/dbtools/section s7 | while read instance; do echo "$instance:"; mysql.py -h $instance -e "show slave status\G" | grep 'Using_Gtid:'; done    
labsdb1011:
labsdb1010:
labsdb1009:
dbstore2001:3317:
                   Using_Gtid: Slave_Pos
dbstore1002:
db2095:3317:
                   Using_Gtid: Slave_Pos
db2087:3317:
                   Using_Gtid: Slave_Pos
db2086:3317:
                   Using_Gtid: Slave_Pos
db2077:
                   Using_Gtid: Slave_Pos
db2068:
                   Using_Gtid: Slave_Pos
db2061:
                   Using_Gtid: Slave_Pos
db2054:
                   Using_Gtid: Slave_Pos
db2040:
                   Using_Gtid: Slave_Pos
db1125:3317:
                   Using_Gtid: Slave_Pos
db1116:3317:
                   Using_Gtid: Slave_Pos
db1101:3317:
                   Using_Gtid: Slave_Pos
db1098:3317:
                   Using_Gtid: Slave_Pos
db1094:
                   Using_Gtid: Slave_Pos
db1090:3317:
                   Using_Gtid: Slave_Pos
db1086:
                   Using_Gtid: Slave_Pos
db1079:
                   Using_Gtid: Slave_Pos
db2047:
                   Using_Gtid: Slave_Pos
db1062:

I will leave semi-sync "as is" (5 active clients, the non-SSD ones).
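
If anyone wants to verify the semi-sync state on the new master, a possible check (db2047 assumed; Rpl_semi_sync_master_clients is the standard MariaDB status variable for the number of connected semi-sync replicas):

mysql.py -h db2047 -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_clients'"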

Change 485698 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db2040 after maintenance

https://gerrit.wikimedia.org/r/485698

jcrespo moved this task from In progress to Blocked external/Not db team on the DBA board.
Marostegui claimed this task.
Marostegui updated the task description.

All these hosts have been handed over to DCOPs for decommissioning. Closing this.