
Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master))
Closed, Resolved · Public

Description

db2033 has a failed BBU.
This host is the codfw x1 master.

root@db2033:~# hpssacli controller all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)

This host has been out of warranty since December. @Papaul, do you have any used BBU from a decommissioned host or something else that can be used to replace this one?
Thanks

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui triaged this task as Medium priority. Jan 15 2018, 6:16 AM
Marostegui moved this task from Triage to In progress on the DBA board.

@Marostegui sorry, but we don't have any used BBU from a decommissioned host that we can use (we have no decommissioned HP servers).

Thanks @Papaul - I have checked the hosts that will soon be decommissioned and none of them are HP.
@RobH any ideas on what we can do about this?

The server kept lagging.
I have forced the controller to go to WriteBack temporarily till we decide how to proceed with this host.

root@db2033:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled
root@db2033:~# hpssacli ctrl slot=0 modify dwc=enable
root@db2033:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Enabled

The server is now quickly catching up.

To go back to the previous state (no BBU, so write-through): hpssacli ctrl slot=0 modify dwc=disable
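For reference, the full revert spelled out with a verification step, using the same hpssacli invocations as above (slot 0 assumed, as on this controller):

# Turn the forced drive write cache off again (back to write-through):
root@db2033:~# hpssacli ctrl slot=0 modify dwc=disable
# Verify it took effect:
root@db2033:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled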

I suggest we decommission the host. The RAID controller BBU is now bad, we have no decommissioned RAID controllers to pull a BBU from, and the system is out of warranty.

Having one system run a different config (different writeback setting) than the rest seems non-ideal.

This host was scheduled for replacement in July 2019 though.
Crazy idea: maybe there are some spare BBUs in eqiad?

We don't keep spare hardware for in-warranty systems. The only chance of having a BBU that works is if we've decommissioned other HPs.

@Cmjohnson can you check if you have any decommissioned Gen8 HPs sitting in storage? If you do, we could use the RAID controller BBU off of one. Please let us know!

@RobH no I do not have any spares at this time.

Then we should probably switch over x1 codfw master to another host.

I'd advise doing whatever is needed to put db2033 on the list for immediate decommission. Please let me know when assistance is needed from the DC ops side!

Marostegui renamed this task from Failed BBU on db2033 (x1 master) to Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)). Jan 17 2018, 6:53 AM
Marostegui removed Papaul as the assignee of this task.
Marostegui removed projects: ops-codfw, SRE.
Marostegui removed subscribers: Cmjohnson, Papaul.

I would suggest we remove db2034 from the s1 codfw rc service, reimage it, and make it the new x1 master.

Change 404943 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Remove db2034

https://gerrit.wikimedia.org/r/404943

Change 404943 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Remove db2034 from s1

https://gerrit.wikimedia.org/r/404943

Mentioned in SAL (#wikimedia-operations) [2018-01-18T10:29:27Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Remove db2034 from s1 as it will be in x1 - T184888 (duration: 01m 12s)

Change 404955 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow db2034 reinstall as stretch

https://gerrit.wikimedia.org/r/404955

Change 404955 merged by Marostegui:
[operations/puppet@production] install_server: Allow db2034 reinstall as stretch

https://gerrit.wikimedia.org/r/404955

Change 405246 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Move db2034 from s1 to x1 master

https://gerrit.wikimedia.org/r/405246

Change 405246 merged by Marostegui:
[operations/puppet@production] mariadb: Move db2034 from s1 to x1 master

https://gerrit.wikimedia.org/r/405246

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db2034.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801190641_marostegui_4986.log.

Completed auto-reimage of hosts:

['db2034.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2018-01-19T07:11:33Z] <marostegui> Stop x1 on dbstore2002 to copy its content to db2034 - T184888
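The SAL entry doesn't record how the copy was done. As a rough, generic sketch only: one common way is to stream a compressed tar of the data over netcat while the donor instance is stopped. The datadir path, port and scope below are illustrative assumptions, not necessarily what was used here:

# On db2034 (receiver), with an empty datadir, listen and unpack
# (netcat flags vary by implementation):
root@db2034:~# nc -l -p 4444 | tar -C /srv/sqldata -xzf -
# On dbstore2002 (donor), with the x1 data quiesced, stream it over;
# "." is purely illustrative, only the x1-related databases would be needed:
root@dbstore2002:~# tar -C /srv/sqldata -czf - . | nc db2034.codfw.wmnet 4444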

Change 405255 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s1,x1.hosts: Move db2034 from s1 to x1

https://gerrit.wikimedia.org/r/405255

db2034 is now replicating in x1 (stretch + MariaDB 10.1.30).
The actual codfw x1 failover is still pending. Probably worth waiting a few days to make sure db2034 has no issues.

Change 405255 merged by jenkins-bot:
[operations/software@master] s1,x1.hosts: Move db2034 from s1 to x1

https://gerrit.wikimedia.org/r/405255

Mentioned in SAL (#wikimedia-operations) [2018-01-22T12:01:06Z] <marostegui> Change x1 codfw topology: db2034 to replicate from eqiad T184888

db2034 is now replicating directly from the eqiad master. It has no slaves hanging from it yet (that will be done once we do the failover).

I will try to move the dbstore codfw servers under db2034 and make it the master next week.
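For context, moving a replica under db2034 boils down to the standard MariaDB CHANGE MASTER dance once that replica and db2034 sit at the same position relative to the eqiad master. A minimal sketch only: the 'x1' connection name on the multi-source dbstore hosts, the repl user and the binlog coordinates are placeholders/assumptions, not taken from this task:

# On db2034, note the binlog position the new replicas should attach at:
root@db2034:~# mysql -e "SHOW MASTER STATUS;"
# On dbstore2001/dbstore2002, once their x1 data is consistent with that position:
root@dbstore2001:~# mysql -e "STOP SLAVE 'x1';"
root@dbstore2001:~# mysql -e "CHANGE MASTER 'x1' TO MASTER_HOST='db2034.codfw.wmnet', MASTER_USER='repl', MASTER_PASSWORD='...', MASTER_LOG_FILE='db2034-bin.000123', MASTER_LOG_POS=4;"
root@dbstore2001:~# mysql -e "START SLAVE 'x1';"
# Check the connection is replicating again:
root@dbstore2001:~# mysql -e "SHOW SLAVE 'x1' STATUS\G"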


What is pending here is:

  • Move dbstore2001 and dbstore2002 under db2034
  • Convert db2034 to master role in puppet

That should be it.
We can then move db2033 under db2034.

Change 412629 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2034.yaml: Add role master to db2034

https://gerrit.wikimedia.org/r/412629

Mentioned in SAL (#wikimedia-operations) [2018-02-19T07:42:44Z] <marostegui> Change topology on x1 codfw - T184888

Change 412629 merged by Marostegui:
[operations/puppet@production] db2034.yaml: Add role master to db2034

https://gerrit.wikimedia.org/r/412629

Change 412633 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Change x1 codfw master

https://gerrit.wikimedia.org/r/412633

Change 412633 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Change x1 codfw master

https://gerrit.wikimedia.org/r/412633

Change 412635 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2033.yaml: Remove master role from db2033

https://gerrit.wikimedia.org/r/412635

Mentioned in SAL (#wikimedia-operations) [2018-02-19T08:02:17Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Promote db2034 to x1 codfw master - T184888 (duration: 00m 56s)

Change 412635 merged by Marostegui:
[operations/puppet@production] db2033.yaml: Remove master role from db2033

https://gerrit.wikimedia.org/r/412635

Change 412637 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mysql-core_codfw: Add master/slave x1 codfw

https://gerrit.wikimedia.org/r/412637

Change 412637 merged by Marostegui:
[operations/puppet@production] mysql-core_codfw: Add master/slave x1 codfw

https://gerrit.wikimedia.org/r/412637

The failover is now done.
db2033 is now an x1 slave, along with dbstore2001 and dbstore2002.
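For the record, a quick way to sanity-check a topology change like this with standard MariaDB commands (hostnames as in this task; SHOW SLAVE HOSTS only lists replicas that set report_host):

# On the new master, list the attached replicas:
root@db2034:~# mysql -e "SHOW SLAVE HOSTS;"
# On each replica, confirm it points at db2034 and is not lagging
# (on the multi-source dbstore hosts, add the connection name, e.g. SHOW SLAVE 'x1' STATUS):
root@db2033:~# mysql -e "SHOW SLAVE STATUS\G" | grep -E "Master_Host|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"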

Marostegui claimed this task.

@Papaul do you think we can use the BBU from db2064 (T195228) on this host?
I am starting to worry that the new x1 host (T199501) will not arrive in time for the DC failover.

And this got fixed by itself:

root@db2033:~# hpssacli controller all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

Mentioned in SAL (#wikimedia-operations) [2018-10-20T05:38:20Z] <marostegui> Force writeback on db2033 - T184888

And this failed again after a couple of months:

root@db2033:~# hpssacli controller all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)

I have forced the cache to be write-back again:

root@db2033:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled
root@db2033:~# hpssacli ctrl slot=0 modify dwc=enable
root@db2033:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Enabled

Change 468914 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Clarify db2033 BBU status

https://gerrit.wikimedia.org/r/468914

Change 468914 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Clarify db2033 BBU status

https://gerrit.wikimedia.org/r/468914