
Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master))
Closed, Resolved · Public

Description

db2033 has a failed BBU.
This host is the codfw x1 master.

root@db2033:~# hpssacli controller all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)

This host has been out of warranty since December. @Papaul, do you have any used BBU from a decommissioned host or something else that can be used to replace this one?
Thanks

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui triaged this task as Medium priority. Jan 15 2018, 6:16 AM
Marostegui moved this task from Triage to In progress on the DBA board.

@Marostegui sorry, but we don't have any used BBU from a decommissioned host that we can use (we have no decommissioned HP servers).

Thanks @Papaul - I have checked the hosts that will soon be decommissioned and none of them are HP.
@RobH any ideas on what we can do about this?

The server kept lagging.
I have forced the controller to go to WriteBack temporarily till we decide how to proceed with this host.

root@db2033:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled
root@db2033:~# hpssacli ctrl slot=0 modify dwc=enable
root@db2033:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Enabled

The server is now quickly catching up.

To go back to the previous state (no BBU, so write-through): hpssacli ctrl slot=0 modify dwc=disable
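For reference, the full revert spelled out with a verification step, using the same hpssacli invocations as above (slot 0 assumed, as on this controller):

# Turn the forced drive write cache off again (back to write-through):
root@db2033:~# hpssacli ctrl slot=0 modify dwc=disable
# Verify it took effect:
root@db2033:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled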

I suggest we decommission the host. The RAID controller BBU is now bad, we have no decommissioned RAID controllers to pull a BBU from, and the system is out of warranty.

Having one system run a different config (different writeback setting) than the rest seems non-ideal.

This host was scheduled for replacement in July 2019 though.
Crazy idea: maybe there are some spare BBUs in eqiad?

We don't keep spare hardware for in-warranty systems. The only chance of having a BBU that works is if we've decommissioned other HPs.

@Cmjohnson can you check if you have any decommissioned Gen8 HPs sitting in storage? If you do, we could use the RAID controller BBU off of one. Please let us know!

@RobH no I do not have any spares at this time.

Then we should probably switch over x1 codfw master to another host.

I'd advise doing whatever is needed to put db2033 on the list for immediate decommission. Please let me know when assistance is needed from the DC ops side!

Marostegui renamed this task from Failed BBU on db2033 (x1 master) to Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)). Jan 17 2018, 6:53 AM
Marostegui removed Papaul as the assignee of this task.
Marostegui removed projects: ops-codfw, SRE.
Marostegui removed subscribers: Cmjohnson, Papaul.

I would suggest we remove db2034 from the s1 codfw rc service, reimage it, and make it the new x1 master.

Change 404943 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Remove db2034

https://gerrit.wikimedia.org/r/404943

Change 404943 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Remove db2034 from s1

https://gerrit.wikimedia.org/r/404943

Mentioned in SAL (#wikimedia-operations) [2018-01-18T10:29:27Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Remove db2034 from s1 as it will be in x1 - T184888 (duration: 01m 12s)

Change 404955 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow db2034 reinstall as stretch

https://gerrit.wikimedia.org/r/404955

Change 404955 merged by Marostegui:
[operations/puppet@production] install_server: Allow db2034 reinstall as stretch

https://gerrit.wikimedia.org/r/404955

Change 405246 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Move db2034 from s1 to x1 master

https://gerrit.wikimedia.org/r/405246

Change 405246 merged by Marostegui:
[operations/puppet@production] mariadb: Move db2034 from s1 to x1 master

https://gerrit.wikimedia.org/r/405246

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db2034.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801190641_marostegui_4986.log.

Completed auto-reimage of hosts:

['db2034.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2018-01-19T07:11:33Z] <marostegui> Stop x1 on dbstore2002 to copy its content to db2034 - T184888
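The SAL entry doesn't record how the copy was done. As a rough, generic sketch only: one common way is to stream a compressed tar of the data over netcat while the donor instance is stopped. The datadir path, port and scope below are illustrative assumptions, not necessarily what was used here:

# On db2034 (receiver), with an empty datadir, listen and unpack
# (netcat flags vary by implementation):
root@db2034:~# nc -l -p 4444 | tar -C /srv/sqldata -xzf -
# On dbstore2002 (donor), with the x1 data quiesced, stream it over;
# "." is purely illustrative, only the x1-related databases would be needed:
root@dbstore2002:~# tar -C /srv/sqldata -czf - . | nc db2034.codfw.wmnet 4444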

Change 405255 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s1,x1.hosts: Move db2034 from s1 to x1

https://gerrit.wikimedia.org/r/405255

db2034 is now replicating in x1 (stretch + MariaDB 10.1.30).
The actual codfw x1 failover is still pending. Probably worth waiting a few days to make sure db2034 has no issues.

Change 405255 merged by jenkins-bot:
[operations/software@master] s1,x1.hosts: Move db2034 from s1 to x1

https://gerrit.wikimedia.org/r/405255

Mentioned in SAL (#wikimedia-operations) [2018-01-22T12:01:06Z] <marostegui> Change x1 codfw topology: db2034 to replicate from eqiad T184888

db2034 is now replicating directly from the eqiad master. It has no slaves hanging from it yet (that will be done once we do the failover).

I will try to move the dbstore codfw servers under db2034 and make it the master next week.
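For context, moving a replica under db2034 boils down to the standard MariaDB CHANGE MASTER dance once that replica and db2034 sit at the same position relative to the eqiad master. A minimal sketch only: the 'x1' connection name on the multi-source dbstore hosts, the repl user and the binlog coordinates are placeholders/assumptions, not taken from this task:

# On db2034, note the binlog position the new replicas should attach at:
root@db2034:~# mysql -e "SHOW MASTER STATUS;"
# On dbstore2001/dbstore2002, once their x1 data is consistent with that position:
root@dbstore2001:~# mysql -e "STOP SLAVE 'x1';"
root@dbstore2001:~# mysql -e "CHANGE MASTER 'x1' TO MASTER_HOST='db2034.codfw.wmnet', MASTER_USER='repl', MASTER_PASSWORD='...', MASTER_LOG_FILE='db2034-bin.000123', MASTER_LOG_POS=4;"
root@dbstore2001:~# mysql -e "START SLAVE 'x1';"
# Check the connection is replicating again:
root@dbstore2001:~# mysql -e "SHOW SLAVE 'x1' STATUS\G"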


What is pending here is:

  • Move dbstore2001 and dbstore2002 under db2034
  • Convert db2034 to master role in puppet

That should be it.
We can then move db2033 under db2034.

Change 412629 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2034.yaml: Add role master to db2034

https://gerrit.wikimedia.org/r/412629

Mentioned in SAL (#wikimedia-operations) [2018-02-19T07:42:44Z] <marostegui> Change topology on x1 codfw - T184888

Change 412629 merged by Marostegui:
[operations/puppet@production] db2034.yaml: Add role master to db2034

https://gerrit.wikimedia.org/r/412629

Change 412633 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Change x1 codfw master

https://gerrit.wikimedia.org/r/412633

Change 412633 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Change x1 codfw master

https://gerrit.wikimedia.org/r/412633

Change 412635 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2033.yaml: Remove master role from db2033

https://gerrit.wikimedia.org/r/412635

Mentioned in SAL (#wikimedia-operations) [2018-02-19T08:02:17Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Promote db2034 to x1 codfw master - T184888 (duration: 00m 56s)

Change 412635 merged by Marostegui:
[operations/puppet@production] db2033.yaml: Remove master role from db2033

https://gerrit.wikimedia.org/r/412635

Change 412637 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mysql-core_codfw: Add master/slave x1 codfw

https://gerrit.wikimedia.org/r/412637

Change 412637 merged by Marostegui:
[operations/puppet@production] mysql-core_codfw: Add master/slave x1 codfw

https://gerrit.wikimedia.org/r/412637

The failover is now done.
db2033 is now an x1 slave, along with dbstore2001 and dbstore2002.
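For the record, a quick way to sanity-check a topology change like this with standard MariaDB commands (hostnames as in this task; SHOW SLAVE HOSTS only lists replicas that set report_host):

# On the new master, list the attached replicas:
root@db2034:~# mysql -e "SHOW SLAVE HOSTS;"
# On each replica, confirm it points at db2034 and is not lagging
# (on the multi-source dbstore hosts, add the connection name, e.g. SHOW SLAVE 'x1' STATUS):
root@db2033:~# mysql -e "SHOW SLAVE STATUS\G" | grep -E "Master_Host|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"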

Marostegui claimed this task.

@Papaul do you think we can use the BBU from db2064 (T195228) on this host?
I am starting to worry that the new x1 host (T199501) will not arrive in time for the DC failover.

And this got fixed by itself:

root@db2033:~# hpssacli controller all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

Mentioned in SAL (#wikimedia-operations) [2018-10-20T05:38:20Z] <marostegui> Force writeback on db2033 - T184888

And this failed again after a couple of months:

root@db2033:~# hpssacli controller all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)

I have forced the cache to be write-back again:

root@db2033:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled
root@db2033:~# hpssacli ctrl slot=0 modify dwc=enable
root@db2033:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Enabled

Change 468914 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Clarify db2033 BBU status

https://gerrit.wikimedia.org/r/468914

Change 468914 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Clarify db2033 BBU status

https://gerrit.wikimedia.org/r/468914