
db2042 (m3) master RAID battery failed
Closed, Duplicate · Public

Description

root@db2042:~$ hpssacli ctrl slot=0 show detail
   Cache Status Details: Cache disabled; battery/capacitor failed to charge to an acceptable level
   Cache Ratio: 10% Read / 90% Write
   Drive Write Cache: Disabled
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: Failed (Replace Batteries)

root@db2042:~$ hpssacli controller all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)

Event Timeline

jcrespo created this task. · Aug 16 2018, 10:55 AM
Restricted Application added a project: Operations. · Aug 16 2018, 10:55 AM
Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2018-08-16T11:12:21Z] <jynus> stopping db2042 for maintenance T202051

jcrespo closed this task as Resolved. · Aug 16 2018, 11:45 AM
jcrespo claimed this task.

Solved with a reboot; let's reopen if it happens again after some time. CC @Marostegui @Papaul.

This server started to recharge its BBU again:

WARNING: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Cache: Temporarily Disabled - Battery/Capacitor: Recharging
Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Recharging
Marostegui added a comment. · Edited · Aug 29 2018, 5:51 AM

db2042 keeps recharging. Even if the BBU eventually fails, I will leave it as it is and not replace the BBU (we only have one spare, which is the one from db2064).
The reason I wouldn't use the BBU from db2064 in db2042 is that I would keep it for db2033 (T201757#4526985 and T184888), just in case.

db2033 is x1, meaning it will serve reads once codfw is active, whereas db2042 is misc and doesn't serve reads. It is not even getting delayed with the controller set to WriteThrough, so it can cope with the load fine (a minimal lag check is sketched below).
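For context, "getting delayed" here means MariaDB replication lag. A minimal sketch of how to check it on the replica, assuming a local mysql client that can authenticate (e.g. via /root/.my.cnf):

# Report how far this replica is behind its master, in seconds.
mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master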

jcrespo reopened this task as Open. · Aug 29 2018, 7:36 AM

We can reboot it again (it worked last time), at least as a short-term measure.

It is still recharging; it has not failed yet.

This is the HW log from the first time the battery failed (Aug 16):

description=POST Error: 1705-Slot X Drive Array - Please replace Cache Module Super-Cap. Caching will be enabled once Super-Cap has been replaced and charged

I will reboot it and we'll see what happens; a sketch of the clean-shutdown sequence follows.
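Rebooting a database host like this typically means stopping replication and shutting MySQL down cleanly first. A hypothetical sequence, not taken verbatim from this task:

# Stop replication, shut MySQL down cleanly, then reboot.
mysql -e "STOP SLAVE"
systemctl stop mariadb    # the mariadb unit name is an assumption
reboot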

Mentioned in SAL (#wikimedia-operations) [2018-08-29T07:47:50Z] <marostegui> Reboot db2042 - T202051

After the reboot the battery has finally marked itself as failed:

date=08/29/2018
time=07:54
description=POST Error: 1705-Slot X Drive Array - Please replace Cache Module Super-Cap. Caching will be enabled once Super-Cap has been replaced and charged
root@db2042:~# hpssacli controller all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)

I suggest not doing anything with this host, as per T202051#4541027.
Once we've got the new x1 host, we can move db2033 and replace db2042, or use db2064's BBU for this host.

Marostegui added a parent task: Unknown Object (Task). · Aug 29 2018, 8:04 AM

Mentioned in SAL (#wikimedia-operations) [2018-08-29T08:13:59Z] <marostegui> Force WriteBack policy on db2042 T202051

I have forced write-back behaviour on db2042 (by enabling the drive write cache) to let the server catch up:

root@db2042:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled

root@db2042:~# hpssacli ctrl slot=0 modify dwc=enable

root@db2042:~#  hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Enabled

And once it caught up, I reverted it:

root@db2042:~#  hpssacli ctrl slot=0 modify dwc=disable
root@db2042:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled
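Note that dwc= toggles the physical drives' own write caches, which is distinct from the controller's (now permanently disabled) battery-backed cache. A quick way to review both states in one go (a sketch; exact field names vary by controller firmware):

# Controller-level cache and battery state:
hpssacli ctrl slot=0 show detail | grep -iE "cache|battery"
# Per-logical-drive caching setting:
hpssacli ctrl slot=0 ld all show detail | grep -i caching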
Marostegui moved this task from Triage to In progress on the DBA board. · Aug 29 2018, 10:03 AM

Mentioned in SAL (#wikimedia-operations) [2018-08-29T16:49:11Z] <marostegui> Force RAID controller to WB policy T202051

I have forced db2042 to be WB again, as it was lagging too far behind:

16:45 < icinga-wm> PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 329.98 seconds
16:45 < icinga-wm> PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 335.53 seconds
root@db2042:~# hpssacli ctrl slot=0 modify dwc=enable

root@db2042:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Enabled
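Running with the drive write cache enabled and no working BBU risks losing in-flight writes on power failure, so the lag is worth watching while the host catches up. A minimal, hypothetical watch loop (same local mysql access assumed as above):

# Print the replication lag every 30 seconds until interrupted.
while true; do
    mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
    sleep 30
done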
Marostegui changed the task status from Open to Stalled. · Sep 3 2018, 8:46 AM
Marostegui triaged this task as Medium priority.

I am marking this as Stalled. If no one objects, I think we should proceed with T202051#4541285, leaving the RAID controller with WB enforced.

jcrespo removed jcrespo as the assignee of this task. · Sep 18 2018, 11:03 AM
Marostegui renamed this task from db2042 RAID battery failed to db2042 (m3) master RAID battery failed. · Sep 23 2018, 1:40 PM

Mentioned in SAL (#wikimedia-operations) [2018-10-18T14:21:27Z] <banyek> shutting down mysql and powering down db2042 (T202051)

Banyek closed this task as Resolved. · Oct 18 2018, 3:59 PM
Banyek claimed this task.

@Papaul did a power drain, which fixed the battery status.
We tried our spare battery in this host as well (T205257), but it didn't work here either. We can say we don't have a spare BBU.
The host is back in action and replicating, as is db2078 (which is a slave of db2042).

If it fails again, I suggest we go for a DC failover.

Volans added a subscriber: Volans. · Oct 18 2018, 8:25 PM

db2042 failed to start ferm at reboot due to a DNS query timeout:

Oct 18 15:53:04 db2042 ferm[837]: DNS query for 'prometheus2003.codfw.wmnet' failed: query timed out
[...SNIP...]
Oct 18 15:53:04 db2042 systemd[1]: Failed to start ferm firewall configuration.

Apparently the two Icinga checks that report this were not noticed, probably because the host was downtimed for the scheduled maintenance.
I've manually started ferm and it all worked fine, but the host had been without ferm since the reboot (the manual fix is sketched below).
I'm opening a separate task to fix the puppet/systemd side of it.
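For reference, the manual recovery amounts to starting the unit and confirming the rules actually loaded, roughly:

# Start ferm by hand and check the unit state.
systemctl start ferm
systemctl status ferm
# Confirm the firewall rules are in place.
iptables -L -n | head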

Opened T207417 for the ferm part.

This failed again and I have created T202051 to track it.

jcrespo reopened this task as Open. · Jan 10 2019, 11:53 AM

Leaving it open and acking it on icinga so we don't forget about it.

Let's merge T209261 into this task then, or the other way around.

jcrespo added a comment. · Edited · Jan 10 2019, 12:54 PM

Sorry, I searched but didn't find the other one; in your comment above you probably meant that task but linked to this one by mistake. I am OK with either method, as long as at least one task stays open.

My only request is to keep "db2042", "RAID" and "battery" in the title so I can find it in the future. :-D

I have merged this into T209261 as that other one has a more "important" title so we don't forget! :)