root@db2042:~$ hpssacli ctrl slot=0 show detail
   Cache Status Details: Cache disabled; battery/capacitor failed to charge to an acceptable level
   Cache Ratio: 10% Read / 90% Write
   Drive Write Cache: Disabled
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: Failed (Replace Batteries)

root@db2042:~$ hpssacli controller all show status
Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Banyek | T206593 Productionize db2096 on x1
Resolved | | Marostegui | T206191 rack/setup/install db2096 (x1 codfw expansion host)
 | | | Unknown Object (Task)
Duplicate | | Banyek | T202051 db2042 (m3) master RAID battery failed
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2018-08-16T11:12:21Z] <jynus> stopping db2042 for maintenance T202051
Solved with a reboot; let's reopen if it happens again after some time. CC @Marostegui @Papaul.
This server started to recharge its BBU again:
WARNING: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Cache: Temporarily Disabled - Battery/Capacitor: Recharging
Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Recharging
db2042 keeps recharging. Even if the BBU fails eventually, I will leave it as it is and not replace the BBU (we only have 1, which is the one from db2064).
The reason I wouldn't use the BBU from db2064 in db2042 is because I would keep it for db2033 (T201757#4526985 and T184888) just in case.
db2033 is x1, meaning it will get reads once codfw is active, whereas db2042 is misc and doesn't get reads; it is not even getting delayed with the controller set to WriteThrough, which means it can cope with the load just fine.
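For reference, a quick way to double-check that it keeps coping in WriteThrough mode is to watch its replication lag. This is only an illustrative sketch (it assumes the root MySQL client credentials are already configured on the host), not something taken from this task:

# Illustrative check: a steadily low Seconds_Behind_Master while the controller
# cache is disabled means the m3 write load fits within WriteThrough performance.
root@db2042:~# mysql -e "SHOW SLAVE STATUS\G" | grep -E "Slave_SQL_Running|Seconds_Behind_Master"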
This is the HW log from the first time the battery failed (16 Aug):
description=POST Error: 1705-Slot X Drive Array - Please replace Cache Module Super-Cap. Caching will be enabled once Super-Cap has been replaced and charged
I will reboot it and we'll see what happens.
Mentioned in SAL (#wikimedia-operations) [2018-08-29T07:47:50Z] <marostegui> Reboot db2042 - T202051
After the reboot it has finally marked itself as failed:
date=08/29/2018 time=07:54 description=POST Error: 1705-Slot X Drive Array - Please replace Cache Module Super-Cap. Caching will be enabled once Super-Cap has been replaced and charged
root@db2042:~# hpssacli controller all show status
Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)
I suggest not doing anything with this host, as per T202051#4541027.
Once we've got the new x1 host, we can move db2033 and replace db2042, or use db2064's BBU for this host.
Mentioned in SAL (#wikimedia-operations) [2018-08-29T08:13:59Z] <marostegui> Force WriteBack policy on db2042 T202051
I have forced the controller to WriteBack to let the server catch up:
root@db2042:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled
root@db2042:~# hpssacli ctrl slot=0 modify dwc=enable
root@db2042:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Enabled
And once it caught up, I reverted it:
root@db2042:~# hpssacli ctrl slot=0 modify dwc=disable
root@db2042:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled
Mentioned in SAL (#wikimedia-operations) [2018-08-29T16:49:11Z] <marostegui> Force RAID controller to WB policy T202051
I have forced db2042 to WB again, as it was lagging too far behind:
16:45 < icinga-wm> PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 329.98 seconds
16:45 < icinga-wm> PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 335.53 seconds
root@db2042:~# hpssacli ctrl slot=0 modify dwc=enable
root@db2042:~# hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Enabled
I am marking this as Stalled; if no one objects, I think we should proceed with T202051#4541285, leaving the RAID controller with WB enforced.
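For completeness, a minimal way to re-check that state after any future reboot, reusing only the commands already shown above; the expected values are the ones from the outputs earlier in this task:

root@db2042:~# hpssacli ctrl slot=0 show detail | grep -E "Cache Status|Drive Write Cache|Battery/Capacitor Status"
# Expected while the BBU stays failed and the drive write cache is forced on:
#   Cache Status: Permanently Disabled
#   Drive Write Cache: Enabled
#   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)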
Mentioned in SAL (#wikimedia-operations) [2018-10-18T14:21:27Z] <banyek> shutting down mysql and powering down db2042 (T202051)
db2042 failed to start ferm at reboot due to a DNS query timeout:
Oct 18 15:53:04 db2042 ferm[837]: DNS query for 'prometheus2003.codfw.wmnet' failed: query timed out
[...SNIP...]
Oct 18 15:53:04 db2042 systemd[1]: Failed to start ferm firewall configuration.
Apparently the two Icinga checks that report it went unnoticed, probably because the host was downtimed for the scheduled maintenance.
I've manually started ferm and it all worked fine, but the host had been running without ferm since the reboot.
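For reference, the manual recovery amounts to the following sketch; the status and iptables checks are just an assumed way to verify, not copied from this task:

root@db2042:~# systemctl start ferm
root@db2042:~# systemctl status ferm    # should now report the unit as active
root@db2042:~# iptables -L -n | head    # confirm the firewall rules actually got loaded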
I'm opening a separate task to fix the puppet/systemd side of it.
Sorry, I searched but I didn't find the other task; in your comment above you probably meant to link to it but linked to this task by mistake. I am OK with any method, as long as there is at least one task open.
My only request is to keep "db2042", "RAID" and "battery" in the title so I can find it in the future. :-D
I have merged this into T209261, as that one has a more "important" title, so we don't forget! :)