Error says:
PROBLEM - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
It has been temporarily depooled on: https://gerrit.wikimedia.org/r/406847
Subject | Repo | Branch | Lines +/-
---|---|---|---
db-eqiad.php: Repool db1051 on vslow | operations/mediawiki-config | master | +3 -1
Mentioned in SAL (#wikimedia-operations) [2018-01-31T07:08:10Z] <marostegui> Force BBU relearn on db1051 - T186049
root@db1051:~# megacli -AdpBbuCmd -a0
BBU status for Adapter: 0
  BatteryType: BBU
  Battery State: Unknown
  Battery backup charge time : 0 hours
BBU Capacity Info for Adapter: 0
  Relative State of Charge: 15 %
  Absolute State of charge: 0 %
  Remaining Capacity: 82 mAh
  Full Charge Capacity: 581 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 7 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 4
  Max Error = 0 %
  Remaining Capacity Alarm = 0 mAh
  Remining Time Alarm = 0 Min
BBU Design Info for Adapter: 0
  Date of Manufacture: 07/18, 2011
  Design Capacity: 90 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name:
  Firmware Version : 0148 03
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
  Module Version = 0148 03
  Transparent Learn = 1
  App Data = 1
BBU Properties for Adapter: 0
  Auto Learn Period: 90 Days
  Next Learn time: None
  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled
Exit Code: 0x00
I have forced a relearn in case this is a one-time thing (most likely it is not).
If it fails again, we should replace the BBU.
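For anyone who wants to script this kind of check, here is a minimal sketch of reading the BBU report shown above. It is not the monitoring check used on this host. It assumes the `megacli -AdpBbuCmd -a0` output has already been captured as text with one field per line, and the function names (`parse_megacli_fields`, `relative_charge_pct`, `bbu_needs_attention`) are made up for illustration.

```python
import re
from typing import Dict, Optional

# megacli mixes "Key: value" and "Key = value" lines, so accept both.
_FIELD_RE = re.compile(r"^\s*([^:=]+?)\s*[:=]\s*(.*?)\s*$")


def parse_megacli_fields(output: str) -> Dict[str, str]:
    """Turn captured megacli text into a {field name: value} dict."""
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        match = _FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def relative_charge_pct(fields: Dict[str, str]) -> Optional[int]:
    """Return 'Relative State of Charge' as an int, or None if absent."""
    raw = fields.get("Relative State of Charge")
    if raw is None:
        return None
    return int(raw.rstrip("%").strip())


def bbu_needs_attention(fields: Dict[str, str]) -> bool:
    """Hypothetical rule: anything other than 'Optimal' is worth a look."""
    return fields.get("Battery State") != "Optimal"


# A few lines copied from the report above.
SAMPLE = """\
BBU status for Adapter: 0
Battery State: Unknown
Relative State of Charge: 15 %
Max Error = 0 %
"""

fields = parse_megacli_fields(SAMPLE)
print(fields["Battery State"], relative_charge_pct(fields), bbu_needs_attention(fields))
```

The "anything but Optimal" rule is only an assumption for this sketch; a real check may also want to look at the charge level or the learn cycle state.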
After the relearn:
root@db1051:~# megacli -AdpBbuCmd -a0 | grep Optimal
  Battery State: Optimal
root@db1051:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
This server definitely needs a BBU replacement.
@Cmjohnson can you let us know a day that works for you to get it replaced?
Thanks!
Mentioned in SAL (#wikimedia-operations) [2018-02-06T14:53:01Z] <marostegui> Poweroff db1051 for BBU replacement - T186049
@Cmjohnson this server is now off.
Feel free to power it on once you've done the replacement
Thanks!
After a full recharge it looks good now:
root@db1051:~# megacli -AdpBbuCmd -a0
BBU status for Adapter: 0
  BatteryType: BBU
  Voltage: 3936 mV
  Current: 0 mA
  Temperature: 48 C
  Battery State: Optimal
  BBU Firmware Status:
The RAID is back to WriteBack (WB):
root@db1051:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
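The original alert fired because the logical drive's current cache policy was WriteThrough rather than WriteBack. A rough sketch of that comparison, assuming the `megacli -ldinfo` output is available as text with one field per line, could look like this. The helper names are hypothetical and this is not the actual monitoring check.

```python
from typing import List


def current_cache_policy(ldinfo_output: str) -> List[str]:
    """Return the comma-separated modes from the 'Current Cache Policy' line."""
    for line in ldinfo_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Current Cache Policy:"):
            modes = stripped.split(":", 1)[1]
            return [mode.strip() for mode in modes.split(",")]
    return []  # line not found


def is_writeback(ldinfo_output: str) -> bool:
    """True only if the current (not the default) policy includes WriteBack."""
    return "WriteBack" in current_cache_policy(ldinfo_output)


# Policy lines as they appear after the BBU replacement.
AFTER_FIX = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""

# Illustrative degraded state: the default stays WriteBack, the current
# policy falls back to WriteThrough while the BBU is bad.
DEGRADED = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""

print(is_writeback(AFTER_FIX), is_writeback(DEGRADED))
```

Only the "Current" line is checked: with "No Write Cache if Bad BBU" set, the default policy can still read WriteBack while the controller is actually running in WriteThrough.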
Thanks again Chris!
Change 408761 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1051 on vslow
Change 408761 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1051 on vslow
Mentioned in SAL (#wikimedia-operations) [2018-02-07T07:00:44Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Start repooling db1051 after the BBU change - T186049 (duration: 01m 15s)
Mentioned in SAL (#wikimedia-operations) [2018-02-07T09:52:41Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Fully repool db1051 after the BBU change - T186049 (duration: 01m 14s)