Hello,
db1031 seems to have a broken BBU:
First, there was a sharp increase in disk utilization: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1031&var-network=eth0&from=1495512674378&to=1495527469190&panelId=19&fullscreen
root@db1031:~# megacli -AdpBbuCmd -a0
BBU status for Adapter: 0
BatteryType: BBU
Voltage: 4021 mV
Current: 0 mA
Temperature: 32 C
Battery State: Degraded(Need Attention)
A manual learn is required.
BBU Firmware Status:
Charging Status : None
With the BBU degraded, the current cache policy has fallen back to WriteThrough:
root@db1031:~# megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 1.633 TB
Sector Size : 512
Mirror Data : 1.633 TB
State : Optimal
Strip Size : 256 KB
Number Of Drives per span:2
Span Depth : 6
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
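The Default vs. Current comparison above could be checked automatically. The following is only a rough sketch, assuming the `megacli -LDInfo` output is line-based as it is on the terminal and that there is a single virtual drive; the function names `cache_policies` and `write_policy_degraded` are made up for illustration and are not part of any existing tooling.

```python
# Hypothetical helper: compare Default and Current cache policy lines
# from `megacli -LDInfo -Lall -aALL` output (single virtual drive assumed).

SAMPLE_LDINFO = """\
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
"""


def cache_policies(ldinfo_text):
    """Return a dict mapping each cache policy line to its list of settings."""
    policies = {}
    for line in ldinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in ("Default Cache Policy", "Current Cache Policy"):
            policies[key] = [part.strip() for part in value.split(",")]
    return policies


def write_policy_degraded(ldinfo_text):
    """True if the default is WriteBack but the controller is using WriteThrough."""
    policies = cache_policies(ldinfo_text)
    default = policies.get("Default Cache Policy", [""])[0]
    current = policies.get("Current Cache Policy", [""])[0]
    return default == "WriteBack" and current == "WriteThrough"


if __name__ == "__main__":
    print(write_policy_degraded(SAMPLE_LDINFO))
```

With more than one virtual drive the later policy lines would overwrite the earlier ones, so the output would need to be split per "Virtual Drive:" block first.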
I first forced a learn cycle to see if that would help the BBU recover (as it sometimes did on db1048, T160731#3109104):
root@db1031:~# megacli -AdpBbuCmd -BbuLearn -aALL -NoLog
Adapter 0: BBU Learn Succeeded.
Exit Code: 0x00
But after a while nothing had changed:
BBU status for Adapter: 0
BatteryType: BBU
Voltage: 4021 mV
Current: 0 mA
Temperature: 32 C
Battery State: Degraded(Need Attention)
A manual learn is required.
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
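For reference, the BBU status above can be turned into key/value pairs to check the battery state in a script. This is a minimal sketch, assuming line-based `megacli -AdpBbuCmd` output; `bbu_fields` and `bbu_degraded` are hypothetical names, and only the "Degraded" state seen on this host is tested for.

```python
# Hypothetical helper: parse `megacli -AdpBbuCmd -a0` output into a dict.

SAMPLE_BBU = """\
BBU status for Adapter: 0
BatteryType: BBU
Voltage: 4021 mV
Current: 0 mA
Temperature: 32 C
Battery State: Degraded(Need Attention)
A manual learn is required.
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
"""


def bbu_fields(status_text):
    """Return 'key: value' pairs from the status output.

    Some keys (Voltage, Temperature) appear twice: once as a reading and
    once under the firmware status. Only the first occurrence is kept.
    """
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields.setdefault(key.strip(), value.strip())
    return fields


def bbu_degraded(status_text):
    """True if the reported Battery State starts with 'Degraded'."""
    return bbu_fields(status_text).get("Battery State", "").startswith("Degraded")


if __name__ == "__main__":
    fields = bbu_fields(SAMPLE_BBU)
    print(fields["Battery State"], fields["Learn Cycle Requested"])
```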
Just in case, I disabled the automatic learn cycle before forcing the policy to WriteBack, to avoid any issues if the BBU misbehaves again:
root@db1031:~# megacli -AdpBbuCmd -a0 | grep Auto-Learn
Auto-Learn Mode: Warn via Event
root@db1031:~# echo "autoLearnMode=1" > disable_learn
root@db1031:~# megacli -AdpBbuCmd -SetBbuProperties -f disable_learn -a0
Adapter 0: Set BBU Properties Succeeded.
Exit Code: 0x00
And then forced WriteBack:
root@db1031:~# megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll
Set Write Policy to Forced WriteBack on Adapter 0, VD 0 (target id: 0) success
Exit Code: 0x00
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Once it was set to WB, disk IO dropped, as can be seen at: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1031&var-network=eth0&from=1495511890227&to=1495527409190