
Icinga/MegaRAID alert on an-worker1100
Closed, Resolved (Public)

Description

Noticed on alerts.wikimedia.org:

CRITICAL: 23 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, ...

@elukey thought it might be because the backup battery unit (BBU) is not working, but the BBU reports itself as healthy. We're not sure whether that self-report can be trusted, however.
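To see how many logical drives are affected without reading the full alert text, the controller's own report can be summarized. The following is a minimal sketch, not the check that Icinga runs: the function name `count_writethrough` is hypothetical, and it assumes the input comes from `megacli -LDInfo -LAll -aAll` and that each logical drive reports one line starting with `Current Cache Policy:`.

```shell
#!/bin/sh
# Count logical drives whose *current* cache policy is WriteThrough.
# Reads LD info text on stdin.
# Assumption: one "Current Cache Policy: ..." line per logical drive;
# "Default Cache Policy" lines are deliberately ignored.
count_writethrough() {
    awk -F': *' '
        /^[ \t]*Current Cache Policy/ {
            total++
            if ($2 ~ /^WriteThrough/) wt++
        }
        END { printf "%d of %d LD(s) in WriteThrough\n", wt, total }
    '
}

# Possible use on the host (needs root and the megacli binary):
#   sudo megacli -LDInfo -LAll -aAll | count_writethrough
```

Comparing the current policy against the default one is useful here, because a controller configured for WriteBack may fall back to WriteThrough on its own when it considers the BBU unusable.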

Event Timeline

elukey added a subscriber: Cmjohnson.

@Cmjohnson this is kind of strange: megacli doesn't report any problem with the BBU, but I cannot enforce WriteBack on the RAID controller, as if the BBU weren't working. Any ideas or tips about what to do?

@elukey that's a first! Maybe the RAID BIOS settings are wrong?

elukey@an-worker1100:~$ sudo megacli -AdpBbuCmd -BbuLearn -aAll
                                     
Adapter 0: BBU Learn Failed

Exit Code: 0x01

This is also weird...

Mentioned in SAL (#wikimedia-analytics) [2021-04-08T15:35:54Z] <elukey> reboot an-worker1100 to see if it helps with the strange BBU behavior in T279475

The alert recovered, but I discovered a bad disk that needs to be replaced (I had to clear the preserved cache to allow the host to boot, and one partition didn't mount). Hopefully we'll get an automatic task; if not, I'll create one!

One drive is in a Foreign state, no idea why (its firmware state is also Unconfigured(good)):

Enclosure Device ID: 32
Slot Number: 10
Enclosure position: 1
Device Id: 10
WWN: 5000c500cf8ee990
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: NB33
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b3ae7adbca
Connected Port Number: 0(path0) 
Inquiry Data:             W462MHHRST2000NX0423                                NB33
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: Foreign 
Foreign Secure: Drive is not secured by a foreign lock key
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No

I had to run:

megacli -CfgForeign -Scan -a0
megacli -CfgForeign -Clear -a0
megacli -CfgLdAdd -r0 [32:10] -a0

After that, the disk came back to life and I was able to remount its partition. Doing another reboot to see how it goes.
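The `[32:10]` argument in the commands above is the drive's enclosure and slot, taken from the `Enclosure Device ID` and `Slot Number` fields of the drive listing. A small sketch for picking out foreign drives from such a listing follows; the function name `list_foreign_drives` is hypothetical, and it assumes the listing comes from `megacli -PDList -aAll` with the same field labels as the output pasted earlier.

```shell
#!/bin/sh
# Print "[enclosure:slot]" for every physical drive flagged as Foreign.
# Reads a physical-drive listing on stdin.
# Assumption: fields appear per drive in the order shown in this task
# (Enclosure Device ID, Slot Number, ..., Foreign State).
list_foreign_drives() {
    awk -F': *' '
        { sub(/[ \t\r]+$/, "") }            # strip trailing whitespace
        /^Enclosure Device ID/ { enc = $2 }
        /^Slot Number/         { slot = $2 }
        /^Foreign State/ {
            if ($2 == "Foreign") printf "[%s:%s]\n", enc, slot
        }
    '
}

# Possible use on the host (needs root and the megacli binary):
#   sudo megacli -PDList -aAll | list_foreign_drives
```

This only reports candidates. Clearing a foreign configuration and recreating the logical drive discards whatever RAID metadata the disk carried, so the output is best reviewed by hand before running any `-CfgForeign -Clear` command.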

elukey claimed this task.

All good, I'll re-open in case something weird comes up, but now all disks are good :)