
Replace RAID controller battery in an-worker1082
Closed, Resolved (Public)

Description

Hello,

It seems that another RAID controller battery has failed in a Hadoop worker: an-worker1082

Would you be able to investigate options for replacement please?

We saw a very similar incident recently in T308434. Although that server was out of warranty, you were able to find a compatible battery in one of the Hadoop workers due for refresh (analytics10[58-69]).

Event Timeline

Looks like it's an R730 that's out of warranty. @Cmjohnson or @Jclark-ctr - do we still have any extra RAID controller batteries lying around? Thanks, Willy

@BTullis I do have a RAID controller. When do you want to schedule this? Tomorrow, Wednesday, at 15:30 UTC?

@Cmjohnson I can do it now or in the next 30 minutes if that's good for you? Otherwise, yes tomorrow at 15:30 UTC is good too.

@BTullis I am a little behind but can we do this now?

FYI this is still randomly alerting on IRC:

icinga-wm| PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s)
           must have write cache policy WriteBack, currently using: WriteThrough,
           WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough,
           WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough,
           WriteThrough, WriteThrough
           https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
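The alert above reports that all 13 logical drives (LDs) are using WriteThrough instead of the expected WriteBack, which is consistent with a controller that has dropped its write cache because the battery failed. As a rough illustration only, the message text can be summarized with a few lines of Python. The function name `summarize_alert` and the regular expression are assumptions made for this sketch; they are not part of the actual Icinga check.

```python
import re
from collections import Counter

# Hypothetical pattern, written for this sketch from the alert text quoted above.
PATTERN = re.compile(
    r"(?P<count>\d+) LD\(s\) must have write cache policy (?P<expected>\w+), "
    r"currently using: (?P<current>.+)"
)


def summarize_alert(message):
    """Return the expected policy, the reported LD count, and a tally of
    the policies currently in use, or None if the text does not match."""
    # Collapse whitespace so an alert wrapped over several IRC lines
    # is parsed as a single message.
    flat = " ".join(message.split())
    match = PATTERN.search(flat)
    if match is None:
        return None
    # Each comma-separated item starts with a policy name; anything after it
    # (such as the trailing documentation URL) is ignored.
    policies = [
        item.split()[0]
        for item in match.group("current").split(",")
        if item.strip()
    ]
    return {
        "expected": match.group("expected"),
        "reported_count": int(match.group("count")),
        "current": Counter(policies),
    }


if __name__ == "__main__":
    alert = (
        "PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s)\n"
        "  must have write cache policy WriteBack, currently using: "
        + ", ".join(["WriteThrough"] * 13)
        + "\n  https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring"
    )
    print(summarize_alert(alert))
```

A summary like this makes it easy to confirm that the number of listed policies matches the LD count the check reports, and that none of the drives is still on the expected policy.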

@BTullis Can we try and do this Monday, please?

Silenced the alert in alerts.wikimedia.org and Icinga for a couple of weeks :)

Mentioned in SAL (#wikimedia-operations) [2022-08-02T10:50:19Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: T312626 btullis

Mentioned in SAL (#wikimedia-operations) [2022-08-02T10:50:32Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: T312626 btullis

@Cmjohnson - Apologies for all of the delay on this, I just kept missing you. I've now downtimed an-worker1082 for 3 days and I've shut it down already.
If it's convenient you can do the battery swap whenever you like and just boot it afterwards. Feel free to ping me on IRC if you'd like me to check anything.

@BTullis Replaced the battery and powered on; everything looks good from my end. Resolving.