
db1088 crashed
Closed, Resolved, Public

Description

22:50 <+icinga-wm_>	PROBLEM - Host db1088 is DOWN: PING CRITICAL - Packet loss = 100%

Host has been depooled from s6, the rest of which hopefully will survive with somewhat increased read load until Monday.

Event Timeline

CDanis triaged this task as High priority. Jun 20 2020, 11:00 PM
CDanis created this task.

Looks like it rebooted by itself (which hilariously was the first thing to make it page), but I'm leaving it depooled.

Kormat added a subscriber: wiki_willy.
Kormat added a subscriber: Kormat.

DCOps: The BBU on this machine has failed. Do you have a spare BBU in the DC, or if not, can we please order a replacement? Cheers.

/system1/log1/record7
  Targets
  Properties
    number=7
    severity=Caution
    date=06/20/2020
    time=22:42
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

/system1/log1/record8
  Targets
  Properties
    number=8
    severity=Critical
    date=06/20/2020
    time=23:00
    description=ASR Detected by System ROM
  Verbs
    cd version exit show

/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Caution
    date=06/20/2020
    time=23:00
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
  Verbs
    cd version exit show
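
For anyone who wants to line these records up as a timeline, here is a minimal sketch that parses the pasted iLO event log text. It assumes the layout shown above (a `/system1/log1/recordN` header followed by `key=value` properties); the function name `parse_records` and the `when` field are made up for illustration and are not part of any HPE tooling.

```python
from datetime import datetime

# The three records pasted above, verbatim.
SAMPLE = """\
/system1/log1/record7
  Targets
  Properties
    number=7
    severity=Caution
    date=06/20/2020
    time=22:42
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

/system1/log1/record8
  Targets
  Properties
    number=8
    severity=Critical
    date=06/20/2020
    time=23:00
    description=ASR Detected by System ROM
  Verbs
    cd version exit show

/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Caution
    date=06/20/2020
    time=23:00
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
  Verbs
    cd version exit show
"""


def parse_records(text):
    """Return one dict per record, with a combined 'when' datetime added."""
    records = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("/system1/log1/record"):
            # A new record header starts a fresh property dict.
            current = {"path": stripped}
            records.append(current)
        elif current is not None and "=" in stripped:
            # Split on the first '=' only, so values may contain '='.
            key, _, value = stripped.partition("=")
            current[key] = value
    for rec in records:
        # Dates in this log are MM/DD/YYYY with a 24-hour HH:MM time.
        rec["when"] = datetime.strptime(
            f"{rec['date']} {rec['time']}", "%m/%d/%Y %H:%M"
        )
    return records


if __name__ == "__main__":
    for rec in parse_records(SAMPLE):
        print(rec["when"].isoformat(), rec["severity"], rec["description"])
```

Read this way, the battery failure is logged at 22:42 and the ASR (Automatic Server Recovery) entry at 23:00, i.e. 18 minutes later.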

MySQL has been started and is catching up on replication. Once that has completed, we'll perform a data consistency check.

Change 606962 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Silence notifications for db1088

https://gerrit.wikimedia.org/r/606962

Change 606962 merged by Kormat:
[operations/puppet@production] mariadb: Silence notifications for db1088

https://gerrit.wikimedia.org/r/606962

wiki_willy added a subscriber: Jclark-ctr.

@Jclark-ctr - I think there are some BBUs left over from the last time you requested some spares to be ordered, but let me know if not. Thanks, Willy

Data consistency check passed.

@Kormat I am going to apply the MCR schema change. Once I am done, maybe we can reimage this host to Buster and MariaDB 10.4 while DCOps look for a BBU?

@Kormat MCR schema change applied; you can proceed with the reimage anytime.
Thank you!

Reimage done, and host has caught up with replication.

@Marostegui we have a spare BBU. I happen to be on site today. Are you available to assist?

@Jclark-ctr - I'm available. I can power the host off now for you to do the replacement. Just let me know when it's back. Cheers.

@Jclark-ctr : great, thanks :)

The diagnostics are happy with the new BBU:

root@db1088:~# hpssacli ctrl all show status

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK
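
If we ever want to check this automatically rather than by eye, a rough sketch follows. It only parses text in the format pasted above (an unindented controller line followed by indented `Field: value` lines); `parse_ctrl_status` and `battery_ok` are hypothetical helper names, and the `Failed` value in the usage example is an assumed placeholder, not output captured from a real controller.

```python
# Output of `hpssacli ctrl all show status` as pasted above.
SAMPLE = """\

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK
"""


def parse_ctrl_status(output):
    """Map each controller header line to a dict of its status fields."""
    controllers = {}
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            # Unindented lines name a controller.
            current = controllers.setdefault(line.strip(), {})
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return controllers


def battery_ok(controllers):
    """True only if at least one controller was parsed and all report OK."""
    return bool(controllers) and all(
        fields.get("Battery/Capacitor Status") == "OK"
        for fields in controllers.values()
    )


if __name__ == "__main__":
    print(battery_ok(parse_ctrl_status(SAMPLE)))
    # Hypothetical failing value, for illustration only.
    broken = SAMPLE.replace(
        "Battery/Capacitor Status: OK", "Battery/Capacitor Status: Failed"
    )
    print(battery_ok(parse_ctrl_status(broken)))
```

Note that empty input is treated as not OK, so a failed or missing command run is not mistaken for a healthy battery.
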

Kormat removed a project: ops-eqiad.

Re-opening so we can keep track of returning the host to service.

Should a BBU failure cause a reboot?

Is this configurable?

description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.


HP says the server should not reboot due to a battery failure: https://support.hpe.com/hpesc/public/docDisplay?docId=mmr_kc-0126260#:~:text=POST%20Error%3A%20313%20%2D%20HPE%20Smart,other%20reasons%20for%20a%20reboot.

"Note: Ideally battery failure will not cause a server to reboot, there could be other reasons for a reboot. Logs from server has to be checked by HPE to determine the cause of reboot."

https://support.hpe.com/hpesc/public/docDisplay?docId=mmr_sf-EN_US000019604 - suggests controller firmware update

No, a BBU failure should not trigger a host reboot. Unfortunately, this is something we've seen with HP hosts over the years. Dell hosts have also sometimes shown similar behaviour, which ended in RAID controller replacements.

We contacted HP about it years ago, when we first saw it, and they recommended upgrading the firmware. It is hard to say whether that helped: either it didn't really solve the issue, or those upgraded hosts simply never suffered another BBU failure. BBU issues aren't very common, and they rarely hit the same host twice.
Interestingly, this always happens when hosts are old and out of support, so we cannot really get a new controller or troubleshoot further. This host isn't an exception: it is out of support and due to be refreshed next FY.

Change 607440 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] Revert "mariadb: Silence notifications for db1088"

https://gerrit.wikimedia.org/r/607440

Change 607440 merged by Kormat:
[operations/puppet@production] Revert "mariadb: Silence notifications for db1088"

https://gerrit.wikimedia.org/r/607440

Mentioned in SAL (#wikimedia-operations) [2020-06-24T08:31:21Z] <kormat@cumin1001> dbctl commit (dc=all): 'Pool db1088 @ 20% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11645 and previous config saved to /var/cache/conftool/dbconfig/20200624-083120-kormat.json

Mentioned in SAL (#wikimedia-operations) [2020-06-24T09:36:25Z] <kormat@cumin1001> dbctl commit (dc=all): 'Pool db1088 @ 50% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11647 and previous config saved to /var/cache/conftool/dbconfig/20200624-093624-kormat.json

Mentioned in SAL (#wikimedia-operations) [2020-06-24T09:53:39Z] <kormat@cumin1001> dbctl commit (dc=all): 'Pool db1088 @ 75% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11648 and previous config saved to /var/cache/conftool/dbconfig/20200624-095338-kormat.json

Mentioned in SAL (#wikimedia-operations) [2020-06-24T15:36:05Z] <kormat@cumin1001> dbctl commit (dc=all): 'Pool db1088 @ 100% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11652 and previous config saved to /var/cache/conftool/dbconfig/20200624-153604-kormat.json
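
To see the repool ramp at a glance, here is a small sketch that pulls the timestamp and percentage out of the four SAL entries above. The regular expression is written against exactly this message format and is an assumption about nothing beyond it; `repool_steps` is a made-up helper, not part of dbctl.

```python
import re
from datetime import datetime

# The dbctl commit messages from the SAL entries above (trailing text omitted).
SAL_LINES = [
    "[2020-06-24T08:31:21Z] <kormat@cumin1001> dbctl commit (dc=all): 'Pool db1088 @ 20% into s6 T255927'",
    "[2020-06-24T09:36:25Z] <kormat@cumin1001> dbctl commit (dc=all): 'Pool db1088 @ 50% into s6 T255927'",
    "[2020-06-24T09:53:39Z] <kormat@cumin1001> dbctl commit (dc=all): 'Pool db1088 @ 75% into s6 T255927'",
    "[2020-06-24T15:36:05Z] <kormat@cumin1001> dbctl commit (dc=all): 'Pool db1088 @ 100% into s6 T255927'",
]

PATTERN = re.compile(
    r"\[(?P<ts>[^\]]+)\] <[^>]+> dbctl commit \(dc=all\): "
    r"'Pool (?P<host>\S+) @ (?P<pct>\d+)% into (?P<section>\S+)"
)


def repool_steps(lines):
    """Return (timestamp, host, percentage, section) for each matching line."""
    steps = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
            steps.append((ts, match["host"], int(match["pct"]), match["section"]))
    return steps


if __name__ == "__main__":
    steps = repool_steps(SAL_LINES)
    for ts, host, pct, section in steps:
        print(f"{ts:%H:%M:%S}  {host} -> {pct}% in {section}")
    print("total ramp time:", steps[-1][0] - steps[0][0])
```

The ramp goes 20%, 50%, 75%, 100%, with the longest wait before the final step.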

Is there anything else left to do here after the 100% repool, or can we close this?
Thank you!