
db1079 BBU crashed host rebooted
Closed, Resolved (Public)

Description

This host will be replaced in Q2.
As is usual with HP hosts, the BBU tends to fail after a few years, and once it fails the host reboots itself:

/system1/log1/record15
  Targets
  Properties
    number=15
    severity=Caution
    date=07/06/2020
    time=13:45
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
  Verbs
    cd version exit show

/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Critical
    date=07/06/2020
    time=13:44
    description=ASR Detected by System ROM
  Verbs
    cd version exit show


</system1/log1>hpiLO-> show record13

status=0
status_tag=COMMAND COMPLETED
Mon Jul  6 14:02:39 2020



/system1/log1/record13
  Targets
  Properties
    number=13
    severity=Caution
    date=07/06/2020
    time=13:25
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs

[13:52:22] <+icinga-wm> PROBLEM - mysqld processes #page on db1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
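
The iLO records above are listed newest first, so the order of events is easier to follow once they are sorted by time. A minimal sketch, assuming the `key=value` layout shown in the output above; the function names are hypothetical and the descriptions in the sample are shortened copies of the ones in this task:

```python
from datetime import datetime


def parse_ilo_records(text):
    """Parse iLO 'show record' output into a list of property dicts.

    Only key=value lines inside a 'Properties' section are kept, so
    lines such as 'status=0' outside a record are ignored.
    """
    records = []
    current = None
    in_properties = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/system1/log1/record"):
            current = {}
            records.append(current)
            in_properties = False
        elif line == "Properties":
            in_properties = True
        elif line in ("Targets", "Verbs"):
            in_properties = False
        elif in_properties and current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key] = value
    return records


def record_time(record):
    # Dates in this log are MM/DD/YYYY and times are HH:MM.
    return datetime.strptime(
        record["date"] + " " + record["time"], "%m/%d/%Y %H:%M"
    )


# Sample taken from the records in this task (descriptions shortened).
ILO_SAMPLE = """\
/system1/log1/record15
  Targets
  Properties
    number=15
    severity=Caution
    date=07/06/2020
    time=13:45
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure
  Verbs
    cd version exit show

/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Critical
    date=07/06/2020
    time=13:44
    description=ASR Detected by System ROM
  Verbs
    cd version exit show

status=0
status_tag=COMMAND COMPLETED

/system1/log1/record13
  Targets
  Properties
    number=13
    severity=Caution
    date=07/06/2020
    time=13:25
    description=Smart Storage Battery failure (Battery 1)
  Verbs
"""

ordered = sorted(parse_ilo_records(ILO_SAMPLE), key=record_time)
for rec in ordered:
    print(rec["time"], rec["severity"], rec["description"])
```

Read in this order, the battery failure (record 13) comes first, followed by the ASR event (record 14) and the POST error logged on restart (record 15).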

Event Timeline

Change 609788 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1079: Add broken BBU status

https://gerrit.wikimedia.org/r/609788

@Marostegui yes, I believe we still have some. I will be on site in a few hours if we want to change it today.

db1079 was depooled: P11751
Main traffic was removed from db1136, as it is currently the only s7 API host in eqiad: P11752

Both changes should be reverted or taken into account when the host is repooled.

@Marostegui yes, I believe we still have some. I will be on site in a few hours if we want to change it today.

Excellent, I am going to leave the host powered off for you. Once you are done, please bring it back and I can take it from there.
Thank you

Mentioned in SAL (#wikimedia-operations) [2020-07-06T14:09:08Z] <marostegui> Stop MySQL and poweroff db1079 T257216

Change 609788 merged by Marostegui:
[operations/puppet@production] db1079: Add broken BBU status

https://gerrit.wikimedia.org/r/609788

Thank you, I can see it now!

=> controller all show status

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK


=>
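
For reference, the status output above could be checked mechanically before starting MySQL. This is only a sketch, assuming the `Label: value` layout shown above; the function names are hypothetical and this is not an existing check on the host:

```python
def parse_controller_status(text):
    """Turn 'Label: value' lines from the controller status output into a dict."""
    status = {}
    for raw in text.splitlines():
        line = raw.strip()
        if ":" in line:
            label, _, value = line.partition(":")
            status[label.strip()] = value.strip()
    return status


def battery_ok(status):
    # Only the exact value 'OK' counts as healthy; a missing field does not.
    return status.get("Battery/Capacitor Status") == "OK"


# Sample copied from the output in this task.
CONTROLLER_SAMPLE = """\
Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK
"""

controller_status = parse_controller_status(CONTROLLER_SAMPLE)
print("battery ok:", battery_ok(controller_status))
print("cache status:", controller_status["Cache Status"])
```

The sketch reports the cache status separately, since in the output above it reads "Not Configured" even though the battery is back to OK.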

I have started MySQL and it is now catching up on replication.

Mentioned in SAL (#wikimedia-operations) [2020-07-07T06:18:50Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1079 T257216', diff saved to https://phabricator.wikimedia.org/P11760 and previous config saved to /var/cache/conftool/dbconfig/20200707-061849-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-07T06:20:08Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give db1136 some weight back into main traffic T257216', diff saved to https://phabricator.wikimedia.org/P11761 and previous config saved to /var/cache/conftool/dbconfig/20200707-062008-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-07T06:37:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1079 and give more main weight to db1136 T257216', diff saved to https://phabricator.wikimedia.org/P11762 and previous config saved to /var/cache/conftool/dbconfig/20200707-063737-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-07T07:27:05Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1079 and give more main weight to db1136 T257216', diff saved to https://phabricator.wikimedia.org/P11764 and previous config saved to /var/cache/conftool/dbconfig/20200707-072703-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-07T07:39:18Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db1079 and db1136 T257216', diff saved to https://phabricator.wikimedia.org/P11765 and previous config saved to /var/cache/conftool/dbconfig/20200707-073918-marostegui.json

db1079 fully repooled, db1136 also got its original weight restored.
All done!
Thank you, John, for replacing the BBU so fast!
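
For future reference, the gradual repool recorded in the SAL entries above can be planned as a simple ramp. A rough sketch, assuming evenly spaced steps; the percentages are hypothetical (the real values are in the linked pastes, not in this task), and `ramp_percentages` is an illustrative helper, not part of dbctl:

```python
def ramp_percentages(steps):
    """Return evenly spaced pool percentages, ending at 100."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [round(100 * i / steps) for i in range(1, steps + 1)]


# db1079 was repooled over four commits in this task; the values below
# are illustrative only and do not come from the actual dbctl diffs.
plan = ramp_percentages(4)
for step, pct in enumerate(plan, start=1):
    print(f"step {step}: pool db1079 at {pct}%")
```

Each step would be followed by a pause to watch the host before moving to the next one, as was done between the commits above.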