
db1085 crashed
Closed, Resolved (Public)

Description

Broken BBU

Event Timeline

CDanis triaged this task as High priority. Jul 19 2020, 6:47 PM

BBU issues, as expected (this host is also scheduled to be refreshed next quarter):

/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Caution
    date=07/19/2020
    time=18:44
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
  Verbs
    cd version exit show


</system1/log1>hpiLO-> show record15

status=0
status_tag=COMMAND COMPLETED
Sun Jul 19 18:48:08 2020



/system1/log1/record15
  Targets
  Properties
    number=15
    severity=Critical
    date=07/19/2020
    time=18:43
    description=ASR Detected by System ROM
  Verbs
    cd version exit show


</system1/log1>hpiLO-> show record14

status=0
status_tag=COMMAND COMPLETED
Sun Jul 19 18:48:11 2020



/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Caution
    date=07/19/2020
    time=18:26
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

Pretty much the same issue as T258336
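
The three iLO records pasted above are easier to read as a timeline. The following is a minimal sketch, not tooling used on this task: it assumes the `show recordN` output has been saved as text, and the function name `parse_ilo_records` and the `when` key are hypothetical. The record text is copied from the log above.

```python
from datetime import datetime
import re

# Record text copied from the iLO output above (prompt and status lines omitted).
ILO_LOG = """\
/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Caution
    date=07/19/2020
    time=18:44
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
  Verbs
    cd version exit show

/system1/log1/record15
  Targets
  Properties
    number=15
    severity=Critical
    date=07/19/2020
    time=18:43
    description=ASR Detected by System ROM
  Verbs
    cd version exit show

/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Caution
    date=07/19/2020
    time=18:26
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show
"""

# Only these properties are kept, so stray lines such as "status=0" are ignored.
WANTED = {"number", "severity", "date", "time", "description"}


def parse_ilo_records(text):
    """Return the records as dicts, sorted oldest first (hypothetical helper)."""
    records = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"/system1/log1/record\d+", stripped):
            current = {}
            records.append(current)
        elif current is not None and "=" in stripped:
            key, _, value = stripped.partition("=")
            if key in WANTED:
                current[key] = value
    for rec in records:
        # The log prints dates as MM/DD/YYYY and times as HH:MM.
        rec["when"] = datetime.strptime(
            rec["date"] + " " + rec["time"], "%m/%d/%Y %H:%M"
        )
    return sorted(records, key=lambda r: r["when"])


timeline = parse_ilo_records(ILO_LOG)
for rec in timeline:
    print(rec["when"].strftime("%H:%M"), rec["severity"], rec["description"])
```

Sorted this way, the battery failure (record 14) comes first, followed by the ASR event (record 15) and the POST error on the next boot (record 16).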

And the BBU is gone:

root@db1085:~#  hpssacli controller all show detail | grep -i Battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
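
The same check can be scripted against captured output. This is a rough sketch under stated assumptions: it only parses the two lines shown above rather than invoking `hpssacli`, the helper name `parse_controller_fields` is hypothetical, and the reading that a zero battery count combined with `No-Battery Write Cache: Disabled` leaves the controller write cache unused is an assumption, not something confirmed in this task.

```python
# Output copied from the grep above.
SAMPLE = """\
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
"""


def parse_controller_fields(output):
    """Split 'Key: Value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


fields = parse_controller_fields(SAMPLE)

# The controller no longer reports any battery or capacitor.
bbu_missing = int(fields["Battery/Capacitor Count"]) == 0

# Assumption: without a battery, write caching is only possible if the
# no-battery write cache has been explicitly enabled.
write_cache_likely_unused = (
    bbu_missing and fields["No-Battery Write Cache"] == "Disabled"
)

print("BBU missing:", bbu_missing)
print("Write cache likely unused:", write_cache_likely_unused)
```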

Change 614590 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1085: Disable notifications

https://gerrit.wikimedia.org/r/614590

Change 614590 merged by Marostegui:
[operations/puppet@production] db1085: Disable notifications

https://gerrit.wikimedia.org/r/614590

Mentioned in SAL (#wikimedia-operations) [2020-07-19T19:16:10Z] <marostegui> Upgrade and reboot db1085 T258360

Host upgraded and rebooted.
MySQL looks OK and replication has started.

Change 615165 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1085: Enable notifications

https://gerrit.wikimedia.org/r/615165

Change 615165 merged by Marostegui:
[operations/puppet@production] db1085: Enable notifications

https://gerrit.wikimedia.org/r/615165

Mentioned in SAL (#wikimedia-operations) [2020-07-21T10:45:46Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1085 T258360', diff saved to https://phabricator.wikimedia.org/P11985 and previous config saved to /var/cache/conftool/dbconfig/20200721-104546-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-21T10:58:52Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1085 T258360', diff saved to https://phabricator.wikimedia.org/P11986 and previous config saved to /var/cache/conftool/dbconfig/20200721-105852-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-21T11:08:55Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db1085 T258360', diff saved to https://phabricator.wikimedia.org/P11987 and previous config saved to /var/cache/conftool/dbconfig/20200721-110854-marostegui.json

Marostegui claimed this task.

I have fully repooled this host.
It doesn't have a BBU, but s6 doesn't really have much load, so it will probably be able to keep up with replication without issues.
The next follow-up will be done in T258386 and/or T258361.