
db1084 crashed due to BBU failure
Closed, Resolved · Public

Description

BBU broke: T245621#5897114

Details

Related Gerrit Patches:
[operations/puppet@production] db1084: Disable notifications (https://gerrit.wikimedia.org/r/573288)
[operations/puppet@production] db1084: Disable notifications (https://gerrit.wikimedia.org/r/574923)

Event Timeline

jcrespo created this task. Feb 19 2020, 1:35 PM
Restricted Application added a subscriber: Aklapper. Feb 19 2020, 1:35 PM

Looks like BBU died:

Battery/Capacitor Count: 0
/system1/log1/record15
  Targets
  Properties
    number=15
    severity=Caution
    date=02/19/2020
    time=13:19
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

/system1/log1/record17
  Targets
  Properties
    number=17
    severity=Caution
    date=02/19/2020
    time=13:32
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.

@wiki_willy do we have spare HP BBUs in eqiad?
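
For anyone triaging a similar crash, the iLO event log records quoted above follow a simple `key=value` layout that is easy to scan programmatically. The following is a minimal sketch, not a tool used on this task: the function names `parse_records` and `battery_failures` are hypothetical, and it assumes the log text has already been captured to a string in exactly the format shown in the comment.

```python
# Sketch only: parses iLO event log text in the format quoted in this task.
# parse_records and battery_failures are hypothetical helper names.

SAMPLE = """\
/system1/log1/record15
  Targets
  Properties
    number=15
    severity=Caution
    date=02/19/2020
    time=13:19
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

/system1/log1/record17
  Targets
  Properties
    number=17
    severity=Caution
    date=02/19/2020
    time=13:32
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
"""


def parse_records(text):
    """Return one dict per log record, keyed by the property names."""
    records = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("/system1/log1/record"):
            # A record path starts a new record.
            current = {"path": stripped}
            records.append(current)
        elif current is not None and "=" in stripped:
            # Split on the first "=" only, so values may contain "=".
            key, _, value = stripped.partition("=")
            current[key] = value
    return records


def battery_failures(records):
    """Keep records whose description mentions a battery failure."""
    matches = []
    for record in records:
        description = record.get("description", "").lower()
        if "battery" in description and "failure" in description:
            matches.append(record)
    return matches


if __name__ == "__main__":
    for record in battery_failures(parse_records(SAMPLE)):
        print(record["date"], record["time"], record["description"])
```

Both records quoted in this task match the filter, since each description mentions the battery and a failure.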

Marostegui renamed this task from db1084 reboot causing commonswiki connection errors (crash?) to db1084 crashed due to BBU failure. Feb 19 2020, 1:57 PM
Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Change 573288 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1084: Disable notifications

https://gerrit.wikimedia.org/r/573288

Change 573288 merged by Marostegui:
[operations/puppet@production] db1084: Disable notifications

https://gerrit.wikimedia.org/r/573288

Mentioned in SAL (#wikimedia-operations) [2020-02-19T14:02:43Z] <marostegui> Start mysql on db1084 without replication - T245621

Mentioned in SAL (#wikimedia-operations) [2020-02-19T14:07:19Z] <marostegui> Upgrade and reboot db1084 - T245621

Mentioned in SAL (#wikimedia-operations) [2020-02-19T14:29:31Z] <marostegui> Data checksum on db1084 T245621

@Marostegui - we have a few spare BBUs in the process of being shipped onsite, one of them for T244958, which should be arriving early next week. You can just open a dc-ops task with us, and we can have it taken care of. Thanks, Willy

Data checksum has finished without issues, so I am going to slowly repool this host so it can at least serve some traffic.

Mentioned in SAL (#wikimedia-operations) [2020-02-20T06:24:46Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1084 after crash - T245621', diff saved to https://phabricator.wikimedia.org/P10466 and previous config saved to /var/cache/conftool/dbconfig/20200220-062445-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-20T09:12:33Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1084 after crash - T245621', diff saved to https://phabricator.wikimedia.org/P10467 and previous config saved to /var/cache/conftool/dbconfig/20200220-091233-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-20T10:51:18Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1084 after crash - T245621', diff saved to https://phabricator.wikimedia.org/P10468 and previous config saved to /var/cache/conftool/dbconfig/20200220-105117-marostegui.json

Change 574923 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1084: Disable notifications

https://gerrit.wikimedia.org/r/574923

Change 574923 merged by Marostegui:
[operations/puppet@production] db1084: Disable notifications

https://gerrit.wikimedia.org/r/574923

Mentioned in SAL (#wikimedia-operations) [2020-02-27T11:45:44Z] <jynus@cumin1001> dbctl commit (dc=all): 'Repool db1084 at 10% T245621', diff saved to https://phabricator.wikimedia.org/P10538 and previous config saved to /var/cache/conftool/dbconfig/20200227-114542-jynus.json

Mentioned in SAL (#wikimedia-operations) [2020-02-27T15:03:03Z] <jynus@cumin1001> dbctl commit (dc=all): 'Repool db1084 at 50% T245621', diff saved to https://phabricator.wikimedia.org/P10542 and previous config saved to /var/cache/conftool/dbconfig/20200227-150302-jynus.json

I will let @Marostegui put it back to 100%, do the full revert and finishing touches, and resolve the task.

Mentioned in SAL (#wikimedia-operations) [2020-02-28T06:25:37Z] <marostegui@cumin1001> dbctl commit (dc=all): '75% of original weight to db1084 - T245621', diff saved to https://phabricator.wikimedia.org/P10549 and previous config saved to /var/cache/conftool/dbconfig/20200228-062536-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-28T06:40:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db1084 - T245621', diff saved to https://phabricator.wikimedia.org/P10550 and previous config saved to /var/cache/conftool/dbconfig/20200228-064037-marostegui.json
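
The final repool recorded above went in stages: 10%, 50%, 75% of the original weight, then fully repooled. As a rough illustration of that arithmetic only, here is a minimal sketch that computes the weight for each stage. The function name `staged_weights` and the original weight of 400 are hypothetical; the host's real weight is not stated in this task, and the sketch does not call dbctl.

```python
# Sketch only: computes per-stage weights for a gradual repool.
# staged_weights and the example weight are hypothetical; the percentages
# mirror the 10% / 50% / 75% / full steps logged in this task.

def staged_weights(original_weight, percentages=(10, 50, 75, 100)):
    """Return the weight to apply at each stage, never below 1."""
    weights = []
    for percentage in percentages:
        weight = round(original_weight * percentage / 100)
        # Keep at least weight 1 so a pooled host still receives traffic.
        weights.append(max(1, weight))
    return weights


if __name__ == "__main__":
    # 400 is an assumed original weight, used only for illustration.
    for percentage, weight in zip((10, 50, 75, 100), staged_weights(400)):
        print(f"{percentage}% of original weight -> {weight}")
```

Stepping the weight up in stages lets a host with cold caches warm up before it takes its full share of traffic.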

Marostegui closed this task as Resolved. Feb 28 2020, 6:41 AM

Host fully repooled

Thanks everyone!