
db1093 (s6 candidate master) went down - broken BBU
Closed, Resolved · Public

Description

db1093 went down today:

19:24 <+icinga-wm> PROBLEM - Host db1093 is DOWN: PING CRITICAL - Packet loss = 100%

Later it came back up:

21:44 <+icinga-wm> RECOVERY - Host db1093 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms

At this point we got paged about it, because:

21:46 <+icinga-wm> PROBLEM - mysqld processes on db1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting

We figured it is normal and desired that MySQL does not come back up automatically after a reboot, but that we are still supposed to depool the host.

We followed the docs at https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_slave

and made https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/507237,

which tgr kindly deployed.
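
For reference, a minimal sketch of what the depool deployment looks like, assuming the standard mediawiki-config workflow (the paths and the sync message below are illustrative, not copied from the actual deploy):

# On the deployment host, once the Gerrit change lowering/removing db1093's
# weight in db-eqiad.php has been merged and pulled into the staging copy:
cd /srv/mediawiki-staging
git log -1 --oneline wmf-config/db-eqiad.php    # confirm the depool change is present
scap sync-file wmf-config/db-eqiad.php 'Depool db1093 T222127'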

RAID check output:

CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0

Event Timeline

Also, about a minute later:

22:10 <+icinga-wm> PROBLEM - HP RAID on db1093 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0

which automatically created T222128

22:10 < mutante> and that RAID issue is why it went down.. i think
22:10 < mutante> there were lots of alerts in SOFT state earlier
22:11 < mutante> now it finally went from SOFT to HARD

Change 507241 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1093: Disable notifications

https://gerrit.wikimedia.org/r/507241

Marostegui updated the task description.
Marostegui added a subscriber: Tgr.

Thanks a lot @Dzahn and @Tgr for taking care of this - we will take it from here
Much appreciated

Change 507241 merged by Marostegui:
[operations/puppet@production] db1093: Disable notifications

https://gerrit.wikimedia.org/r/507241

Marostegui renamed this task from "db1093 went down - depooled" to "db1093 (s6 candidate master) went down - broken BBU". Apr 30 2019, 5:16 AM
Marostegui assigned this task to Cmjohnson.
Marostegui added a subscriber: Cmjohnson.

The BBU looks broken:

/system1/log1/record13
  Targets
  Properties
    number=13
    severity=Caution
    date=04/29/2019
    time=23:19
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Critical
    date=04/30/2019
    time=01:41
    description=ASR Detected by System ROM
  Verbs
    cd version exit show

/system1/log1/record15
  Targets
  Properties
    number=15
    severity=Caution
    date=04/30/2019
    time=01:42
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
  Verbs
    cd version exit show
root@db1093:~# hpssacli controller all show detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0

@Cmjohnson can we contact HPE to get a new BBU?
Also, upon installation of the new BBU, we should make sure we are running the latest BIOS and firmware versions.
Triaging this as High priority because this host is the s6 candidate master.
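
As a rough sketch of how that could be verified after the swap (assuming the same HPE tooling already used above; the exact output fields may vary by firmware):

root@db1093:~# dmidecode -s bios-version
root@db1093:~# dmidecode -s bios-release-date
root@db1093:~# hpssacli controller all show detail | grep -i firmware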

Change 507242 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Clarify db1093 status

https://gerrit.wikimedia.org/r/507242

Change 507242 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Clarify db1093 status

https://gerrit.wikimedia.org/r/507242

I have started MySQL and it came up correctly.
As it started fine, I have also started replication; once it has caught up, I will run a data check against another host to make sure the data isn't corrupted.
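
Roughly, the replication restart and catch-up check look like this (a sketch of the plain MariaDB commands; in practice this may be wrapped in local tooling):

root@db1093:~# mysql -e "START SLAVE;"
root@db1093:~# mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'

Once Slave_IO_Running and Slave_SQL_Running are both Yes and Seconds_Behind_Master has dropped to 0, the host has caught up.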

For what it's worth, the load balancer looks like it handled this fine.
The timeline is:

23:24: db1093 goes down
23:24-23:30: Spike of errors and then some residual ones from jobrunner mostly: https://logstash.wikimedia.org/goto/0afd44275cc38db84317bfb02257ebe0
23:30: no more errors
02:09: db1093 is depooled

The following tables have been checked against multiple hosts and reported no differences:

archive
logging
page
revision
text
user
change_tag
actor
ipblocks
comment
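
A minimal sketch of that kind of per-table comparison between db1093 and another s6 replica (the real check almost certainly used WMF's own comparison tooling and compared rows in chunks; the peer host db1088 and the frwiki database here are purely illustrative placeholders):

for t in archive logging page revision text user change_tag actor ipblocks comment; do
  a=$(mysql -h db1093 -BN -e "CHECKSUM TABLE $t" frwiki)
  b=$(mysql -h db1088 -BN -e "CHECKSUM TABLE $t" frwiki)
  [ "$a" = "$b" ] && echo "$t: OK" || echo "$t: MISMATCH"
done

Note that a full-table CHECKSUM TABLE is slow on large tables such as revision, which is why production checks are normally chunked.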

So I am going to repool this host with a low weight: as it doesn't have a BBU it shouldn't take full traffic, but this way it at least receives some queries.

Change 507263 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Give some traffic to db1093

https://gerrit.wikimedia.org/r/507263

Change 507264 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1093: Enable notifications

https://gerrit.wikimedia.org/r/507264

Change 507264 merged by Marostegui:
[operations/puppet@production] db1093: Enable notifications

https://gerrit.wikimedia.org/r/507264

Change 507263 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Give some traffic to db1093

https://gerrit.wikimedia.org/r/507263

Change 508167 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Give some API weight to db1093

https://gerrit.wikimedia.org/r/508167

Change 508167 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Give some API weight to db1093

https://gerrit.wikimedia.org/r/508167

This host is only under warranty until later this month, and this was brought up in the SRE weekly meeting. This needs to be high priority! According to Netbox the warranty runs until May 24th, but please treat this as a top action item!

I opened a support case for this with HPE.

Case ID: 5338390467
Case title: Failed BBU
Severity: 3-Normal
Product serial number: MXQ616071T
Product number: 755258-B21
Submitted: 5/6/2019 12:30:17 PM
Last updated: 5/6/2019 12:30:17 PM
Source: Web
Case status: Received by HPE

Change 508574 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depoool db1093

https://gerrit.wikimedia.org/r/508574

Change 508574 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depoool db1093

https://gerrit.wikimedia.org/r/508574

Mentioned in SAL (#wikimedia-operations) [2019-05-07T13:45:26Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1093 for BBU replacement T222127 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2019-05-07T13:45:32Z] <marostegui> Stop MySQL and poweroff db1093 for BBU replacement - T222127

Chris has replaced the BBU and I can already see it:

root@db1093:~#  hpssacli controller all show detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK

I have started MySQL but won't repool the host until tomorrow, to give it some time and make sure it is all fine.
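
Before repooling, it is also worth confirming that the controller has picked up the new battery and re-enabled the battery-backed write cache, along the lines of (same tool as above; the exact cache field names depend on the controller):

root@db1093:~# hpssacli controller all show detail | grep -iE 'battery|cache'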

Change 508780 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1093

https://gerrit.wikimedia.org/r/508780

Change 508780 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1093

https://gerrit.wikimedia.org/r/508780

This host has been fully repooled.
Thanks @Cmjohnson for replacing the BBU.