db1091 crashed
Closed, Resolved · Public

Description

db1091 crashed

[09:08:59]  <+icinga-wm>	PROBLEM - Host db1091 is DOWN: PING CRITICAL - Packet loss = 100%

This is what we have in HW logs:

/system1/log1/record10
  Targets
  Properties
    number=10
    severity=Caution
    date=06/05/2019
    time=07:06
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support

/system1/log1/record11
  Targets
  Properties
    number=11
    severity=Critical
    date=06/05/2019
    time=07:17
    description=ASR Detected by System ROM

/system1/log1/record12
  Targets
  Properties
    number=12
    severity=Caution
    date=06/05/2019
    time=07:18
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.

Event Timeline

Marostegui added a subscriber: jcrespo.

BBU broke

Battery/Capacitor Count: 0

@Cmjohnson Can we give this host some priority? I wouldn't want to have it down for the whole offsite week.
I believe its support just expired, so we might not be able to get a replacement for the BBU, if it is really broken, but can we maybe upgrade its firmware/BIOS? Do you happen to have a spare BBU around the DC?
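The "Battery/Capacitor Count: 0" symptom above comes from the controller detail output (e.g. from HPE's `ssacli ctrl all show detail`). A minimal sketch of how those lines can be parsed to flag a missing BBU — the `bbu_status` helper is illustrative, not a tool we actually run:

```python
# Illustrative parser for the Battery/Capacitor lines of HPE controller
# detail output (as pasted in this task). Assumes the "key: value" line
# format shown above; bbu_status is a hypothetical helper name.

def bbu_status(ctrl_detail: str) -> dict:
    """Extract Battery/Capacitor fields from controller detail output."""
    fields = {}
    for line in ctrl_detail.splitlines():
        line = line.strip()
        if line.startswith("Battery/Capacitor"):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

# The output we saw after the crash: the controller no longer detects
# any backup battery at all.
broken = "Battery/Capacitor Count: 0"

status = bbu_status(broken)
print(status["Battery/Capacitor Count"])  # → 0
```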

@jcrespo I am going to temporarily place db1135 (T222682) into s4 to replace this host until we have found a solution

Change 514433 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Temporary: place db1135 into s4

https://gerrit.wikimedia.org/r/514433

Change 514433 merged by Marostegui:
[operations/puppet@production] mariadb: Temporary: place db1135 into s4

https://gerrit.wikimedia.org/r/514433

Mentioned in SAL (#wikimedia-operations) [2019-06-05T07:45:36Z] <marostegui> Transfer dbprov1001.eqiad.wmnet:snapshot.s4.2019-06-04--21-37-03.tar.gz to db1135 to provision it on s4 T225060

Change 514436 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1091: Disable notifications

https://gerrit.wikimedia.org/r/514436

Change 514436 merged by Marostegui:
[operations/puppet@production] db1091: Disable notifications

https://gerrit.wikimedia.org/r/514436

Change 514439 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Provision db1135 into s4

https://gerrit.wikimedia.org/r/514439

Change 514439 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Provision db1135 into s4

https://gerrit.wikimedia.org/r/514439

Mentioned in SAL (#wikimedia-operations) [2019-06-05T09:25:16Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Pool without traffic db1135 into s4 T225060 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2019-06-05T09:26:16Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Pool without traffic db1135 into s4 T225060 (duration: 00m 55s)

Change 514450 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1135: Enable notifications

https://gerrit.wikimedia.org/r/514450

Change 514450 merged by Marostegui:
[operations/puppet@production] db1135: Enable notifications

https://gerrit.wikimedia.org/r/514450

Mentioned in SAL (#wikimedia-operations) [2019-06-05T14:24:40Z] <marostegui> Poweroff db1091 for BBU replacement - T225060

Good afternoon! db1091... I do have a spare BBU, but that spare has been helpful over the last year or so. HP is slow to send out the batteries; they can take days to arrive because of their slow response time and then having to ship batteries via ground transportation only. If I use it for this server, then I am not able to quickly change out the BBU on something that may be more important in the future. The call is yours, since you have the most BBU issues.

10:22 <marostegui> cmjohnson1: you have a spare BBU??

10:22 <cmjohnson1> i do but see above

10:23 <marostegui> cmjohnson1: Yeah, I see, I think we do need it for this host, as it is one of the ones that supports most of the weight in s4 (commonswiki), which is one of the biggest wikis

10:23 <marostegui> we might get 2 extra hosts at the end of q1 if analytics are able to free them up, but for now I think we do need db1091 in service

10:23 <cmjohnson1> okay, works for me I will get to it today... can you leave it down?

10:24 <marostegui> I will power it off for you, yep

10:24 <marostegui> db1091 is now powered off, thank you so much

The BBU has been replaced.

Thank you so much @Cmjohnson
I can see the battery now:

Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK

Next steps I will take:

  • Start MySQL and let it replicate
  • Once replication is in sync, I will run a data check
  • Repool db1091 if data is good
  • I will leave db1135 pooled in s4 for the next week; it doesn't hurt
  • After the summit, I will send db1135 back to its originally planned place

Thanks!
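The "once replication is in sync" gate in the steps above boils down to watching SHOW SLAVE STATUS until the replica has caught up. A hedged sketch of that decision — in practice the query runs via the mysql client or a driver, and `replica_in_sync` is just an illustrative helper encoding the check:

```python
# Illustrative "is the replica caught up?" check over a SHOW SLAVE STATUS
# result represented as a dict. Field names (Slave_IO_Running,
# Slave_SQL_Running, Seconds_Behind_Master) are standard MySQL/MariaDB
# replication status columns; the helper itself is a sketch.

def replica_in_sync(slave_status: dict, max_lag: int = 0) -> bool:
    """True when both replication threads run and lag is within max_lag seconds."""
    if slave_status.get("Slave_IO_Running") != "Yes":
        return False
    if slave_status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = slave_status.get("Seconds_Behind_Master")
    return lag is not None and lag <= max_lag

# Example: threads running but still 30 minutes behind, so don't
# proceed to the data check yet.
print(replica_in_sync({"Slave_IO_Running": "Yes",
                       "Slave_SQL_Running": "Yes",
                       "Seconds_Behind_Master": 1800}))  # → False
```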

Mentioned in SAL (#wikimedia-operations) [2019-06-05T17:32:16Z] <marostegui> Start MySQL with replication stopped on db1091 - T225060

Mentioned in SAL (#wikimedia-operations) [2019-06-05T17:36:56Z] <marostegui> Start replication db1091 - T225060

Mentioned in SAL (#wikimedia-operations) [2019-06-05T19:48:18Z] <marostegui> Check data consistency on db1091 against db1135 - T225060

So, the data is consistent on the main tables:

archive
logging
page
revision
text
user
change_tag
actor
ipblocks
comment

Going to start repooling this host.
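A consistency check like the one above can be reduced to comparing per-table checksums between the two hosts (production checks typically use dedicated tooling; this is only the comparison step). A minimal sketch, with made-up placeholder checksums rather than real ones:

```python
# Illustrative comparison of per-table checksums (e.g. as produced by
# CHECKSUM TABLE on each host). The diff_tables helper and the checksum
# values below are hypothetical.

def diff_tables(host_a: dict, host_b: dict) -> list:
    """Return table names whose checksums differ (or exist on one side only)."""
    tables = set(host_a) | set(host_b)
    return sorted(t for t in tables if host_a.get(t) != host_b.get(t))

db1091 = {"page": 111, "revision": 222, "actor": 333}
db1135 = {"page": 111, "revision": 222, "actor": 333}

# An empty diff means the checked tables match and the host is safe
# to repool.
print(diff_tables(db1091, db1135))  # → []
```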

Change 514643 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1091: Enable notifications

https://gerrit.wikimedia.org/r/514643

Change 514643 merged by Marostegui:
[operations/puppet@production] db1091: Enable notifications

https://gerrit.wikimedia.org/r/514643

Mentioned in SAL (#wikimedia-operations) [2019-06-06T05:09:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly repool db1091 after getting its BBU replaced T225060 (duration: 00m 56s)

Change 514651 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1091

https://gerrit.wikimedia.org/r/514651

Change 514651 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1091

https://gerrit.wikimedia.org/r/514651

db1091 is fully repooled.
I will remove db1135 from s4 after the SRE summit

CDanis added a subscriber: CDanis.

db1091 had another hardware failure at about 01:11 UTC.

Got a bunch of errors on sd 0:1:0:0 / sda, a bunch of SCSI commands failing with hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK, then mysqld crashed with SIGBUS.

The root filesystem is mounted read-only and maybe damaged -- trying to SSH in and start a shell returns an input/output error.

For now we've depooled and it is downtimed until 17:26 UTC Monday. More investigation to follow in business hours.
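One quick way to confirm the "root filesystem is mounted read-only" observation is to inspect the mount options in /proc/mounts. A hedged sketch that parses a captured line rather than reading the live file (the `root_is_readonly` helper is illustrative):

```python
# Illustrative check for a read-only root mount, based on the standard
# /proc/mounts format: device, mountpoint, fstype, options, dump, pass.
# The sample line is hypothetical, modeled on the failure described above.

def root_is_readonly(proc_mounts: str) -> bool:
    """True if the '/' mount carries the 'ro' option."""
    for line in proc_mounts.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[1] == "/":
            return "ro" in parts[3].split(",")
    return False

sample = "/dev/sda1 / ext4 ro,relatime,errors=remount-ro 0 0"
print(root_is_readonly(sample))  # → True
```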

/system1/log1/record21
  Targets
  Properties
    number=21
    severity=Critical
    date=11/01/2020
    time=01:07
    description=Drive Array Controller Failure (Slot 1)

Change 637857 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1091: Disable notifications

https://gerrit.wikimedia.org/r/637857

Change 637857 merged by Marostegui:
[operations/puppet@production] db1091: Disable notifications

https://gerrit.wikimedia.org/r/637857

Thanks everyone who responded to this incident.
Looks like we'd need another disk for this host. @wiki_willy do we have some spares?
This host is scheduled for replacement with the new hosts that are scheduled for procurement at T264584, but if we still have some used disks somewhere, we could use them.
Otherwise let's just decommission it.

I have disabled notifications via puppet (and on Icinga manually, as puppet won't run on the host - to be consistent) as it will take a few days to either decom or reimage this one.

Hi @Marostegui - @Jclark-ctr is in charge of gathering up all the decom'd hardware for recycling, so we can have him check this week for any spare drives lying around. We should create a new task for this though, since this is a different hardware issue than the BBU issue that was resolved in June. Thanks, Willy