db1092 crashed - BBU broken
Closed, Resolved (Public)

Description

db1092 froze while a heavy ALTER was being replicated from the master.

This is what the hardware (iLO) logs show:

</system1/log1>hpiLO-> show record14

status=0
status_tag=COMMAND COMPLETED
Wed Sep 26 07:44:43 2018



/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Caution
    date=09/26/2018
    time=06:05
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show
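
For anyone who wants to check such records from a script rather than by eye, here is a minimal sketch that parses the `key=value` property lines of the record above and flags a battery-related entry. The function names are hypothetical, and the rule "severity is Caution or Critical and the description mentions a battery" is an assumption made for illustration, not an HPE-defined check.

```python
def parse_ilo_record(text):
    """Collect 'key=value' property lines from a pasted iLO 'show record' dump."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Property lines look like 'severity=Caution'; headers have no '='.
        if "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def is_battery_failure(props):
    """Assumed heuristic: elevated severity plus 'battery' in the description."""
    severity = props.get("severity", "").lower()
    description = props.get("description", "").lower()
    return severity in {"caution", "critical"} and "battery" in description


# Sample copied from the record shown in this task.
SAMPLE = """\
/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Caution
    date=09/26/2018
    time=06:05
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show
"""

record = parse_ilo_record(SAMPLE)
print(record["date"], record["time"], is_battery_failure(record))
```

The parser splits on the first `=` only, so a description that itself contains `=` would still be kept intact.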

Update:
@Cmjohnson: Jaime kindly downloaded the logs for Support so you have them at: T205514#4618246

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2018-09-26T07:50:51Z] <marostegui> Hard reset db1092, server crashed - T205514

On reboot:

313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400.
Action: Restart system. Contact HPE support if condition persists.

Change 462872 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1092

https://gerrit.wikimedia.org/r/462872

Change 462872 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1092

https://gerrit.wikimedia.org/r/462872

Mentioned in SAL (#wikimedia-operations) [2018-09-26T08:04:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1092, server crashed - T205514 (duration: 00m 56s)

Change 462873 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Disable notifications on db1092, crashed

https://gerrit.wikimedia.org/r/462873

Marostegui triaged this task as Medium priority.
Marostegui added a project: ops-eqiad.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a subscriber: Cmjohnson.

@Cmjohnson it looks like we need a new BBU. This host is under warranty; can you talk to HP and see if we can get a new BBU before 10th Oct (the scheduled date for the DC failover)?
Thanks!

Change 462873 merged by Jcrespo:
[operations/puppet@production] mariadb: Disable notifications on db1092, crashed

https://gerrit.wikimedia.org/r/462873

Marostegui renamed this task from db1092 crashed to db1092 crashed - BBU broken. (Sep 26 2018, 8:19 AM)
Marostegui added a subscriber: ops-monitoring-bot.

A support ticket has been submitted to HPE.

Case ID: 5332806955

Banyek added a subscriber: Banyek.

I will reclone this database instance.

The donor host for the recloning will be db1104, and I will update that host first.

Mentioned in SAL (#wikimedia-operations) [2018-09-27T08:30:16Z] <banyek> upgrading db1104 (kernel-mariadb) and rebooting it (T205514)

Mentioned in SAL (#wikimedia-operations) [2018-09-27T08:56:41Z] <banyek> stopping replication & mariadb on db1104 and db1092 as db1092 is getting recloned from db1104 (T205514)

Change 463219 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/mediawiki-config@master] mariadb: depool db1104

https://gerrit.wikimedia.org/r/463219

Change 463219 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: depool db1104

https://gerrit.wikimedia.org/r/463219

Mentioned in SAL (#wikimedia-operations) [2018-09-27T09:54:43Z] <banyek@deploy1001> Synchronized wmf-config/db-eqiad.php: T205514: depooling db1104, adding db1109 as temporary api host for s8 (duration: 00m 56s)

Recloning finished; the hosts are replicating again.

Mentioned in SAL (#wikimedia-operations) [2018-09-27T13:46:23Z] <banyek@deploy1001> Synchronized wmf-config/db-eqiad.php: T205514: revert: depooling db1104, adding db1109 as temporary api host for s8 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2018-09-27T13:49:54Z] <banyek@deploy1001> Synchronized wmf-config/db-eqiad.php: T205514: revert: depooling db1104, adding db1109 as temporary api host for s8 (duration: 00m 56s)

The AHS log required by HP has been uploaded to their dropbox. Waiting on their response.

Change 464753 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Clarify db1092 status

https://gerrit.wikimedia.org/r/464753

Change 464753 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Clarify db1092 status

https://gerrit.wikimedia.org/r/464753

Mentioned in SAL (#wikimedia-operations) [2018-10-05T05:13:41Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Clarify db1092 status - T205514 (duration: 00m 57s)

Change 465120 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1092 with low weight

https://gerrit.wikimedia.org/r/465120

Change 465120 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1092 with low weight

https://gerrit.wikimedia.org/r/465120

Mentioned in SAL (#wikimedia-operations) [2018-10-08T07:20:33Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1092 with low weight - T205514 (duration: 01m 27s)

The battery was sent to our old office address in San Francisco. They are shipping a new battery, but because it is a battery it has to go by ground, which will take 3-5 days.

Thanks for the update Chris - unbelievable!

Change 465959 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Increase weight for db1092

https://gerrit.wikimedia.org/r/465959

Change 465959 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Increase weight for db1092

https://gerrit.wikimedia.org/r/465959

Mentioned in SAL (#wikimedia-operations) [2018-10-15T15:39:03Z] <marostegui> Stop MySQL and poweroff db1092 for BBU replacement - T205514

Battery replaced by Chris - thank you!:

Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
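
A check like the one above can also be scripted. The following is a minimal sketch that parses `Key: Value` lines such as the two shown and reports whether the battery looks healthy. The function names are hypothetical, and treating "count of at least 1 and status exactly `OK`" as healthy is an assumption based only on the output pasted in this task.

```python
def parse_status_lines(text):
    """Turn 'Key: Value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def battery_ok(fields):
    """Assumed rule: at least one battery/capacitor present and status 'OK'."""
    count = int(fields.get("Battery/Capacitor Count", "0"))
    status = fields.get("Battery/Capacitor Status", "")
    return count >= 1 and status == "OK"


# Sample copied from the output pasted after the replacement.
SAMPLE = """\
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
"""

print(battery_ok(parse_status_lines(SAMPLE)))
```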