
db1091 crashed
Closed, Resolved · Public

Description

db1091 crashed

[09:08:59]  <+icinga-wm>	PROBLEM - Host db1091 is DOWN: PING CRITICAL - Packet loss = 100%

This is what we have in HW logs:

/system1/log1/record10
  Targets
  Properties
    number=10
    severity=Caution
    date=06/05/2019
    time=07:06
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support

/system1/log1/record11
  Targets
  Properties
    number=11
    severity=Critical
    date=06/05/2019
    time=07:17
    description=ASR Detected by System ROM

/system1/log1/record12
  Targets
  Properties
    number=12
    severity=Caution
    date=06/05/2019
    time=07:18
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
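Records in this format can be scanned mechanically when triaging a crash. The following is a minimal sketch, assuming the log has been captured as plain text in the layout quoted above; `parse_records` and `critical_records` are hypothetical helper names, not part of any HPE or Wikimedia tooling.

```python
# Sketch only: parse_records and critical_records are hypothetical helpers.
# The sample text is copied from records 10 and 11 in this task.

SAMPLE_LOG = """\
/system1/log1/record10
  Targets
  Properties
    number=10
    severity=Caution
    date=06/05/2019
    time=07:06
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support

/system1/log1/record11
  Targets
  Properties
    number=11
    severity=Critical
    date=06/05/2019
    time=07:17
    description=ASR Detected by System ROM
"""


def parse_records(text):
    """Split the log text into one dict per '/system1/log1/recordN' block."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/system1/log1/record"):
            # A new record starts; remember its path and collect its fields.
            current = {"path": line}
            records.append(current)
        elif current is not None and "=" in line:
            # Split on the first '=' only, since descriptions may contain more.
            key, _, value = line.partition("=")
            current[key] = value
    return records


def critical_records(records):
    """Return only the records whose severity is exactly 'Critical'."""
    return [r for r in records if r.get("severity") == "Critical"]


if __name__ == "__main__":
    for rec in critical_records(parse_records(SAMPLE_LOG)):
        print(rec["date"], rec["time"], rec["description"])
```

Filtering on `severity` is a deliberate simplification: here the `Caution` battery record at 07:06 precedes the `Critical` ASR record at 07:17, so in practice the lower-severity entries are worth reading too.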

Event Timeline

Restricted Application added a project: Operations. Jun 5 2019, 7:16 AM
Marostegui triaged this task as High priority. Jun 5 2019, 7:22 AM
Marostegui added a subscriber: jcrespo.

BBU broke

Battery/Capacitor Count: 0

@Cmjohnson Can we give this host some priority? I wouldn't want to have it down for the whole offsite week.
I believe its support just expired, so we might not be able to get a replacement for the BBU, if it is really broken, but can we maybe upgrade its firmware/BIOS? Do you happen to have a spare BBU around the DC?

@jcrespo I am going to place db1135 temporarily (T222682) to replace this host until we have found a solution

Marostegui moved this task from Triage to In progress on the DBA board. Jun 5 2019, 7:26 AM

Change 514433 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Temporary: place db1135 into s4

https://gerrit.wikimedia.org/r/514433

Change 514433 merged by Marostegui:
[operations/puppet@production] mariadb: Temporary: place db1135 into s4

https://gerrit.wikimedia.org/r/514433

Mentioned in SAL (#wikimedia-operations) [2019-06-05T07:45:36Z] <marostegui> Transfer dbprov1001.eqiad.wmnet:snapshot.s4.2019-06-04--21-37-03.tar.gz to db1135 to provision it on s4 T225060

Change 514436 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1091: Disable notifications

https://gerrit.wikimedia.org/r/514436

Change 514436 merged by Marostegui:
[operations/puppet@production] db1091: Disable notifications

https://gerrit.wikimedia.org/r/514436

Marostegui updated the task description. Jun 5 2019, 8:11 AM

Mentioned in SAL (#wikimedia-operations) [2019-06-05T08:12:11Z] <marostegui> Reboot db1091 T225060

Change 514439 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Provision db1135 into s4

https://gerrit.wikimedia.org/r/514439

Change 514439 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Provision db1135 into s4

https://gerrit.wikimedia.org/r/514439

Mentioned in SAL (#wikimedia-operations) [2019-06-05T09:25:16Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Pool without traffic db1135 into s4 T225060 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2019-06-05T09:26:16Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Pool without traffic db1135 into s4 T225060 (duration: 00m 55s)

Change 514450 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1135: Enable notifications

https://gerrit.wikimedia.org/r/514450

Change 514450 merged by Marostegui:
[operations/puppet@production] db1135: Enable notifications

https://gerrit.wikimedia.org/r/514450

Mentioned in SAL (#wikimedia-operations) [2019-06-05T14:24:40Z] <marostegui> Poweroff db1091 for BBU replacement - T225060

Good afternoon! db1091... I do have a spare BBU, but that spare has been helpful over the last year or so. HP is slow to send out batteries: they can take days to arrive because of HP's slow response time and because batteries can only be shipped via ground transportation. If I use it for this server, then I will not be able to quickly change out the BBU on something that may be more important in the future. The call is yours since you have the most BBU issues.

10:22 <marostegui> Manuel Arostegui cmjohnson1: you have a spare BBU??

10:22 <cmjohnson1> Chris i do but see above

10:23 <marostegui> Manuel Arostegui cmjohnson1: Yeah, I see, I think we do need it for this host, as it is one of the ones that support most of the weight in s4 (commonswiki), which is one of the biggest wikis

10:23 cmjohnson1: we might get 2 extra hosts at the end of q1 if analytics are able to free them up, but for now I think we do need db1091 in service

10:23 <cmjohnson1> Chris okay, works for me I will get to it today...can you leave it down.

10:24 <marostegui> Manuel Arostegui I will power it off for you yep

10:24 cmjohnson1: db1091 is now powered off, thank you so much

Cmjohnson closed this task as Resolved. Jun 5 2019, 5:10 PM

The BBU has been replaced.

Thank you so much @Cmjohnson
I can see the battery now:

Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
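A check like this can be expressed as a small parser over the controller output. The following is a rough sketch, assuming the output is available as `key: value` text lines like those quoted in this task; `parse_battery_status` and `bbu_healthy` are hypothetical names, and the rule "count of at least 1 and status OK" is an assumption about what counts as healthy.

```python
# Sketch only: hypothetical helpers working on "key: value" text lines.

def parse_battery_status(output):
    """Turn 'Key: Value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def bbu_healthy(fields):
    """Assumed rule: at least one battery/capacitor present and status OK."""
    try:
        count = int(fields.get("Battery/Capacitor Count", "0"))
    except ValueError:
        return False
    return count >= 1 and fields.get("Battery/Capacitor Status") == "OK"


# Output quoted in this task before and after the replacement.
BEFORE = "Battery/Capacitor Count: 0"
AFTER = """\
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
"""

if __name__ == "__main__":
    print("before:", bbu_healthy(parse_battery_status(BEFORE)))
    print("after:", bbu_healthy(parse_battery_status(AFTER)))
```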

Next steps I will take:

  • Start MySQL and let it replicate
  • Once replication is in sync, I will run a data check
  • Repool db1091 if data is good
  • I will leave db1135 pooled in s4 for the next week, it doesn't hurt
  • After the summit, I will send back db1135 to its original planned place

Thanks!
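The "once replication is in sync" gate in the steps above could be sketched as follows. This assumes the replica status has already been fetched (for example from `SHOW SLAVE STATUS`) into a dict keyed by the standard MariaDB field names; `replication_in_sync` and the `max_lag_seconds` threshold are hypothetical, and the actual procedure used on db1091 is not described in this task.

```python
# Sketch only: replication_in_sync is a hypothetical helper. The dict keys
# are the field names reported by MariaDB's SHOW SLAVE STATUS.

def replication_in_sync(status, max_lag_seconds=0):
    """Return True if both replication threads run and lag is within bounds."""
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = status.get("Seconds_Behind_Master")
    # Seconds_Behind_Master is NULL (None here) when lag cannot be determined.
    if lag is None:
        return False
    return int(lag) <= max_lag_seconds


if __name__ == "__main__":
    catching_up = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": 5400,
    }
    caught_up = dict(catching_up, Seconds_Behind_Master=0)
    print(replication_in_sync(catching_up))
    print(replication_in_sync(caught_up))
```

Treating an unknown lag as "not in sync" is the conservative choice here, since the data check and repool should only proceed once the replica has demonstrably caught up.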

Mentioned in SAL (#wikimedia-operations) [2019-06-05T17:32:16Z] <marostegui> Start MySQL with replication stopped on db1091 - T225060

Mentioned in SAL (#wikimedia-operations) [2019-06-05T17:36:56Z] <marostegui> Start replication db1091 - T225060

Mentioned in SAL (#wikimedia-operations) [2019-06-05T19:48:18Z] <marostegui> Check data consistency on db1091 against db1135 - T225060

So, the data is consistent on the main tables:

archive
logging
page
revision
text
user
change_tag
actor
ipblocks
comment

Going to start repooling this host.
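For illustration, a per-table comparison between db1091 and db1135 could look like the sketch below. How the real check was run is not stated in this task; this assumes one checksum per table has already been collected from each host (for instance via `CHECKSUM TABLE`) into a dict, and `inconsistent_tables` is a hypothetical helper. The checksum values in the example are made up.

```python
# Sketch only: inconsistent_tables is a hypothetical helper and the checksum
# values below are invented for the example.

MAIN_TABLES = [
    "archive", "logging", "page", "revision", "text",
    "user", "change_tag", "actor", "ipblocks", "comment",
]


def inconsistent_tables(checksums_a, checksums_b, tables=MAIN_TABLES):
    """Return tables whose checksum is missing on either host or differs."""
    bad = []
    for table in tables:
        a = checksums_a.get(table)
        b = checksums_b.get(table)
        if a is None or b is None or a != b:
            bad.append(table)
    return bad


if __name__ == "__main__":
    db1091 = {t: 1000 + i for i, t in enumerate(MAIN_TABLES)}
    db1135 = dict(db1091)
    print(inconsistent_tables(db1091, db1135))  # identical inputs, no drift

    db1135["revision"] = 42  # simulate drift on one table
    print(inconsistent_tables(db1091, db1135))
```

A missing checksum is reported as inconsistent rather than skipped, so that a table that failed to be checked cannot be mistaken for a clean one.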

Change 514643 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1091: Enable notifications

https://gerrit.wikimedia.org/r/514643

Change 514643 merged by Marostegui:
[operations/puppet@production] db1091: Enable notifications

https://gerrit.wikimedia.org/r/514643

Mentioned in SAL (#wikimedia-operations) [2019-06-06T05:09:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly repool db1091 after getting its BBU replaced T225060 (duration: 00m 56s)

Change 514651 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1091

https://gerrit.wikimedia.org/r/514651

Change 514651 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1091

https://gerrit.wikimedia.org/r/514651

db1091 is fully repooled.
I will remove db1135 from s4 after the SRE summit