db1095 backup source crashed: broken BBU
Open, Medium, Public

Description

Times in UTC:

[07:46:29]  <+icinga-wm>	PROBLEM - Host db1095 is DOWN: PING CRITICAL - Packet loss = 100%

Needs investigation

Details

Event Timeline

Restricted Application added a subscriber: Aklapper. Wed, Feb 12, 7:52 AM

Looks storage related:

/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Caution
    date=02/12/2020
    time=07:43
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

Change 571661 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1095: Disable notifications

https://gerrit.wikimedia.org/r/571661

Change 571661 merged by Marostegui:
[operations/puppet@production] db1095: Disable notifications

https://gerrit.wikimedia.org/r/571661

It rebooted itself:

[07:57:01]  <+icinga-wm>	RECOVERY - Host db1095 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms

And the BBU is gone:

root@db1095:~# hpssacli  controller all show detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
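
The same check can be scripted. What follows is only a minimal sketch: it assumes the controller detail output consists of "key: value" lines like the two shown above, and the helper names parse_battery_fields and bbu_missing are hypothetical, not part of any existing tooling.

```python
def parse_battery_fields(detail_output: str) -> dict:
    """Collect the 'key: value' lines that mention a battery.

    Assumption: the text looks like the grep output quoted in this task.
    """
    fields = {}
    for line in detail_output.splitlines():
        if "battery" not in line.lower():
            continue
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a 'key: value' separator
            fields[key.strip()] = value.strip()
    return fields


def bbu_missing(fields: dict) -> bool:
    """Return True when the controller reports zero batteries/capacitors."""
    count = fields.get("Battery/Capacitor Count")
    return count is not None and count.isdigit() and int(count) == 0


# Sample text copied from the output above (hypothetical variable name).
SAMPLE = """\
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
"""

fields = parse_battery_fields(SAMPLE)
print(fields)
print(bbu_missing(fields))
```

In practice the text would come from the hpssacli command shown above rather than from a hard-coded sample.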

Marostegui triaged this task as Medium priority.

time=07:55
description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.

I have left MySQL stopped, leaving it to @jcrespo as this is a backup source.
This host is out of warranty by almost a year (purchased in March 2016), but my recommendation is to buy a BBU like we did at {T231670} and {T233567}.
Assigning to @jcrespo for next steps.

Marostegui renamed this task from db1095 backup source crashed to db1095 backup source crashed: broken BBU. Wed, Feb 12, 8:32 AM
Marostegui moved this task from Triage to In progress on the DBA board. Wed, Feb 12, 9:09 AM

Created /srv/sqldata.s2 on db1140 and ran:

transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s2.2020-02-12--01-22-05.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s2

This is more complicated than it should eventually be, but less than it used to be.

Change 571693 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Migrate db1095 backup source instances (s2, s3) to db1140

https://gerrit.wikimedia.org/r/571693

Change 571696 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Disable db1140 notifications

https://gerrit.wikimedia.org/r/571696

Now running:

transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s3.2020-02-12--05-46-09.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s3

I predict s3 will take more time due to filesystem object overhead.

I was actually wrong: it only took 45 minutes (the bottleneck was the 1G network, not IO, memory or CPU).
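
As a rough sanity check on the network being the bottleneck, here is a back-of-the-envelope sketch of how much data a saturated link could move in that time. It assumes "1G" means a 1 Gbit/s link and ignores protocol overhead. The actual size of the transfer is not stated in this task, so the result is only an upper bound.

```python
def max_bytes_transferred(link_bits_per_s: int, seconds: int) -> int:
    """Upper bound on bytes moved over a fully saturated link."""
    return link_bits_per_s * seconds // 8


LINK_BITS_PER_S = 1_000_000_000  # assumption: "1G" = 1 Gbit/s
DURATION_S = 45 * 60             # the 45 minutes reported above

max_bytes = max_bytes_transferred(LINK_BITS_PER_S, DURATION_S)
print(f"{max_bytes / 1e9:.1f} GB upper bound")
```

Under those assumptions, at most about 337.5 GB could have crossed the wire in 45 minutes.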

Change 571693 merged by Jcrespo:
[operations/puppet@production] backups: Migrate db1095 backup source instances (s2, s3) to db1140

https://gerrit.wikimedia.org/r/571693

Change 571696 merged by Jcrespo:
[operations/puppet@production] backups: Disable db1140 notifications

https://gerrit.wikimedia.org/r/571696

The eqiad backup service has been restored on a different host; now to handle the hardware issues.

Mentioned in SAL (#wikimedia-operations) [2020-02-12T15:32:31Z] <marostegui> Disable event handler for db1095 RAID check on icinga - T244958

jcrespo reassigned this task from jcrespo to wiki_willy. Wed, Feb 12, 4:52 PM
jcrespo added a project: ops-eqiad.

The RAID battery of db1095 (HP, out of warranty) is toast. It would be nice not to throw away the whole server just because of the RAID battery. Could we order one?

For DC operators: the server is out of rotation/service, its data is useless after the crash, and notifications are disabled; it can be taken down at any time and in any way without needing our (DBAs) attention.

RobH reassigned this task from wiki_willy to Jclark-ctr. Wed, Feb 12, 5:11 PM
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
RobH added subscribers: Jclark-ctr, wiki_willy, RobH.

Please note that we just ordered replacement RAID batteries for HP Gen9 RAID controllers via T243547.

@Jclark-ctr: Please use one of the batteries from T243547 to fix this host. Thanks!

RobH removed a subscriber: RobH. Wed, Feb 12, 5:11 PM

@jcrespo: The battery replacement delivery date is 02/22/20. Please message me on IRC to let me know what time works best for you for the replacement. I can accommodate your schedule.

Thanks, Jclark-ctr. No need to wait for us in this particular case: the service was moved elsewhere immediately and the data is considered irrecoverable (although the service will eventually have to return to this host). As I said:

The server is out of rotation/service, its data is useless after the crash, and notifications are disabled; it can be taken down at any time and in any way without needing our (DBAs) attention.

Please proceed with db1095, shutting it down and bringing it back up on your own, and ping us when done.

Change 572685 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Disable s3-eqiad backups until source host is restored

https://gerrit.wikimedia.org/r/572685

Change 572685 merged by Jcrespo:
[operations/puppet@production] backups: Disable s3-eqiad backups until source host is restored

https://gerrit.wikimedia.org/r/572685

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:40:58Z] <jynus> starting data check between db1078 and db1140:3313 T244958

Change 573961 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update code to wmfbackup HEAD to fix stalling issue

https://gerrit.wikimedia.org/r/573961

Change 573961 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update code to wmfbackup HEAD to fix stalling issue

https://gerrit.wikimedia.org/r/573961