
db1095 backup source crashed: broken BBU
Closed, Resolved · Public

Description

Times in UTC:

[07:46:29]  <+icinga-wm>	PROBLEM - Host db1095 is DOWN: PING CRITICAL - Packet loss = 100%

Needs investigation

Event Timeline

Looks storage-related:

/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Caution
    date=02/12/2020
    time=07:43
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show
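
For reference, the record above is HPE iLO SMASH CLP output; entries like it can usually be read by logging into the host's management interface over SSH and running show on the record path. A minimal sketch, where the management hostname and user name are assumptions for illustration only:

# SSH to the iLO management interface (user name is an assumption)
ssh mgmt-user@db1095.mgmt.eqiad.wmnet
# then, at the iLO CLP prompt, print the same record shown above
show /system1/log1/record9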

Change 571661 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1095: Disable notifications

https://gerrit.wikimedia.org/r/571661

Change 571661 merged by Marostegui:
[operations/puppet@production] db1095: Disable notifications

https://gerrit.wikimedia.org/r/571661

It rebooted itself:

[07:57:01]  <+icinga-wm>	RECOVERY - Host db1095 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms

And the BBU is gone:

root@db1095:~# hpssacli  controller all show detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0

The system log also records a POST error from the reboot:

    time=07:55
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.

Marostegui triaged this task as Medium priority.

I have left MySQL stopped, leaving it to @jcrespo as this is a backup source.
This host is out of warranty by almost a year (purchased in March 2016), but my recommendation is to buy a BBU like we did in {T231670} and {T233567}.
Assigning to @jcrespo for next steps.

Marostegui renamed this task from "db1095 backup source crashed" to "db1095 backup source crashed: broken BBU". Feb 12 2020, 8:32 AM

Created /srv/sqldata.s2 on db1140 and ran:

transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s2.2020-02-12--01-22-05.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s2

This is still more complicated than it should eventually be, but less complicated than it used to be.
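
For context, with --no-encrypt --no-checksum this amounts to streaming the compressed snapshot from the provisioning host and unpacking it into the new datadir. A rough hand-rolled equivalent would look like the following (a sketch only, not what transfer.py actually executes; the port is arbitrary and netcat flags vary between implementations):

# on db1140 (receiver): listen and unpack the incoming stream into the new datadir
nc -l -p 4444 | tar -xzf - -C /srv/sqldata.s2
# on dbprov1002 (sender): stream the snapshot tarball to the receiver
nc db1140.eqiad.wmnet 4444 < /srv/backups/snapshots/latest/snapshot.s2.2020-02-12--01-22-05.tar.gz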

Change 571693 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Migrate db1095 backup source instances (s2, s3) to db1140

https://gerrit.wikimedia.org/r/571693

Change 571696 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Disable db1140 notifications

https://gerrit.wikimedia.org/r/571696

Now running:

transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s3.2020-02-12--05-46-09.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s3

I predict s3 will take more time due to filesystem object overhead.

I was actually wrong; it only took 45 minutes (the bottleneck was the 1G network, not IO, memory or CPU).
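
For a rough sense of the ceiling that implies (the snapshot size itself is not stated in this task, so this is only an upper bound):

1 Gbit/s ≈ 125 MB/s theoretical, roughly 110 MB/s after protocol overhead
110 MB/s × 45 min × 60 s/min ≈ 297,000 MB, i.e. on the order of 300 GB moved at most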

Change 571693 merged by Jcrespo:
[operations/puppet@production] backups: Migrate db1095 backup source instances (s2, s3) to db1140

https://gerrit.wikimedia.org/r/571693

Change 571696 merged by Jcrespo:
[operations/puppet@production] backups: Disable db1140 notifications

https://gerrit.wikimedia.org/r/571696

The eqiad backup service has been restored on a different host; now to handle the hardware issues.

Mentioned in SAL (#wikimedia-operations) [2020-02-12T15:32:31Z] <marostegui> Disable event handler for db1095 RAID check on icinga - T244958

jcrespo added a project: ops-eqiad.

The battery of db1095 (an HP host, out of warranty) is toasted. It would be nice not to throw away the whole server just for the RAID battery. Could we order one?

For DC operators: the server is out of rotation/service, its data is useless after the crash, and notifications are disabled; it can be taken down at any time and in any way without needing our (the DBAs') attention.

RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
RobH added subscribers: Jclark-ctr, wiki_willy, RobH.

Please note that we just ordered replacement RAID batteries for HP Gen9 RAID controllers via T243547.

@Jclark-ctr: Please use one of the batteries from T243547 to fix this host. Thanks!

@jcrespo: The battery replacement delivery date is 02/22/20. Please message me on IRC about what time works best for you for the replacement; I can accommodate your schedule.

Thanks, Jclark-ctr. No need to wait for us in this particular case: the service was immediately moved elsewhere and the data is considered irrecoverable (but the service does have to return to this host eventually). As I said:

The server is out of rotation/service, its data is useless after the crash, and notifications are disabled; it can be taken down at any time and in any way without needing our (the DBAs') attention.

Please proceed with db1095, shutting it down and bringing it back up on your own, and ping us when done.

Change 572685 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Disable s3-eqiad backups until source host is restored

https://gerrit.wikimedia.org/r/572685

Change 572685 merged by Jcrespo:
[operations/puppet@production] backups: Disable s3-eqiad backups until source host is restored

https://gerrit.wikimedia.org/r/572685

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:40:58Z] <jynus> starting data check between db1078 and db1140:3313 T244958
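
The check compares table contents between the production s3 replica (db1078) and the new backup source instance (db1140:3313). WMF has dedicated tooling for this, but conceptually it reduces to computing per-table checksums on both sides and diffing them; a minimal sketch, with a placeholder database and table name:

# placeholder schema/table; run against both instances and compare the checksums
mysql -h db1078.eqiad.wmnet -e "CHECKSUM TABLE somewiki.revision EXTENDED"
mysql -h db1140.eqiad.wmnet -P 3313 -e "CHECKSUM TABLE somewiki.revision EXTENDED"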

Change 573961 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update code to wmfbackup HEAD to fix stalling issue

https://gerrit.wikimedia.org/r/573961

Change 573961 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update code to wmfbackup HEAD to fix stalling issue

https://gerrit.wikimedia.org/r/573961

This comment was removed by Jclark-ctr.

I can see the battery now - thank you:

root@db1095:~# hpssacli  controller all show detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK

Leaving this task open for @jcrespo to follow up, as there are other things to do, like re-cloning the host, starting MySQL, etc., I believe.

Assigning to Jaime to reflect that this is now pending his follow-up.
Thank you John!

Thanks, I will repopulate this host with some production data, which may help with T246198.

Change 574996 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Add s3 to db1095 (backup source)

https://gerrit.wikimedia.org/r/574996

Change 574996 merged by Jcrespo:
[operations/puppet@production] database-backups: Add s3 to db1095 (backup source)

https://gerrit.wikimedia.org/r/574996

Change 575002 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Productionize db1095 as the backup source of s3

https://gerrit.wikimedia.org/r/575002

Change 575002 merged by Jcrespo:
[operations/puppet@production] database-backups: Productionize db1095 as the backup source of s3

https://gerrit.wikimedia.org/r/575002

I have updated zarcillo to point Prometheus to the right instances (CC @Marostegui so we try to keep that up to date; I sometimes forget, and it is totally my fault for not making it easy and automatic).
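
The zarcillo update is just inventory bookkeeping: the rows mapping sections to backup source instances have to point at the new hosts so the Prometheus targets are generated correctly. Conceptually it is something like the following, where the host placeholder and the table and column names are purely hypothetical, not the real zarcillo schema:

# host, table and column names here are hypothetical, for illustration only
mysql -h ZARCILLO_HOST zarcillo -e "UPDATE section_instances SET instance='db1140:3312' WHERE section='s2' AND instance='db1095:3312'"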

I will run a data check on s3 and s2 and then consider this fixed. Maybe also move s2 on codfw, to mirror the data distribution?

Mentioned in SAL (#wikimedia-operations) [2020-02-26T15:51:09Z] <jynus> starting s2, s3 eqiad backup source data check; expect increase read traffic on db1095:3313, db1140:3312, db1078, db1090:3312 T244958

No differences found in s2 and s3 tables between the backup source hosts and production. Issue fixed.