
db1095 backup source crashed: broken BBU
Closed, Resolved · Public

Description

Times in UTC:

[07:46:29]  <+icinga-wm>	PROBLEM - Host db1095 is DOWN: PING CRITICAL - Packet loss = 100%

Needs investigation

Event Timeline

Looks storage-related:

/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Caution
    date=02/12/2020
    time=07:43
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show
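
For reference, the record above is HPE iLO SMASH CLP output; entries like it can usually be read by logging into the host's management interface over SSH and running show on the record path. A minimal sketch, where the management hostname and user name are assumptions for illustration only:

# SSH to the iLO management interface (user name is an assumption)
ssh mgmt-user@db1095.mgmt.eqiad.wmnet
# then, at the iLO CLP prompt, print the same record shown above
show /system1/log1/record9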

Change 571661 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1095: Disable notifications

https://gerrit.wikimedia.org/r/571661

Change 571661 merged by Marostegui:
[operations/puppet@production] db1095: Disable notifications

https://gerrit.wikimedia.org/r/571661

It rebooted itself:

[07:57:01]  <+icinga-wm>	RECOVERY - Host db1095 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms

And the BBU is gone:

root@db1095:~# hpssacli  controller all show detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0

The system log also records a POST error from the reboot:

    time=07:55
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.

Marostegui triaged this task as Medium priority.

I have left MySQL stopped, leaving it to @jcrespo as this is a backup source.
This host is out of warranty by almost a year (purchased in March 2016), but my recommendation is to buy a BBU like we did in {T231670} and {T233567}.
Assigning to @jcrespo for next steps.

Marostegui renamed this task from "db1095 backup source crashed" to "db1095 backup source crashed: broken BBU". Feb 12 2020, 8:32 AM

Created /srv/sqldata.s2 on db1140 and ran:

transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s2.2020-02-12--01-22-05.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s2

This is still more complicated than it should eventually be, but less complicated than it used to be.
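
For context, with --no-encrypt --no-checksum this amounts to streaming the compressed snapshot from the provisioning host and unpacking it into the new datadir. A rough hand-rolled equivalent would look like the following (a sketch only, not what transfer.py actually executes; the port is arbitrary and netcat flags vary between implementations):

# on db1140 (receiver): listen and unpack the incoming stream into the new datadir
nc -l -p 4444 | tar -xzf - -C /srv/sqldata.s2
# on dbprov1002 (sender): stream the snapshot tarball to the receiver
nc db1140.eqiad.wmnet 4444 < /srv/backups/snapshots/latest/snapshot.s2.2020-02-12--01-22-05.tar.gz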

Change 571693 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Migrate db1095 backup source instances (s2, s3) to db1140

https://gerrit.wikimedia.org/r/571693

Change 571696 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Disable db1140 notifications

https://gerrit.wikimedia.org/r/571696

Now running:

transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s3.2020-02-12--05-46-09.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s3

I predict s3 will take more time due to filesystem object overhead.

I was actually wrong; it only took 45 minutes (the bottleneck was the 1G network, not IO, memory or CPU).
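
For a rough sense of the ceiling that implies (the snapshot size itself is not stated in this task, so this is only an upper bound):

1 Gbit/s ≈ 125 MB/s theoretical, roughly 110 MB/s after protocol overhead
110 MB/s × 45 min × 60 s/min ≈ 297,000 MB, i.e. on the order of 300 GB moved at most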

Change 571693 merged by Jcrespo:
[operations/puppet@production] backups: Migrate db1095 backup source instances (s2, s3) to db1140

https://gerrit.wikimedia.org/r/571693

Change 571696 merged by Jcrespo:
[operations/puppet@production] backups: Disable db1140 notifications

https://gerrit.wikimedia.org/r/571696

The eqiad backup service has been restored on a different host; now to handle the hardware issues.

Mentioned in SAL (#wikimedia-operations) [2020-02-12T15:32:31Z] <marostegui> Disable event handler for db1095 RAID check on icinga - T244958

jcrespo added a project: ops-eqiad.

The battery of db1095 (an HP host, out of warranty) is toasted. It would be nice not to throw away the whole server just for the RAID battery. Could we order one?

For DC operators: the server is out of rotation/service, its data is useless after the crash, and notifications are disabled; it can be taken down at any time and in any way without needing our (the DBAs') attention.

RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
RobH added subscribers: Jclark-ctr, wiki_willy, RobH.

Please note that we just ordered replacement RAID batteries for HP Gen9 RAID controllers via T243547.

@Jclark-ctr: Please use one of the batteries from T243547 to fix this host. Thanks!

@jcrespo: The battery replacement delivery date is 02/22/20. Please message me on IRC about what time works best for you for the replacement; I can accommodate your schedule.

Thanks, Jclark-ctr. No need to wait for us in this particular case: the service was immediately moved elsewhere and the data is considered irrecoverable (but the service does have to return to this host eventually). As I said:

The server is out of rotation/service, its data is useless after the crash, and notifications are disabled; it can be taken down at any time and in any way without needing our (the DBAs') attention.

Please proceed with db1095, shutting it down and bringing it back up on your own, and ping us when done.

Change 572685 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Disable s3-eqiad backups until source host is restored

https://gerrit.wikimedia.org/r/572685

Change 572685 merged by Jcrespo:
[operations/puppet@production] backups: Disable s3-eqiad backups until source host is restored

https://gerrit.wikimedia.org/r/572685

Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:40:58Z] <jynus> starting data check between db1078 and db1140:3313 T244958
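
The check compares table contents between the production s3 replica (db1078) and the new backup source instance (db1140:3313). WMF has dedicated tooling for this, but conceptually it reduces to computing per-table checksums on both sides and diffing them; a minimal sketch, with a placeholder database and table name:

# placeholder schema/table; run against both instances and compare the checksums
mysql -h db1078.eqiad.wmnet -e "CHECKSUM TABLE somewiki.revision EXTENDED"
mysql -h db1140.eqiad.wmnet -P 3313 -e "CHECKSUM TABLE somewiki.revision EXTENDED"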

Change 573961 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update code to wmfbackup HEAD to fix stalling issue

https://gerrit.wikimedia.org/r/573961

Change 573961 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update code to wmfbackup HEAD to fix stalling issue

https://gerrit.wikimedia.org/r/573961

This comment was removed by Jclark-ctr.

I can see the battery now - thank you:

root@db1095:~# hpssacli  controller all show detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK

Leaving this task open for @jcrespo to follow up, as there are other things to do, like re-cloning the host, starting MySQL, etc., I believe.

Assigning to Jaime to reflect that this is now pending his follow-up.
Thank you John!

Thanks, I will repopulate this host with some production data, which may help with T246198.

Change 574996 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Add s3 to db1095 (backup source)

https://gerrit.wikimedia.org/r/574996

Change 574996 merged by Jcrespo:
[operations/puppet@production] database-backups: Add s3 to db1095 (backup source)

https://gerrit.wikimedia.org/r/574996

Change 575002 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Productionize db1095 as the backup source of s3

https://gerrit.wikimedia.org/r/575002

Change 575002 merged by Jcrespo:
[operations/puppet@production] database-backups: Productionize db1095 as the backup source of s3

https://gerrit.wikimedia.org/r/575002

I have updated zarcillo to point Prometheus to the right instances (CC @Marostegui so we try to keep that up to date; I sometimes forget, and it is totally my fault for not making it easy and automatic).
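
The zarcillo update is just inventory bookkeeping: the rows mapping sections to backup source instances have to point at the new hosts so the Prometheus targets are generated correctly. Conceptually it is something like the following, where the host placeholder and the table and column names are purely hypothetical, not the real zarcillo schema:

# host, table and column names here are hypothetical, for illustration only
mysql -h ZARCILLO_HOST zarcillo -e "UPDATE section_instances SET instance='db1140:3312' WHERE section='s2' AND instance='db1095:3312'"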

I will run a data check on s3 and s2 and then consider this fixed. Maybe also move s2 on codfw, to mirror the data distribution?

Mentioned in SAL (#wikimedia-operations) [2020-02-26T15:51:09Z] <jynus> starting s2, s3 eqiad backup source data check; expect increase read traffic on db1095:3313, db1140:3312, db1078, db1090:3312 T244958

No differences found in s2 and s3 tables between the backup source hosts and production. Issue fixed.