Times in UTC:
[07:46:29] <+icinga-wm> PROBLEM - Host db1095 is DOWN: PING CRITICAL - Packet loss = 100%
Needs investigation
Looks storage related:
/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Caution
    date=02/12/2020
    time=07:43
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show
Change 571661 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1095: Disable notifications
Change 571661 merged by Marostegui:
[operations/puppet@production] db1095: Disable notifications
It rebooted itself:
[07:57:01] <+icinga-wm> RECOVERY - Host db1095 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
And the BBU is gone:
root@db1095:~# hpssacli controller all show detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
time=07:55 description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
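The BBU check above can be automated by parsing the controller detail output. A minimal sketch, assuming the `hpssacli controller all show detail` output format shown above (the helper name and sample text are illustrative, not part of any existing tool):

```python
# Sketch: decide whether the RAID controller reports a working battery,
# based on parsing `hpssacli controller all show detail` output.
# The sample text mirrors the failed state seen on db1095.

def battery_present(detail_output: str) -> bool:
    """Return True if the controller reports at least one battery/capacitor."""
    for line in detail_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Battery/Capacitor Count":
            return int(value.strip()) > 0
    return False

failed_state = """\
No-Battery Write Cache: Disabled
Battery/Capacitor Count: 0
"""

print(battery_present(failed_state))  # → False
```

A check like this is what an Icinga RAID plugin effectively does when it alerts on a missing or failed BBU.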
I have left MySQL stopped and am leaving it to @jcrespo, as this is a backup source.
This host is out of warranty by almost a year (purchased in March 2016), but my recommendation is to buy a BBU as we did in {T231670} and {T233567}
Assigning to @jcrespo for next steps
created /srv/sqldata.s2 on db1140 and ran:
transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s2.2020-02-12--01-22-05.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s2
This process is still more complicated than it should eventually be, but less so than it used to be.
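The restore step above is just a fixed invocation pattern: decompress the latest snapshot for a section from the provisioning host into a fresh datadir. A sketch of how that command line is assembled (`build_transfer_cmd` is a hypothetical helper for illustration, not an existing wmfbackup function; hosts and flags mirror the command above):

```python
# Sketch: build the transfer.py invocation used to restore a snapshot
# onto a new backup source host. Flags and paths mirror the command in
# this task; build_transfer_cmd itself is a hypothetical helper.

def build_transfer_cmd(section: str, snapshot: str,
                       src_host: str = "dbprov1002.eqiad.wmnet",
                       dst_host: str = "db1140.eqiad.wmnet") -> list:
    """Return the argv list for a decompressing, unencrypted transfer."""
    src = f"{src_host}:/srv/backups/snapshots/latest/{snapshot}"
    dst = f"{dst_host}:/srv/sqldata.{section}"
    return ["transfer.py", "--type=decompress", "--no-encrypt",
            "--no-checksum", src, dst]

cmd = build_transfer_cmd("s2", "snapshot.s2.2020-02-12--01-22-05.tar.gz")
print(" ".join(cmd))
```

Templating the command per section is what makes repeating the restore for s3 (below) a one-liner.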
Change 571693 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Migrate db1095 backup source instances (s2, s3) to db1140
Change 571696 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Disable db1140 notifications
Now running:
transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s3.2020-02-12--05-46-09.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s3
I predict s3 will take more time due to filesystem object overhead.
I was actually wrong; it only took 45 minutes (the bottleneck was the 1G network, not I/O, memory, or CPU).
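A network-bound transfer like this is easy to estimate: a 1 Gbit/s link tops out at 125 MB/s in theory, somewhat less in practice. A back-of-envelope sketch (the 300 GB size and 90% link efficiency are illustrative assumptions, not the actual s3 figures):

```python
# Back-of-envelope: on a saturated 1 Gbit/s link (~125 MB/s theoretical,
# less in practice), transfer time is roughly size / rate.
# The 300 GB input below is an assumed example size, not the real one.

def transfer_minutes(size_gb: float, link_gbits: float = 1.0,
                     efficiency: float = 0.9) -> float:
    """Estimate wall-clock minutes to move size_gb over the given link."""
    rate_mb_s = link_gbits * 125.0 * efficiency  # effective MB/s
    return (size_gb * 1024.0) / rate_mb_s / 60.0

print(round(transfer_minutes(300), 1))  # → 45.5
```

Under these assumptions, ~300 GB over a saturated 1G link lands right around the observed 45 minutes, consistent with the network (not disk I/O) being the bottleneck.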
Change 571693 merged by Jcrespo:
[operations/puppet@production] backups: Migrate db1095 backup source instances (s2, s3) to db1140
Change 571696 merged by Jcrespo:
[operations/puppet@production] backups: Disable db1140 notifications
Mentioned in SAL (#wikimedia-operations) [2020-02-12T15:32:31Z] <marostegui> Disable event handler for db1095 RAID check on icinga - T244958
The battery of db1095 (an out-of-warranty HP host) is toast. It would be nice not to throw away the whole server just for the RAID battery. Could we order one?
For DC operators: the server is out of rotation/service, the data is useless after the crash, and notifications are disabled; it can be powered down at any time and in any way without needing our (DBAs') attention.
Please note that we just ordered replacement raid batteries for HP Gen9 raid controllers via T243547.
@Jclark-ctr: Please use one of the batteries from T243547 to fix this host. Thanks!
@jcrespo: The battery replacement delivery date is 02/22/20. Please message me on IRC about what time works best for you for the replacement; I can accommodate your schedule.
Thanks, @Jclark-ctr. No need to wait for us in this particular case: the service was immediately moved elsewhere and the data is considered irrecoverable, though the service has to return to this host eventually. As I said:
The server is out of rotation/service, the data is useless after the crash, and notifications are disabled; it can be powered down at any time and in any way without needing our (DBAs') attention.
Please proceed with db1095: shut it down and bring it back up on your own schedule, and ping us when done.
Change 572685 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Disable s3-eqiad backups until source host is restored
Change 572685 merged by Jcrespo:
[operations/puppet@production] backups: Disable s3-eqiad backups until source host is restored
Mentioned in SAL (#wikimedia-operations) [2020-02-19T17:40:58Z] <jynus> starting data check between db1078 and db1140:3313 T244958
Change 573961 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update code to wmfbackup HEAD to fix stalling issue
Change 573961 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update code to wmfbackup HEAD to fix stalling issue
I can see the battery - thank you
root@db1095:~# hpssacli controller all show detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
Leaving this task open for @jcrespo to follow up, as I believe there are other things to do, like re-cloning the host, starting MySQL, etc.
Thanks, I will repopulate this host with some production data, which may help with T246198.
Change 574996 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Add s3 to db1095 (backup source)
Change 574996 merged by Jcrespo:
[operations/puppet@production] database-backups: Add s3 to db1095 (backup source)
Change 575002 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] database-backups: Productionize db1095 as the backup source of s3
Change 575002 merged by Jcrespo:
[operations/puppet@production] database-backups: Productionize db1095 as the backup source of s3
I have updated zarcillo to point Prometheus at the right instances (CC @Marostegui to help keep that up to date; I sometimes forget, and it is entirely my fault for not making it easy and automatic).
I will run a data check on s3 and s2 and then consider this fixed. Maybe move s2 on codfw too, to mirror the data distribution?
Mentioned in SAL (#wikimedia-operations) [2020-02-26T15:51:09Z] <jynus> starting s2, s3 eqiad backup source data check; expect increased read traffic on db1095:3313, db1140:3312, db1078, db1090:3312 T244958
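The idea behind such a data check is to compare checksums of the same table chunks on the production replica and the rebuilt backup source, rather than diffing rows directly. A minimal sketch of that principle (the real WMF tooling differs; the hashing scheme and sample rows here are illustrative only):

```python
# Sketch of a backup-source data check: compare per-chunk checksums of
# the same table on the production replica and the backup source host.
# Real tooling works over live MySQL connections; this is the core idea.

import hashlib

def chunk_checksum(rows) -> str:
    """Deterministic checksum over an ordered chunk of rows."""
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()

# Illustrative sample data standing in for rows fetched from each host.
production_chunk = [(1, "Main_Page"), (2, "Sandbox")]
backup_chunk = [(1, "Main_Page"), (2, "Sandbox")]

print(chunk_checksum(production_chunk) == chunk_checksum(backup_chunk))  # → True
```

Comparing digests instead of raw rows keeps the network and memory cost low even on very large tables, which matters when the check itself adds read traffic to production.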
No differences found on s3, s2 tables between source backups and production. Issue fixed.