
db1074 crashed: Broken BBU
Closed, Resolved (Public)

Description

db1074 went down:

05:18 <+icinga-wm> PROBLEM - Host db1074 is DOWN: PING CRITICAL - Packet loss = 100%

This caused slave lag on db1125:

<+icinga-wm> PROBLEM - MariaDB Slave IO: s2 on db1125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No,

The console showed nothing, so ema powercycled it.

Then jynus depooled it and pointed out that the outage also hits tendril, which is annoying but does not break the wikis.

06:10 <+logmsgbot> !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Mirror dbctl depool of db1074 (duration: 00m 55s)

After the depool we saw recovery of the slave lag alert.

Details

Related Gerrit Patches:
[operations/puppet@production] db1074: Enable notifications
[operations/puppet@production] db1074: Disable notifications also in puppet

Event Timeline

Dzahn created this task. Aug 30 2019, 10:14 AM
Restricted Application added a subscriber: Aklapper. Aug 30 2019, 10:14 AM
Dzahn updated the task description. Aug 30 2019, 10:18 AM
Dzahn removed a subscriber: Dzahn.

[10:23:40] <jynus> !log reseting db1074 from iLo

On reboot:

313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400.
Action: Restart system. Contact HPE support if condition persists.



Important information available or errors detected


For what it's worth, if the BBU is really broken, this host isn't under warranty anymore.

Marostegui updated the task description. Aug 30 2019, 10:31 AM

BBU is broken:

  description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
Verbs
  cd vers
  description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.

@wiki_willy support for this host expired in March. Any chance we have a spare BBU in eqiad? This host is an active slave, and losing it only a few months out of warranty wouldn't be great.

Marostegui renamed this task from db1074 crashed to db1074 crashed: Broken BBU. Aug 30 2019, 10:36 AM

Mentioned in SAL (#wikimedia-operations) [2019-08-30T11:33:56Z] <jynus> switching db1125:s2 (eqiad sanitarium) to replicate from codfw T231638

Marostegui triaged this task as High priority. Aug 30 2019, 1:31 PM
Marostegui moved this task from Triage to In progress on the DBA board.
wiki_willy added a project: ops-eqiad.
wiki_willy added subscribers: Jclark-ctr, Cmjohnson.

@Cmjohnson @Jclark-ctr - do you guys know offhand if we have a spare BBU lying around from a decom'd server by any chance? If not, let me know and we'll order the part.

Thanks,
Willy

@wiki_willy negative, we do not have any spare BBUs lying around.

Thanks for confirming @Cmjohnson , subtask T231670 created for Rob to order the part. Thanks, Willy


Thank you sooo much :)

Change 536970 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1074: Disable notifications also in puppet

https://gerrit.wikimedia.org/r/536970

Change 536970 merged by Marostegui:
[operations/puppet@production] db1074: Disable notifications also in puppet

https://gerrit.wikimedia.org/r/536970

Mentioned in SAL (#wikimedia-operations) [2019-09-17T07:58:08Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1074 with just 50 to keep its warmness level just in case T231638', diff saved to https://phabricator.wikimedia.org/P9115 and previous config saved to /var/cache/conftool/dbconfig/20190917-075807-marostegui.json

This host's original weight was 200 in main traffic and 1 in API. I have only pooled it with weight 50 in main traffic, just to get it to do something.
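To put the reduced weight in perspective, here is a rough sketch of the arithmetic. It assumes that read traffic within a section is split in proportion to pooled weight, which is a simplification. The peer hosts and their weights are hypothetical, since the ticket only gives db1074's own weights (200 originally, 50 for now).

```python
def traffic_share(weights: dict[str, int], host: str) -> float:
    """Fraction of main traffic a host receives, assuming selection is
    proportional to weight (a simplifying assumption)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one host must have a positive weight")
    return weights[host] / total


# Hypothetical section: the peer names and weights are made up for illustration.
normal = {"db1074": 200, "peer-a": 200, "peer-b": 200}
warming = {**normal, "db1074": 50}  # db1074 pooled at weight 50, as in this ticket

print(f"normal:  {traffic_share(normal, 'db1074'):.1%}")
print(f"warming: {traffic_share(warming, 'db1074'):.1%}")
```

With these made-up peers the host drops from about a third of main traffic to about a ninth, enough to keep its buffer pool warm without depending on it.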

Change 537319 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1074: Enable notifications

https://gerrit.wikimedia.org/r/537319

The BBU showed up again (the usual behaviour with a broken BBU):

root@db1074:~# hpssacli controller all show status

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK
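For anyone wanting to check this output from a script, a minimal sketch of parsing it follows. It assumes the layout shown above (an unindented controller header followed by indented `Key: Value` lines). The function names are made up for illustration and are not part of any existing tooling. As this ticket shows, a battery status of OK does not prove the BBU is healthy, so treat the result as a hint only.

```python
def parse_controller_status(text: str) -> dict[str, dict[str, str]]:
    """Parse `hpssacli controller all show status` output into
    {controller header: {field: value}}. Assumes the layout seen above."""
    controllers: dict[str, dict[str, str]] = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        if not line.startswith(" "):
            # Unindented line: a controller header such as "Smart Array P840 in Slot 1"
            current = controllers.setdefault(line.strip(), {})
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return controllers


def fields_not_ok(status: dict[str, str]) -> list[str]:
    """Return the names of fields whose value is anything other than 'OK'."""
    return [key for key, value in status.items() if value != "OK"]


SAMPLE = """
Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK
"""

for name, status in parse_controller_status(SAMPLE).items():
    print(name, "->", fields_not_ok(status))
```

On the output pasted above this flags only `Cache Status`, which is `Not Configured` even though the battery reports OK.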

Change 537319 merged by Marostegui:
[operations/puppet@production] db1074: Enable notifications

https://gerrit.wikimedia.org/r/537319

Marostegui mentioned this in Unknown Object (Task). Sep 25 2019, 1:05 PM

Reminder to move sanitarium (T231638#5453802) back here (or somewhere else on eqiad) before closing this ticket.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Sep 27 2019, 2:02 PM

Mentioned in SAL (#wikimedia-operations) [2019-10-09T12:42:19Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1074 for BBU replacement T231638', diff saved to https://phabricator.wikimedia.org/P9278 and previous config saved to /var/cache/conftool/dbconfig/20191009-124218-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-09T12:42:41Z] <marostegui> Stop MySQL and power off db1074 for BBU replacement T231638

The BBU has been replaced by @Jclark-ctr (thanks!)
Let's leave the task open for 24h, as we need to also move sanitarium back under this host.

Claiming this task to indicate that I have to work on this now before closing it.

Marostegui lowered the priority of this task from High to Medium. Oct 9 2019, 2:45 PM

Mentioned in SAL (#wikimedia-operations) [2019-10-10T14:42:02Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1074 after BBU replacement T231638', diff saved to https://phabricator.wikimedia.org/P9305 and previous config saved to /var/cache/conftool/dbconfig/20191010-144201-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-10T14:57:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db1074 after getting its BBU replaced T231638', diff saved to https://phabricator.wikimedia.org/P9306 and previous config saved to /var/cache/conftool/dbconfig/20191010-145737-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-14T07:33:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1074 and db2126 to change sanitarium to replicate from db1074 T231638', diff saved to https://phabricator.wikimedia.org/P9320 and previous config saved to /var/cache/conftool/dbconfig/20191014-073319-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-14T07:54:04Z] <marostegui> Stop db1074 and db2126 in sync to change sanitarium's master for s2 - T231638

Mentioned in SAL (#wikimedia-operations) [2019-10-14T08:51:44Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1074 and db2126 after changing sanitarium to replicate from db1074 T231638', diff saved to https://phabricator.wikimedia.org/P9322 and previous config saved to /var/cache/conftool/dbconfig/20191014-085143-marostegui.json

Marostegui closed this task as Resolved. Oct 14 2019, 8:53 AM

db1125:3312 has been moved under db1074 with the following coordinates (GTID also enabled):

change master to master_host='db1074.eqiad.wmnet', master_user='repl', master_password='x', master_port=3306, master_ssl=1, master_log_pos=388898652, master_log_file='db1074-bin.004543';
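For reference, a small sketch of how a statement like the one above could be assembled from recorded coordinates. The helper name is hypothetical and the password is the same placeholder used in this ticket. The ticket does not show how GTID was enabled. In MariaDB that is commonly done with a separate `CHANGE MASTER TO MASTER_USE_GTID=slave_pos` while the slave is stopped, so that step is included here only as an assumption.

```python
def change_master_sql(host: str, port: int, user: str, log_file: str,
                      log_pos: int, *, password: str = "x", ssl: bool = True) -> str:
    """Build a CHANGE MASTER TO statement from binlog coordinates.
    Hypothetical helper; the real password must never be logged."""
    for value in (host, user, log_file, password):
        if "'" in value or "\\" in value:
            raise ValueError(f"refusing to quote unsafe value: {value!r}")
    parts = [
        f"MASTER_HOST='{host}'",
        f"MASTER_USER='{user}'",
        f"MASTER_PASSWORD='{password}'",
        f"MASTER_PORT={int(port)}",
        f"MASTER_SSL={1 if ssl else 0}",
        f"MASTER_LOG_FILE='{log_file}'",
        f"MASTER_LOG_POS={int(log_pos)}",
    ]
    return "CHANGE MASTER TO " + ", ".join(parts) + ";"


# Assumption: GTID is switched on in a separate step after replication is
# established from the file/position coordinates (not shown in the ticket).
ENABLE_GTID_SQL = "CHANGE MASTER TO MASTER_USE_GTID=slave_pos;"

stmt = change_master_sql("db1074.eqiad.wmnet", 3306, "repl",
                         "db1074-bin.004543", 388898652)
print(stmt)
print(ENABLE_GTID_SQL)
```

Keeping the coordinates as explicit arguments makes it easy to paste the exact file and position into the task for later reference, as was done here.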