db1074 crashed: Broken BBU
Closed, ResolvedPublic

Assigned To

Authored By

	Dzahn
	Aug 30 2019, 10:14 AM

Description

db1074 went down:

05:18 <+icinga-wm> PROBLEM - Host db1074 is DOWN: PING CRITICAL - Packet loss = 100%

This caused slave lag on db1125

<+icinga-wm> PROBLEM - MariaDB Slave IO: s2 on db1125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No,

The console showed nothing, so ema powercycled it.

Then jynus depooled it and pointed out that this hits tendril, which is annoying but does not break the wikis.

06:10 <+logmsgbot> !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Mirror dbctl depool of db1074 (duration: 00m 55s)

After the depool we saw recovery of the slave lag alert.

Details

	Subject	Repo	Branch	Lines +/-
	db1074: Enable notifications	operations/puppet	production	+0 -1
	db1074: Disable notifications also in puppet	operations/puppet	production	+1 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Marostegui	T231638 db1074 crashed: Broken BBU
					Unknown Object (Task)

Event Timeline

Dzahn created this task. Aug 30 2019, 10:14 AM

Restricted Application added a subscriber: Aklapper. Aug 30 2019, 10:14 AM

Dzahn updated the task description. Aug 30 2019, 10:18 AM

Dzahn unsubscribed.

[10:23:40] <jynus> !log reseting db1074 from iLo

On reboot:

313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400.
Action: Restart system. Contact HPE support if condition persists.



Important information available or errors detected

In T231638#5453726, @jcrespo wrote:

On reboot:

313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400.
Action: Restart system. Contact HPE support if condition persists.



Important information available or errors detected

For what it's worth, if the BBU is really broken, this host isn't under warranty anymore.

Marostegui updated the task description. Aug 30 2019, 10:31 AM

BBU is broken:

  description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
Verbs
  cd vers
  description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.

@wiki_willy this host's support expired in March. Is there any chance we have a spare BBU in eqiad? This host is an active slave, and losing it when its support lapsed only a few months ago wouldn't be great.

Marostegui renamed this task from db1074 crashed to db1074 crashed: Broken BBU. Aug 30 2019, 10:36 AM

Marostegui merged a task: T231639: Degraded RAID on db1074. Aug 30 2019, 10:53 AM

Marostegui added a subscriber: ops-monitoring-bot.

Mentioned in SAL (#wikimedia-operations) [2019-08-30T11:33:56Z] <jynus> switching db1125:s2 (eqiad sanitarium) to replicate from codfw T231638

Marostegui triaged this task as High priority. Aug 30 2019, 1:31 PM

Marostegui moved this task from Triage to In progress on the DBA board.

@Cmjohnson @Jclark-ctr - do you guys know offhand if we have a spare BBU lying around from a decom'd server by any chance? If not, let me know and we'll order the part.

Thanks,
Willy

@wiki_willy negative, we do not have any spare BBUs lying around.

Thanks for confirming @Cmjohnson , subtask T231670 created for Rob to order the part. Thanks, Willy

In T231638#5454666, @wiki_willy wrote:

Thanks for confirming @Cmjohnson , subtask T231670 created for Rob to order the part. Thanks, Willy

Thank you sooo much :)

Marostegui mentioned this in T232592: Degraded RAID on db1074. Sep 11 2019, 11:57 AM

Change 536970 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1074: Disable notifications also in puppet

https://gerrit.wikimedia.org/r/536970

gerritbot added a project: Patch-For-Review. Sep 16 2019, 9:02 AM

Change 536970 merged by Marostegui:
[operations/puppet@production] db1074: Disable notifications also in puppet

https://gerrit.wikimedia.org/r/536970

Maintenance_bot removed a project: Patch-For-Review. Sep 16 2019, 9:10 AM

Mentioned in SAL (#wikimedia-operations) [2019-09-17T07:58:08Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1074 with just 50 to keep its warmness level just in case T231638', diff saved to https://phabricator.wikimedia.org/P9115 and previous config saved to /var/cache/conftool/dbconfig/20190917-075807-marostegui.json

This host's original weight was 200 in main traffic and 1 in API. I have only pooled it with weight 50 in main traffic, just to get it to do something.
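To give a rough sense of what pooling at 50 instead of 200 means, here is a minimal Python sketch. It assumes replicas receive traffic in proportion to their weight; the peer weights (300 and 300) are hypothetical and are not taken from the real s2 configuration.

```python
def traffic_share(weight, peer_weights):
    """Fraction of reads a replica would get under simple weighted balancing.

    Assumption: traffic is split proportionally to weight. Real load
    balancers may also take replication lag and other factors into account.
    """
    total = weight + sum(peer_weights)
    return weight / total if total else 0.0


# Hypothetical weights of the other replicas in the section (illustration only).
HYPOTHETICAL_PEERS = [300, 300]

# Weights taken from the comment above.
ORIGINAL_WEIGHT = 200
WARMUP_WEIGHT = 50

original_share = traffic_share(ORIGINAL_WEIGHT, HYPOTHETICAL_PEERS)
warmup_share = traffic_share(WARMUP_WEIGHT, HYPOTHETICAL_PEERS)

if __name__ == "__main__":
    print(f"share at weight {ORIGINAL_WEIGHT}: {original_share:.1%}")
    print(f"share at weight {WARMUP_WEIGHT}: {warmup_share:.1%}")
```

Under these assumed peers, the reduced weight keeps the host's buffer pool warm while sending it only a small part of the read traffic.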

Change 537319 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1074: Enable notifications

https://gerrit.wikimedia.org/r/537319

gerritbot added a project: Patch-For-Review. Sep 17 2019, 8:02 AM

The BBU showed up again (usual behaviour with a broken BBU)

root@db1074:~# hpssacli controller all show status

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK
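A check like the one above could be scripted along these lines. This is a minimal Python sketch that assumes the output looks exactly like the paste above; the function names are hypothetical. As noted in the comment, a broken BBU can report OK intermittently, so a single OK reading is not conclusive.

```python
SAMPLE_OUTPUT = """\
Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK
"""


def parse_controller_status(text):
    """Parse 'controller all show status'-style text into a nested dict.

    Unindented lines are treated as controller names; indented
    'Key: Value' lines are attached to the most recent controller.
    """
    controllers = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            current = line.strip()
            controllers[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            controllers[current][key.strip()] = value.strip()
    return controllers


def fields_not_ok(controllers):
    """Return (controller, field, value) for every field whose value is not 'OK'."""
    return [
        (name, field, value)
        for name, fields in controllers.items()
        for field, value in fields.items()
        if value != "OK"
    ]


if __name__ == "__main__":
    for name, field, value in fields_not_ok(parse_controller_status(SAMPLE_OUTPUT)):
        print(f"{name}: {field} = {value}")
```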

Change 537319 merged by Marostegui:
[operations/puppet@production] db1074: Enable notifications

https://gerrit.wikimedia.org/r/537319

Maintenance_bot removed a project: Patch-For-Review. Sep 17 2019, 8:10 AM

Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. Sep 19 2019, 12:09 PM

Marostegui mentioned this in T233534: db1075 (s3 master) crashed - BBU failure. Sep 23 2019, 5:06 AM

Marostegui mentioned this in T233569: Batch db1074-db1079 hosts having BBU issues. Sep 23 2019, 5:10 AM

Marostegui mentioned this in Unknown Object (Task). Sep 25 2019, 1:05 PM

Reminder to move sanitarium (T231638#5453802) back here (or somewhere else on eqiad) before closing this ticket.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Sep 27 2019, 2:02 PM

Mentioned in SAL (#wikimedia-operations) [2019-10-09T12:42:19Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1074 for BBU replacement T231638', diff saved to https://phabricator.wikimedia.org/P9278 and previous config saved to /var/cache/conftool/dbconfig/20191009-124218-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-09T12:42:41Z] <marostegui> Stop MySQL and power off db1074 for BBU replacement T231638

The BBU has been replaced by @Jclark-ctr (thanks!)
Let's leave the task open for 24h, as we need to also move sanitarium back under this host.

Claiming this task to indicate that I have to work on this now before closing it.

Marostegui lowered the priority of this task from High to Medium. Oct 9 2019, 2:45 PM

Mentioned in SAL (#wikimedia-operations) [2019-10-10T14:42:02Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1074 after BBU replacement T231638', diff saved to https://phabricator.wikimedia.org/P9305 and previous config saved to /var/cache/conftool/dbconfig/20191010-144201-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-10T14:57:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db1074 after getting its BBU replaced T231638', diff saved to https://phabricator.wikimedia.org/P9306 and previous config saved to /var/cache/conftool/dbconfig/20191010-145737-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-14T07:33:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1074 and db2126 to change sanitarium to replicate from db1074 T231638', diff saved to https://phabricator.wikimedia.org/P9320 and previous config saved to /var/cache/conftool/dbconfig/20191014-073319-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-14T07:54:04Z] <marostegui> Stop db1074 and db2126 in sync to change sanitarium's master for s2 - T231638

Mentioned in SAL (#wikimedia-operations) [2019-10-14T08:51:44Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1074 and db2126 after changing sanitarium to replicate from db1074 T231638', diff saved to https://phabricator.wikimedia.org/P9322 and previous config saved to /var/cache/conftool/dbconfig/20191014-085143-marostegui.json

db1125:3312 has been moved under db1074 with the following coordinates (GTID also enabled):

change master to master_host='db1074.eqiad.wmnet', master_user='repl', master_password='x' ,master_port=3306, MASTER_SSL=1,master_log_pos=388898652,master_log_file='db1074-bin.004543';
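For anyone recording coordinates like these in a script, a statement of the same shape can be rendered from its parts. This is only a Python sketch: the helper name is hypothetical, the password stays a placeholder as in the statement above, and the values are assumed to come from a trusted operator (there is no escaping here).

```python
def render_change_master(host, log_file, log_pos,
                         user="repl", password="x", port=3306, ssl=True):
    """Render a CHANGE MASTER TO statement from binlog coordinates.

    Hypothetical helper for illustration. Inputs are interpolated as-is,
    so it must only be fed trusted, operator-supplied values.
    """
    parts = [
        f"MASTER_HOST='{host}'",
        f"MASTER_USER='{user}'",
        f"MASTER_PASSWORD='{password}'",
        f"MASTER_PORT={int(port)}",
        f"MASTER_SSL={1 if ssl else 0}",
        f"MASTER_LOG_FILE='{log_file}'",
        f"MASTER_LOG_POS={int(log_pos)}",
    ]
    return "CHANGE MASTER TO " + ", ".join(parts) + ";"


# Coordinates from the comment above.
statement = render_change_master(
    host="db1074.eqiad.wmnet",
    log_file="db1074-bin.004543",
    log_pos=388898652,
)

if __name__ == "__main__":
    print(statement)
```

The sketch covers only the binlog-coordinate form shown above; how GTID was enabled afterwards is not shown in the ticket, so it is left out.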
