
db1076 crashed - BBU failure
Closed, Resolved · Public

Description

db1076 crashed today, and the logs show BBU failure:

[13:57:22]  <+icinga-wm>	PROBLEM - Host db1076 is DOWN: PING CRITICAL - Packet loss = 100%
/system1/log1/record17
  Targets
  Properties
    number=17
    severity=Caution
    date=10/06/2020
    time=13:47
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

This host is going away soon as part of the refresh (T264584).

Event Timeline

The battery is gone:

root@db1076:~# hpssacli controller all show detail | grep Battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
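
For completeness, the same controller detail output also shows whether the controller fell back to write-through because of the missing battery. A minimal check, assuming the same hpssacli binary used above (the exact labels can differ slightly between controller generations and firmware):

# Inspect the cache-related lines of the controller status
hpssacli controller all show detail | grep -i cache

# On this host we would expect "No-Battery Write Cache: Disabled" and a
# cache status showing the write cache is not active without a battery.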
Marostegui triaged this task as Medium priority. Oct 6 2020, 2:07 PM
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2020-10-06T14:08:00Z] <marostegui> Reboot db1076 for kernel upgrade T264755

Change 632494 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1076: Add clarification comment on BBU status

https://gerrit.wikimedia.org/r/632494

Change 632494 merged by Marostegui:
[operations/puppet@production] db1076: Add clarification comment on BBU status

https://gerrit.wikimedia.org/r/632494

I have rebooted the host to make sure it boots up cleanly and to get it on the newest kernel.
Let's leave the controller on the write-through (WT) policy; if we see that the server cannot keep up, we can force it to write-back (WB). I don't expect that to happen, as it hasn't happened with other hosts since we migrated to SSDs.
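
If WB ever did need to be forced despite the dead battery, a rough sketch with the same CLI follows; nobatterywritecache is the option I would expect to control this, and the slot number is an assumption, so both should be verified against the controller's actual configuration first:

# Confirm the controller slot first (assumption: slot 0 below)
hpssacli controller all show status

# Keep the write cache enabled even without a working battery. This trades
# durability on power loss for write latency, so it should only be forced
# if the host demonstrably cannot keep up in WT mode.
hpssacli controller slot=0 modify nobatterywritecache=enable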

MySQL has started and recovered from the crash. I am going to let replication catch up and then start a table comparison to make sure we are good on the data consistency side.
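
As an illustration of what such a comparison involves (the actual check is done with the DBA tooling, which compares chunks of rows by primary key between master and replica), a simplified, hypothetical stand-in using CHECKSUM TABLE; host and database names below are placeholders, not the real ones:

# Hypothetical sketch: compare per-table checksums between the master and
# db1076. CHECKSUM TABLE results are only comparable when both sides use
# the same row format, and scanning large tables this way is expensive.
MASTER=db-master.example   # assumption: placeholder for the real s2 master
REPLICA=db1076
DB=somewiki                # assumption: placeholder database name

for TBL in actor archive comment logging page revision text user watchlist; do
  m=$(mysql -h "$MASTER"  -N -e "CHECKSUM TABLE ${DB}.${TBL};" | awk '{print $2}')
  r=$(mysql -h "$REPLICA" -N -e "CHECKSUM TABLE ${DB}.${TBL};" | awk '{print $2}')
  [ "$m" = "$r" ] && echo "$TBL OK" || echo "$TBL MISMATCH ($m vs $r)"
done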

Marostegui added a subscriber: ops-monitoring-bot.

Comparison between the master and this host started for the following tables:

actor actor_id
archive ar_id
change_tag ct_id
comment comment_id
logging log_id
page page_id
revision rev_id
revision_actor_temp revactor_rev
revision_comment_temp revcomment_rev
slots slot_revision_id
text old_id
user user_id
watchlist wl_id
ipblocks ipb_id

Mentioned in SAL (#wikimedia-operations) [2020-10-07T09:09:44Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1076 T264755 ', diff saved to https://phabricator.wikimedia.org/P12943 and previous config saved to /var/cache/conftool/dbconfig/20201007-090943-marostegui.json
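
For context, the repool recorded in the SAL entry above is done through dbctl. A rough sketch of the usual flow, assuming the instance/config subcommands behave as in other SAL entries (the pooling percentage and commit message are illustrative):

# Mark db1076 as pooled again (weight can be ramped up gradually)
dbctl instance db1076 pool -p 100

# Commit the change so it is applied and logged; this is what produces the
# "dbctl commit (dc=all)" SAL message above
dbctl config commit -m 'Repool db1076 T264755'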

Marostegui claimed this task.

The table comparison came back clean. I have repooled the host.
This host is a candidate master for s2, so it runs stretch and 10.1. It will be replaced during Q2 (T258361), so it is probably not worth rebuilding another host to replace it in the meantime.
If this host needed to become master for any reason, running it with the WT instead of the WB policy because of the broken BBU shouldn't degrade its performance. We can always force WB if we see it falling behind in its replica role.
Resolving this for now.