
db1076 crashed - BBU failure
Closed, Resolved · Public

Description

db1076 crashed today, and the logs show BBU failure:

[13:57:22]  <+icinga-wm>	PROBLEM - Host db1076 is DOWN: PING CRITICAL - Packet loss = 100%
/system1/log1/record17
  Targets
  Properties
    number=17
    severity=Caution
    date=10/06/2020
    time=13:47
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

This host is going away soon as part of the refresh (T264584).

Event Timeline

The battery is gone:

root@db1076:~# hpssacli controller all show detail | grep Battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
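
For completeness, the same controller detail output also shows whether the controller fell back to write-through because of the missing battery. A minimal check, assuming the same hpssacli binary used above (the exact labels can differ slightly between controller generations and firmware):

# Inspect the cache-related lines of the controller status
hpssacli controller all show detail | grep -i cache

# On this host we would expect "No-Battery Write Cache: Disabled" and a
# cache status showing the write cache is not active without a battery.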
Marostegui triaged this task as Medium priority. Oct 6 2020, 2:07 PM
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2020-10-06T14:08:00Z] <marostegui> Reboot db1076 for kernel upgrade T264755

Change 632494 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1076: Add clarification comment on BBU status

https://gerrit.wikimedia.org/r/632494

Change 632494 merged by Marostegui:
[operations/puppet@production] db1076: Add clarification comment on BBU status

https://gerrit.wikimedia.org/r/632494

I have rebooted the host to make sure it boots up cleanly and to get it on the newest kernel.
Let's leave the controller on the write-through (WT) policy; if we see that the server cannot keep up, we can force it to write-back (WB). I don't expect that to happen, as it hasn't happened with other hosts since we migrated to SSDs.
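
If WB ever did need to be forced despite the dead battery, a rough sketch with the same CLI follows; nobatterywritecache is the option I would expect to control this, and the slot number is an assumption, so both should be verified against the controller's actual configuration first:

# Confirm the controller slot first (assumption: slot 0 below)
hpssacli controller all show status

# Keep the write cache enabled even without a working battery. This trades
# durability on power loss for write latency, so it should only be forced
# if the host demonstrably cannot keep up in WT mode.
hpssacli controller slot=0 modify nobatterywritecache=enable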

MySQL has started and recovered from the crash. I am going to let replication catch up and then start a table comparison to make sure we are good on the data consistency side.
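
As an illustration of what such a comparison involves (the actual check is done with the DBA tooling, which compares chunks of rows by primary key between master and replica), a simplified, hypothetical stand-in using CHECKSUM TABLE; host and database names below are placeholders, not the real ones:

# Hypothetical sketch: compare per-table checksums between the master and
# db1076. CHECKSUM TABLE results are only comparable when both sides use
# the same row format, and scanning large tables this way is expensive.
MASTER=db-master.example   # assumption: placeholder for the real s2 master
REPLICA=db1076
DB=somewiki                # assumption: placeholder database name

for TBL in actor archive comment logging page revision text user watchlist; do
  m=$(mysql -h "$MASTER"  -N -e "CHECKSUM TABLE ${DB}.${TBL};" | awk '{print $2}')
  r=$(mysql -h "$REPLICA" -N -e "CHECKSUM TABLE ${DB}.${TBL};" | awk '{print $2}')
  [ "$m" = "$r" ] && echo "$TBL OK" || echo "$TBL MISMATCH ($m vs $r)"
done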

Marostegui added a subscriber: ops-monitoring-bot.

Comparison between the master and this host started for the following tables:

actor actor_id
archive ar_id
change_tag ct_id
comment comment_id
logging log_id
page page_id
revision rev_id
revision_actor_temp revactor_rev
revision_comment_temp revcomment_rev
slots slot_revision_id
text old_id
user user_id
watchlist wl_id
ipblocks ipb_id

Mentioned in SAL (#wikimedia-operations) [2020-10-07T09:09:44Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1076 T264755 ', diff saved to https://phabricator.wikimedia.org/P12943 and previous config saved to /var/cache/conftool/dbconfig/20201007-090943-marostegui.json
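
For context, the repool recorded in the SAL entry above is done through dbctl. A rough sketch of the usual flow, assuming the instance/config subcommands behave as in other SAL entries (the pooling percentage and commit message are illustrative):

# Mark db1076 as pooled again (weight can be ramped up gradually)
dbctl instance db1076 pool -p 100

# Commit the change so it is applied and logged; this is what produces the
# "dbctl commit (dc=all)" SAL message above
dbctl config commit -m 'Repool db1076 T264755'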

Marostegui claimed this task.

The table comparison came back clean. I have repooled the host.
This host is a candidate master for s2, so it runs stretch and 10.1. It will be replaced during Q2 (T258361), so it is probably not worth rebuilding another host to replace it in the meantime.
If this host needed to become master for any reason, running it with the WT instead of the WB policy because of the broken BBU shouldn't degrade its performance. We can always force WB if we see it falling behind in its replica role.
Resolving this for now.