investigate RAID BBU auto-learn on db hosts
Closed, Resolved · Public

Description

db1021 experienced sudden replication lag; gdb showed that threads writing to
disk were blocking on fsync(). Digging around:
    $ megacli -AdpBbuCmd -a0 | grep Relative
    Relative State of Charge: 9 %
The BBU charge was climbing, implying it had been lower earlier, perhaps even
0%. It looks like the write-back cache was disabled in the process.
    $ megacli -AdpBbuCmd -GetBbuProperties -a0 | grep "Learn"
    Auto Learn Period: 90 Days
    Next Learn time: Wed Oct 15 12:04:59 2014
    Learn Delay Interval:0 Hours
    Auto-Learn Mode: Enabled
90 days before that next learn time is right now, which suggests a learn cycle
has just started. Investigate disabling auto-learn on db hosts.
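
A minimal sketch of pulling individual fields out of the output quoted above, so the learn schedule can be compared across hosts. The function name bbu_field is hypothetical, and the sketch assumes megacli keeps printing one "Name: value" pair per line as in the description.

```shell
#!/bin/sh
# bbu_field NAME: print the value of the first "NAME: value" line on stdin.
# Assumes the one-pair-per-line layout shown in the task description.
bbu_field() {
    awk -v key="$1" -F': *' '
        { sub(/^[ \t]+/, "") }                  # tolerate indented output
        $1 == key { sub(/^[^:]*: */, ""); print; exit }
    '
}
```

For example, piping the output of megacli -AdpBbuCmd -GetBbuProperties -a0 into bbu_field "Next Learn time" would print only the date part of that line.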

Event Timeline

rtimport raised the priority of this task to Medium. Dec 18 2014, 1:58 AM
rtimport added a project: ops-core.
rtimport set Reference to rt7916.
Springle created this task. Jul 17 2014, 7:03 PM

Dependency on ticket #7780 added by springle

Consider using "Write Cache OK if Bad BBU" to avoid dropping to WriteThrough
during learning. This would require more monitoring, including of battery state.
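
A rough sketch of how the drop to WriteThrough could be detected. It assumes the logical drive info (for instance from megacli -LDInfo -LAll -aAll) contains a line of the form "Current Cache Policy: WriteThrough, ..." for each logical drive; that line format and both function names are assumptions, not something taken from this task.

```shell
#!/bin/sh
# Print the current write policy (WriteBack or WriteThrough) of each
# logical drive, one per line. Reads megacli LD info on stdin.
# ASSUMPTION: the output has "Current Cache Policy: <policy>, ..." lines.
current_write_policy() {
    sed -n 's/^[[:space:]]*Current Cache Policy:[[:space:]]*\(Write[A-Za-z]*\).*/\1/p'
}

# Return 0 if every logical drive is in WriteBack, 1 if any has fallen
# back to WriteThrough, 2 if no policy line could be parsed.
write_back_active() {
    policies=$(current_write_policy)
    [ -n "$policies" ] || return 2
    case "$policies" in
        *WriteThrough*) return 1 ;;
        *) return 0 ;;
    esac
}
```

Only the "Current" line is inspected, since the default policy can stay at WriteBack while the controller is temporarily running in WriteThrough.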

Yes, this would definitely do it, I've heard about this before.
Note that not too long ago at #7780 I wrote "or even weird statuses such as battery train schedules", referring exactly to something like this.
Faidon

Status changed from 'new' to 'open' by RT_System

Springle reassigned this task from Springle to jcrespo. May 18 2015, 4:06 AM
Springle set Security to None.

This is made more complicated by the switch to HP boxes in CODFW. May be related to, or influenced by, T97998.

jcrespo added a comment. May 19 2015, 10:22 AM

I do not feel comfortable with enforcing "Write Cache OK if Bad BBU", especially with the current status of RAID monitoring.

My proposed solution right now is:

  • Disable the automatic learning phase so that it only warns every 3 months:

    echo "autoLearnMode=2" > BbuProperties

    megacli -AdpBbuCmd -SetBbuProperties -f BbuProperties -a0

    (This doesn't work on all RAID controllers; on some we will have to disable auto-learn entirely with "autoLearnMode=1".)

    Worst-case scenario: no learn cycle is ever done, and as a result an actual hardware defect is not detected.

    From the manual:

"Warn Via Event: The firmware warns about a pending learning cycle. You can initiate a learning cycle manually.
After the learning cycle is complete, the firmware resets the counter and warns you when the next learning cycle
time is reached."

  • Schedule maintenance manually, making sure it includes a manual learning phase: $ megacli -AdpBbuCmd -BbuLearn -a0. This should be done on T84050 by detecting the RAID warning and sending a warning to icinga.

Once battery checks are in place (they are still WIP, and most of the RAID-using hosts do not check the battery in nagios), add that RAID alert to nagios if necessary.
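
As an illustration of what such a battery check could look like, here is a minimal sketch in the usual Nagios plugin style (exit code 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name check_bbu_charge and the thresholds are hypothetical, and it is not the check being worked on in T84050; it only reads the "Relative State of Charge" line shown in the description.

```shell
#!/bin/sh
# check_bbu_charge WARN CRIT: read megacli BBU status on stdin and map
# the relative charge to a Nagios-style status line and return code.
# WARN and CRIT are percentages chosen by the caller (illustrative only).
check_bbu_charge() {
    warn=$1
    crit=$2
    # Extract the integer from "Relative State of Charge: 9 %".
    pct=$(sed -n 's/^[[:space:]]*Relative State of Charge:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1)
    if [ -z "$pct" ]; then
        echo "UNKNOWN: no Relative State of Charge line found"
        return 3
    fi
    if [ "$pct" -le "$crit" ]; then
        echo "CRITICAL: BBU charge ${pct}%"
        return 2
    elif [ "$pct" -le "$warn" ]; then
        echo "WARNING: BBU charge ${pct}%"
        return 1
    fi
    echo "OK: BBU charge ${pct}%"
    return 0
}
```

It could be fed with something like megacli -AdpBbuCmd -a0 | check_bbu_charge 50 20, with the script exiting on the function's return code.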

jcrespo added a project: Patch-For-Review. May 19 2015, 6:46 PM

For some reason, the patch wasn't caught:
https://gerrit.wikimedia.org/r/#/c/212027/

jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board.
jcrespo closed this task as Resolved. Sep 2 2015, 4:17 PM

All database hosts that allow "Warn only" have been set up as such. On the few that didn't, auto-learn has been disabled. This is the script that has been run:

salt '<hosts>' cmd.run 'echo "autoLearnMode=1" > /tmp/BbuProperties; megacli -AdpBbuCmd -SetBbuProperties -f /tmp/BbuProperties -a0; echo "autoLearnMode=2" > /tmp/BbuProperties; megacli -AdpBbuCmd -SetBbuProperties -f /tmp/BbuProperties -a0; rm /tmp/BbuProperties'

This should fix the immediate issues, and maybe in the future we can revisit better monitoring or a plan for scheduled BBU maintenance.

Restricted Application added a subscriber: Matanya. Sep 2 2015, 4:17 PM
jcrespo changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Sep 2 2015, 4:18 PM