investigate RAID BBU auto-learn on db hosts
Closed, ResolvedPublic

Assigned To

Authored By

	• Springle
	Jul 17 2014, 7:03 PM

Description

db1021 experienced sudden replag, which gdb showed was threads writing to disk
blocking on fsync(). Digging around:
megacli -AdpBbuCmd -a0 | grep Relative
Relative State of Charge: 9 %
And BBU charge was climbing, implying it was lower, perhaps even 0%? Looks like
the write-back cache was disabled in the process.
megacli -AdpBbuCmd -GetBbuProperties -a0 | grep "Learn"
Auto Learn Period: 90 Days
Next Learn time: Wed Oct 15 12:04:59 2014
Learn Delay Interval:0 Hours
Auto-Learn Mode: Enabled
90 days before that is right now. Investigate disabling auto-learn on db hosts.
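
For reference, a minimal sketch of how the fields above could be extracted and the "90 days before" arithmetic checked. `bbu_field` and `previous_learn_date` are hypothetical helper names (not part of megacli), and the date calculation assumes GNU `date`:

```shell
#!/bin/sh
# Sketch only: helpers for reading `megacli -AdpBbuCmd -GetBbuProperties -a0` output.
# bbu_field and previous_learn_date are illustrative names, not megacli features.

bbu_field() {
    # $1 = field label, e.g. "Auto Learn Period"; megacli output is read from stdin.
    # Only the label and the first colon are stripped, so values that contain
    # colons (such as the time in "Next Learn time") are kept whole.
    sed -n "s/^[[:space:]]*$1[[:space:]]*:[[:space:]]*//p" | head -n 1
}

previous_learn_date() {
    # $1 = next learn date (YYYY-MM-DD), $2 = auto-learn period in days.
    # Assumes GNU date, which accepts relative items such as "90 days ago".
    date -u -d "$1 $2 days ago" +%Y-%m-%d
}
```

For example, `megacli -AdpBbuCmd -GetBbuProperties -a0 | bbu_field 'Auto-Learn Mode'` would print the mode, and `previous_learn_date 2014-10-15 90` gives the date of the learn cycle that preceded the scheduled one.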

Details

Reference: rt7916

	Subject	Repo	Branch	Lines +/-
	Create new module for managing RAID settings	operations/puppet	production	+58 -0


Related Objects

Status	Assigned	Task
Resolved	jcrespo	T84178 investigate RAID BBU auto-learn on db hosts
Resolved	faidon	T84050 Refactor RAID checks (check-raid)
Open	None	T83476 Icinga RAID check: monitor rebuild status
Resolved	faidon	T97998 Add RAID monitoring for HP servers
Resolved	herron	T141252 icinga hp raid check timeout on busy ms-be and db machines
Resolved	herron	T172921 Nrpe command_timeout and "Service Check Timed Out" errors

Event Timeline

• rtimport raised the priority of this task from to Medium. Dec 18 2014, 1:58 AM

• rtimport added a project: ops-core.

• rtimport set Reference to rt7916.

• Springle created this task. Jul 17 2014, 7:03 PM

Dependency on ticket #7780 added by springle

Consider using "Write Cache OK if Bad BBU" to avoid dropping to WriteThrough
during learning. This would require more monitoring, including battery state.

Yes, this would definitely do it, I've heard about this before.
Note that not too long ago at #7780 I wrote "or even weird statuses such as battery train schedules", referring exactly to something like this.
Faidon

Status changed from 'new' to 'open' by RT_System

• Springle reassigned this task from • Springle to jcrespo. May 18 2015, 4:06 AM

• Springle set Security to None.

This is made more complicated by the switch to HP boxes in CODFW. May be related to, or influenced by, T97998.

Just for reference:

I do not feel comfortable enforcing "Write Cache OK if Bad BBU", especially given the current status of RAID monitoring.

My proposed solution right now is:

Disable the automatic learning phase so that it only warns every 3 months:

echo "autoLearnMode=2" > BbuProperties

megacli -AdpBbuCmd -SetBbuProperties -f BbuProperties -a0

(This doesn't work on all RAID controllers; on some we will have to disable auto-learn entirely with "autoLearnMode=1".)

Worst-case scenario with this approach: no learn cycle is ever done, instead of an actual hardware defect going undetected.

From the manual:

"Warn Via Event: The firmware warns about a pending learning cycle. You can initiate a learning cycle manually.
After the learning cycle is complete, the firmware resets the counter and warns you when the next learning cycle
time is reached."

Schedule maintenance manually, making sure it includes a manual learning phase: $ megacli -AdpBbuCmd -BbuLearn -a0. This should be done on T84050 by detecting the RAID warning and sending a warning to icinga.

When battery checks are in place (aside from being WIP, most of the RAID-using hosts are not checking it in nagios), add that RAID alert to nagios if necessary.

For some reason, the patch wasn't caught:
https://gerrit.wikimedia.org/r/#/c/212027/

jcrespo mentioned this in rOMWCe389c23359d5: Depooling db1063 for maintenance (sw upgrade and RAID maintenance). May 20 2015, 6:41 AM

jcrespo added a project: DBA. Jun 2 2015, 2:25 PM

jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board.

jcrespo added a comment. Jun 8 2015, 5:59 PM

This comment was removed by jcrespo.

All database hosts that allow "Warn only" have been set up as such. On the few that don't, auto-learn has been disabled. This is the script that was run:

salt '<hosts>' cmd.run 'echo "autoLearnMode=1" > /tmp/BbuProperties; megacli -AdpBbuCmd -SetBbuProperties -f /tmp/BbuProperties -a0; echo "autoLearnMode=2" > /tmp/BbuProperties; megacli -AdpBbuCmd -SetBbuProperties -f /tmp/BbuProperties -a0; rm /tmp/BbuProperties'

This should fix the immediate issues; in the future we can revisit better monitoring or a plan for scheduled BBU maintenance.
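
For readability, a sketch of the per-host logic in the salt one-liner: auto-learn is disabled first ("autoLearnMode=1"), then warn-only ("autoLearnMode=2") is attempted, so controllers that reject mode 2 simply stay disabled. The function names, the use of `mktemp`, and the `MEGACLI` override variable (for dry runs) are assumptions added for illustration, as is the idea that megacli exits non-zero when it rejects a mode:

```shell
#!/bin/sh
# Sketch only: the same sequence as the salt one-liner, split into functions.
# MEGACLI is a hypothetical override hook (e.g. MEGACLI=echo for a dry run).
MEGACLI="${MEGACLI:-megacli}"

set_auto_learn_mode() {
    # $1 = autoLearnMode value (per this task: 1 = disabled, 2 = warn only).
    props=$(mktemp) || return 1
    echo "autoLearnMode=$1" > "$props"
    "$MEGACLI" -AdpBbuCmd -SetBbuProperties -f "$props" -a0
    status=$?
    rm -f "$props"
    return "$status"
}

apply_bbu_policy() {
    # Disable auto-learn first, so hosts that reject warn-only stay disabled.
    set_auto_learn_mode 1
    # Then try warn-only. Assumption: megacli exits non-zero on rejection;
    # the original one-liner ignores the result entirely.
    set_auto_learn_mode 2 || echo "warn-only rejected; auto-learn left disabled" >&2
}
```

Running `MEGACLI=echo apply_bbu_policy` prints the two megacli invocations without touching the controller.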

Restricted Application added a subscriber: Matanya. Sep 2 2015, 4:17 PM

jcrespo changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Sep 2 2015, 4:18 PM

faidon closed subtask T84050: Refactor RAID checks (check-raid) as Resolved. May 30 2016, 10:01 PM
