
db1051 database host BBU issues
Closed, Resolved · Public

Description

Error says:

PROBLEM - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough

It has been temporarily depooled on: https://gerrit.wikimedia.org/r/406847

Event Timeline

jcrespo created this task. Jan 30 2018, 8:16 PM
Restricted Application added a subscriber: Aklapper. Jan 30 2018, 8:16 PM

Mentioned in SAL (#wikimedia-operations) [2018-01-31T07:08:10Z] <marostegui> Force BBU relearn on db1051 - T186049

root@db1051:~#  megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 15 %
  Absolute State of charge: 0 %
  Remaining Capacity: 82 mAh
  Full Charge Capacity: 581 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 7 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 4
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 07/18, 2011
  Design Capacity: 90 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name:
  Firmware Version   : 0148 03
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0148 03
  Transparent Learn = 1
  App Data = 1

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
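
For readers who want to check this kind of output automatically, here is a minimal sketch, not the check used on this host, that parses a few lines copied from the `megacli -AdpBbuCmd -a0` output above. The function names `parse_bbu` and `battery_is_optimal` are hypothetical, and the parser assumes every field sits on its own line in either `Key: value` or `Key = value` form. Section headings such as `BBU status for Adapter: 0` also match that pattern and end up in the result.

```python
import re


def parse_bbu(output: str) -> dict:
    """Collect 'Key: value' and 'Key = value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in output.splitlines():
        # Split on the first ':' or '=' and skip lines that have no value after it.
        match = re.match(r"\s*([^:=]+?)\s*[:=]\s*(.*\S)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def battery_is_optimal(fields: dict) -> bool:
    """True only when the controller reports 'Battery State: Optimal'."""
    return fields.get("Battery State") == "Optimal"


# Lines copied from the output above.
sample = """\
BatteryType: BBU
Battery State: Unknown
  Relative State of Charge: 15 %
  Remaining Capacity: 82 mAh
  Full Charge Capacity: 581 mAh
Max Error = 0 %
"""

fields = parse_bbu(sample)
print(fields["Battery State"], battery_is_optimal(fields))
```

Comparing against `Optimal` mirrors the `grep Optimal` check used later in this task; any other state, including `Unknown`, is treated as needing attention.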

I have forced a relearn in case this is a one-time problem (most likely it is not).
If it fails again, we should replace the BBU.

After the relearn:

root@db1051:~# megacli -AdpBbuCmd  -a0 | grep Optimal
Battery State: Optimal
root@db1051:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
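
The alert in the description fires when the logical drive is not using WriteBack. As a rough sketch of that condition, and not the actual monitoring check, the snippet below reads the `Current Cache Policy` line from the `megacli -ldinfo` output above. The function names `current_cache_policy` and `is_write_back` are hypothetical, and the sketch assumes the write policy is the first comma-separated field on that line, as it is in this output.

```python
def current_cache_policy(ldinfo_output: str) -> list:
    """Return the comma-separated fields of the 'Current Cache Policy' line."""
    for line in ldinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Current Cache Policy":
            return [field.strip() for field in value.split(",")]
    raise ValueError("no 'Current Cache Policy' line found")


def is_write_back(ldinfo_output: str) -> bool:
    """True when the first field of the current policy is WriteBack."""
    return current_cache_policy(ldinfo_output)[0] == "WriteBack"


# Lines copied from the output above.
sample = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""

print(is_write_back(sample))
```

Only the current policy is checked, because the default policy can remain WriteBack while the controller falls back to WriteThrough on a bad BBU.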

Marostegui moved this task from Triage to In progress on the DBA board. Jan 31 2018, 12:12 PM
Marostegui added a subscriber: Cmjohnson.

This server definitely needs a BBU replacement.
@Cmjohnson can you let us know a day that works for you to get it replaced?

Thanks!

Restricted Application added a project: Operations. Feb 1 2018, 7:00 AM
Dzahn assigned this task to Cmjohnson. Feb 1 2018, 11:33 PM
Dzahn triaged this task as Normal priority.
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Feb 2 2018, 4:39 PM

@Marostegui Let's do this Tuesday (my morning) at 15:00 UTC.

Tuesday 6 Feb

Great! Will have the server ready by then
Thanks!

Mentioned in SAL (#wikimedia-operations) [2018-02-06T14:53:01Z] <marostegui> Poweroff db1051 for BBU replacement - T186049

@Cmjohnson this server is now off.
Feel free to power it on once you've done the replacement

Thanks!

BBU is now charging
Thanks Chris!

Marostegui closed this task as Resolved. Feb 7 2018, 6:52 AM

After a full recharge it looks good now:

root@db1051:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3936 mV
Current: 0 mA
Temperature: 48 C
Battery State: Optimal
BBU Firmware Status:

The RAID is back to WriteBack (WB):

root@db1051:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

Thanks again Chris!

Change 408761 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1051 on vslow

https://gerrit.wikimedia.org/r/408761

Change 408761 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1051 on vslow

https://gerrit.wikimedia.org/r/408761

Mentioned in SAL (#wikimedia-operations) [2018-02-07T07:00:44Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Start repooling db1051 after the BBU change - T186049 (duration: 01m 15s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T09:52:41Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Fully repool db1051 after the BBU change - T186049 (duration: 01m 14s)