
db1059 BBU issues
Closed, Resolved, Public

Description

db1059 has been complaining through the night about the RAID write cache policy:

02:17 < icinga-wm> PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
02:37 < icinga-wm> RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
03:17 < icinga-wm> RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
05:17 < icinga-wm> PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
05:47 < icinga-wm> RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
06:17 < icinga-wm> PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
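The alerts above show the policy flapping between WriteBack and WriteThrough. As a rough illustration only, the following Python sketch pairs each PROBLEM with the next RECOVERY and reports how long each WriteThrough window lasted. The function names are hypothetical, the alert lines are abbreviated copies of the ones quoted above, and the sketch assumes all timestamps fall on the same day.

```python
import re

# Abbreviated copies of the IRC alerts quoted above (message tails trimmed).
ALERTS = """\
02:17 < icinga-wm> PROBLEM - MegaRAID on db1059 is CRITICAL
02:37 < icinga-wm> RECOVERY - MegaRAID on db1059 is OK
03:17 < icinga-wm> RECOVERY - MegaRAID on db1059 is OK
05:17 < icinga-wm> PROBLEM - MegaRAID on db1059 is CRITICAL
05:47 < icinga-wm> RECOVERY - MegaRAID on db1059 is OK
06:17 < icinga-wm> PROBLEM - MegaRAID on db1059 is CRITICAL
"""

ALERT_RE = re.compile(r"(\d{2}:\d{2}) < icinga-wm> (PROBLEM|RECOVERY)")


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight (same-day assumption)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def problem_windows(text):
    """Return (closed_windows, open_since).

    closed_windows is a list of (start, end, minutes) tuples;
    open_since is the start time of an unrecovered PROBLEM, or None.
    """
    windows = []
    start = None
    for line in text.splitlines():
        match = ALERT_RE.match(line)
        if not match:
            continue
        stamp, kind = match.groups()
        if kind == "PROBLEM" and start is None:
            start = stamp
        elif kind == "RECOVERY" and start is not None:
            windows.append((start, stamp, to_minutes(stamp) - to_minutes(start)))
            start = None
        # A RECOVERY with no open PROBLEM (e.g. 03:17) is ignored.
    return windows, start


windows, open_since = problem_windows(ALERTS)
for begin, end, length in windows:
    print(f"WriteThrough from {begin} to {end} ({length} min)")
if open_since:
    print(f"Still in WriteThrough since {open_since}")
```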
root@db1059:~# megacli -AdpBbuCmd -aAll

BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 18 %
  Absolute State of charge: 0 %
  Remaining Capacity: 89 mAh
  Full Charge Capacity: 519 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 7 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 2
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 07/18, 2011
  Design Capacity: 90 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name:
  Firmware Version   : 0148 03
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0148 03
  Transparent Learn = 1
  App Data = 1

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled
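For anyone who wants to check these fields on other hosts, here is a minimal Python sketch that pulls a few values out of the BBU output shown above. It is not how the Icinga check works; `parse_bbu`, `leading_int`, `bbu_looks_unhealthy` and the 50 % threshold are all hypothetical choices made for illustration.

```python
import re

# Excerpt of the `megacli -AdpBbuCmd -aAll` output quoted above.
SAMPLE = """\
Battery State: Unknown
  Relative State of Charge: 18 %
  Remaining Capacity: 89 mAh
  Full Charge Capacity: 519 mAh
Max Error = 0 %
  Auto-Learn Mode: Disabled
"""

# The output mixes "Key: value" and "Key = value" lines.
FIELD_RE = re.compile(r"\s*([^:=]+?)\s*[:=]\s*(.*\S)\s*$")


def parse_bbu(text):
    """Return a dict mapping field names to their raw string values."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def leading_int(value):
    """Extract the leading integer from values such as '18 %' or '89 mAh'."""
    return int(re.match(r"\d+", value).group())


def bbu_looks_unhealthy(fields, min_charge_pct=50):
    """Hypothetical rule: flag an Unknown state or a low relative charge."""
    state = fields.get("Battery State", "Unknown")
    charge = leading_int(fields.get("Relative State of Charge", "0"))
    return state == "Unknown" or charge < min_charge_pct


fields = parse_bbu(SAMPLE)
print(fields["Battery State"], leading_int(fields["Relative State of Charge"]))
print("unhealthy:", bbu_looks_unhealthy(fields))
```

Fields with an empty value (such as `Manufacture Name:` in the full output) are skipped by the pattern, so a missing key should be treated as "not reported".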

On restart:

The battery is currently discharged or disconnected. VDs configured in
write-back mode will run in write-through mode to protect your data and will
return to write-back policy when the battery is operational. If VDs have not
returned to write-back mode after 30 minutes of charging then contact
technical support for additional assistance.
The following VDs are affected: 00
Press any key to continue.

@Cmjohnson this host is out of warranty, but do we have spare BBUs from decommissioned hosts that we can use here?

Event Timeline

Restricted Application added a project: Operations. · Jan 4 2018, 6:22 AM
Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2018-01-04T06:23:20Z] <marostegui> Issue a BBU re-learn cycle on db1059 - T184160

Time: Fri Nov 24 23:39:07 2017
Event Description: Battery started charging
Time: Fri Nov 24 23:46:42 2017
Event Description: Battery charge complete
Time: Sun Nov 26 08:04:47 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 12 16:47:22 2017
Event Description: Battery Present
Event Description: Battery temperature is normal
Event Description: Current capacity of the battery is above threshold
Time: Tue Dec 12 16:47:46 2017
Event Description: Time established as 12/12/17 16:47:46; (54 seconds since power on)
Time: Sun Dec 17 22:33:28 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Sun Dec 17 23:42:48 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Mon Dec 18 00:02:18 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Mon Dec 18 08:01:08 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Mon Dec 18 09:31:03 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Mon Dec 18 14:30:03 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Mon Dec 18 15:10:08 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Mon Dec 18 16:09:43 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 01:57:58 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 05:57:23 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 09:17:48 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 09:37:18 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 09:56:48 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 10:17:23 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 10:36:53 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 11:36:28 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 12:37:08 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 12:56:38 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 13:16:08 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 15:56:28 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 16:46:18 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 17:55:38 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Tue Dec 19 22:55:43 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Wed Dec 20 03:14:38 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Wed Dec 20 03:44:58 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Wed Dec 20 12:43:23 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Wed Dec 20 13:23:28 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Wed Dec 20 18:22:28 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Thu Dec 21 07:31:08 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Thu Dec 21 10:30:58 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Thu Dec 21 17:49:43 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Thu Dec 21 22:29:13 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Fri Dec 22 02:39:28 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Fri Dec 22 13:07:48 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Sat Dec 23 00:27:03 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Sat Dec 23 09:15:43 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Sat Dec 23 12:35:03 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Sat Dec 23 14:45:03 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Sat Dec 23 19:44:03 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Wed Jan  3 22:38:58 2018
Event Description: Current capacity of the battery is below threshold
Time: Thu Jan  4 06:23:43 2018
Event Description: Battery relearn pending: Battery is under charge
Time: Thu Jan  4 06:24:48 2018
Event Description: Battery charging was suspended due to high battery temperature
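To see how often charging was suspended, the event log above can be tallied per day. The sketch below is only an illustration: it assumes the `Time:` / `Event Description:` layout shown above, uses hypothetical function names, embeds a short excerpt of the log rather than the whole thing, and relies on English day and month names when parsing timestamps.

```python
from collections import Counter
from datetime import datetime

# Short excerpt of the controller event log quoted above.
SAMPLE = """\
Time: Tue Dec 12 16:47:22 2017
Event Description: Battery Present
Event Description: Battery temperature is normal
Time: Sun Dec 17 22:33:28 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Mon Dec 18 00:02:18 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Mon Dec 18 08:01:08 2017
Event Description: Battery charging was suspended due to high battery temperature
Time: Wed Jan  3 22:38:58 2018
Event Description: Current capacity of the battery is below threshold
"""

TIME_PREFIX = "Time: "
EVENT_PREFIX = "Event Description: "
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # assumes an English/C locale


def parse_events(text):
    """Return a list of (timestamp, description) tuples.

    Several descriptions may follow one Time line; each of them
    inherits that timestamp.
    """
    events = []
    current = None
    for line in text.splitlines():
        if line.startswith(TIME_PREFIX):
            current = datetime.strptime(line[len(TIME_PREFIX):].strip(), TIME_FORMAT)
        elif line.startswith(EVENT_PREFIX) and current is not None:
            events.append((current, line[len(EVENT_PREFIX):].strip()))
    return events


def suspensions_per_day(events):
    """Count high-temperature charging suspensions per calendar day."""
    return Counter(
        stamp.date()
        for stamp, description in events
        if "high battery temperature" in description
    )


events = parse_events(SAMPLE)
per_day = suspensions_per_day(events)
for day in sorted(per_day):
    print(day.isoformat(), per_day[day])
```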

After the manual relearn:

˜/icinga-wm 7:37> RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

Don't know for how long it will last

Marostegui triaged this task as Normal priority. · Jan 4 2018, 6:45 AM
Marostegui moved this task from Triage to In progress on the DBA board.
PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough

We should replace the BBU

˜/icinga-wm 10:13> PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough

Mentioned in SAL (#wikimedia-operations) [2018-01-08T09:14:00Z] <marostegui> Force BBU relearn on db1059 - T184160

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · Jan 9 2018, 3:20 PM

@Marostegui I have a used spare battery we can swap this out with. LMK when you want to schedule this.

@Cmjohnson do you want me to power off the server so we can do it now?

As per our chat, this will be done tomorrow

jcrespo renamed this task from db1059 possibly BBU issues to db1059 BBU issues. · Jan 10 2018, 7:13 PM
jcrespo updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-01-11T06:17:20Z] <marostegui> Force BBU relearn on db1059 - T184160

Swapped the BBU. Leaving this open to confirm everything is okay.

jcrespo closed this task as Resolved. · Jan 11 2018, 6:15 PM
jcrespo added a subscriber: jcrespo.

The Icinga check says things are OK. We will reopen if the alerts reappear. Thank you for the help!