
db1130 BBU possible issues
Closed, Resolved (Public)

Description

Creating this just for the record.
It looks like db1130 is having issues with the BBU, and the current cache policy has fallen back from WriteBack to WriteThrough:

root@db1130:~#  megacli -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 4.364 TB
Sector Size         : 512
Is VD emulated      : No
Mirror Data         : 4.364 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 6
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
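
The fallback shown above can be spotted by comparing the two policy lines. The following is a minimal Python sketch, not part of the original task; the function names and the SAMPLE string are illustrative assumptions, and it only reads text that has already been captured from megacli -LDInfo.

```python
import re

def cache_policies(ldinfo_text):
    """Return a (default, current) write-policy pair per virtual drive."""
    pairs = []
    default = None
    for line in ldinfo_text.splitlines():
        m = re.match(r"\s*(Default|Current) Cache Policy\s*:\s*(\w+)", line)
        if not m:
            continue
        kind, policy = m.groups()
        if kind == "Default":
            default = policy
        else:
            pairs.append((default, policy))
            default = None
    return pairs

def degraded(ldinfo_text):
    """Keep only the drives whose current policy differs from the default."""
    return [pair for pair in cache_policies(ldinfo_text) if pair[0] != pair[1]]

# Two lines copied from the output above.
SAMPLE = """\
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
"""

print(degraded(SAMPLE))
```

Only the first word of each policy line (the write policy) is compared, since that is the part that changes when the controller stops trusting the battery.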

These are the HW logs:

-------------------------------------------------------------------------------
Record:      14
Date/Time:   10/30/2019 01:16:24
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   10/30/2019 01:18:34
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   12/13/2019 18:14:21
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   12/13/2019 18:21:56
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   12/13/2019 20:13:31
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   12/13/2019 20:21:06
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   12/13/2019 21:14:11
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   12/13/2019 21:21:46
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   12/13/2019 22:13:46
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   12/13/2019 22:21:21
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   12/13/2019 23:13:21
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   12/13/2019 23:22:01
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   12/14/2019 00:12:56
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   12/14/2019 00:21:31
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   12/14/2019 01:12:31
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   12/14/2019 01:22:16
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      30
Date/Time:   12/14/2019 02:13:06
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      31
Date/Time:   12/14/2019 02:21:51
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      32
Date/Time:   12/14/2019 03:12:46
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      33
Date/Time:   12/14/2019 03:22:26
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      34
Date/Time:   12/14/2019 04:12:21
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   12/14/2019 04:22:06
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   12/14/2019 05:11:56
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   12/14/2019 05:21:36
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      38
Date/Time:   12/14/2019 06:11:31
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      39
Date/Time:   12/14/2019 06:22:16
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      40
Date/Time:   12/14/2019 07:12:06
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      41
Date/Time:   12/14/2019 07:21:56
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      42
Date/Time:   12/14/2019 08:11:46
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      43
Date/Time:   12/14/2019 08:22:31
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      44
Date/Time:   12/14/2019 09:11:16
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      45
Date/Time:   12/14/2019 09:22:11
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      46
Date/Time:   12/14/2019 10:10:56
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      47
Date/Time:   12/14/2019 10:12:01
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      48
Date/Time:   12/14/2019 11:10:31
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      49
Date/Time:   12/14/2019 11:12:41
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      50
Date/Time:   12/14/2019 12:10:06
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      51
Date/Time:   12/14/2019 12:12:11
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      52
Date/Time:   12/14/2019 13:10:46
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      53
Date/Time:   12/14/2019 13:12:56
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      54
Date/Time:   12/14/2019 14:10:16
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      55
Date/Time:   12/14/2019 14:12:31
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      56
Date/Time:   12/14/2019 15:09:56
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      57
Date/Time:   12/14/2019 15:12:01
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      58
Date/Time:   12/14/2019 16:09:26
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      59
Date/Time:   12/14/2019 16:12:46
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      60
Date/Time:   12/14/2019 17:09:06
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      61
Date/Time:   12/14/2019 17:12:16
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      62
Date/Time:   12/14/2019 18:08:36
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      63
Date/Time:   12/14/2019 18:13:01
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      64
Date/Time:   12/14/2019 19:09:21
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      65
Date/Time:   12/14/2019 19:12:36
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      66
Date/Time:   12/14/2019 20:08:56
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      67
Date/Time:   12/14/2019 20:13:11
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      68
Date/Time:   12/14/2019 21:08:26
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      69
Date/Time:   12/14/2019 21:12:51
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      70
Date/Time:   12/14/2019 21:52:56
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      71
Date/Time:   12/14/2019 22:02:41
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      72
Date/Time:   12/14/2019 22:08:06
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      73
Date/Time:   12/14/2019 22:12:21
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      74
Date/Time:   12/14/2019 23:07:36
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      75
Date/Time:   12/14/2019 23:13:06
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      76
Date/Time:   12/15/2019 00:08:21
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      77
Date/Time:   12/15/2019 00:12:41
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      78
Date/Time:   12/15/2019 01:07:56
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      79
Date/Time:   12/15/2019 01:23:06
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      80
Date/Time:   12/15/2019 02:07:31
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      81
Date/Time:   12/15/2019 02:12:56
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      82
Date/Time:   12/15/2019 03:02:46
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      83
Date/Time:   12/15/2019 03:23:21
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      84
Date/Time:   12/15/2019 03:33:01
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      85
Date/Time:   12/15/2019 03:42:51
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      86
Date/Time:   12/15/2019 03:52:31
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      87
Date/Time:   12/15/2019 04:13:11
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      88
Date/Time:   12/15/2019 04:22:56
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      89
Date/Time:   12/15/2019 04:32:41
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      90
Date/Time:   12/15/2019 05:01:56
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      91
Date/Time:   12/15/2019 05:12:46
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      92
Date/Time:   12/15/2019 05:43:06
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      93
Date/Time:   12/15/2019 05:52:46
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      94
Date/Time:   12/15/2019 06:02:36
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      95
Date/Time:   12/15/2019 06:13:21
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      96
Date/Time:   12/15/2019 06:23:11
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      97
Date/Time:   12/15/2019 06:32:51
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      98
Date/Time:   12/15/2019 06:43:46
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      99
Date/Time:   12/15/2019 06:53:31
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      100
Date/Time:   12/15/2019 07:06:31
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      101
Date/Time:   12/15/2019 07:13:01
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      102
Date/Time:   12/15/2019 08:02:51
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      103
Date/Time:   12/15/2019 08:13:41
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      104
Date/Time:   12/15/2019 08:23:26
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      105
Date/Time:   12/15/2019 08:33:06
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      106
Date/Time:   12/15/2019 09:03:31
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      107
Date/Time:   12/15/2019 09:13:11
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      108
Date/Time:   12/15/2019 09:23:01
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      109
Date/Time:   12/15/2019 09:33:51
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      110
Date/Time:   12/15/2019 09:43:36
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      111
Date/Time:   12/15/2019 10:03:01
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      112
Date/Time:   12/15/2019 10:05:16
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      113
Date/Time:   12/15/2019 10:43:11
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      114
Date/Time:   12/15/2019 10:54:01
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      115
Date/Time:   12/15/2019 11:32:56
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      116
Date/Time:   12/15/2019 11:53:36
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      117
Date/Time:   12/15/2019 12:03:21
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      118
Date/Time:   12/15/2019 12:05:31
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
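
To see how often the battery flaps and how long each low episode lasts, the records above can be parsed and paired. This is a rough Python sketch under the assumption that the log keeps the exact layout shown here (dashed separators, "Key: value" lines, MM/DD/YYYY timestamps); the function names are made up for illustration.

```python
from datetime import datetime

def parse_records(log_text):
    """Split the hardware log into one dict per record."""
    records, current = [], {}
    for line in log_text.splitlines():
        if line.startswith("---"):
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records

def low_battery_episodes(records):
    """Pair each 'battery is low' record with the next 'operating normally' one."""
    episodes, low_at = [], None
    for rec in records:
        when = datetime.strptime(rec["Date/Time"], "%m/%d/%Y %H:%M:%S")
        if "battery is low" in rec["Description"]:
            low_at = when
        elif "operating normally" in rec["Description"] and low_at is not None:
            episodes.append((low_at, when - low_at))
            low_at = None
    return episodes

# Records 14 to 17 from the log above.
SAMPLE_LOG = """\
-------------------------------------------------------------------------------
Record:      14
Date/Time:   10/30/2019 01:16:24
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   10/30/2019 01:18:34
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   12/13/2019 18:14:21
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   12/13/2019 18:21:56
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
"""

for started, duration in low_battery_episodes(parse_records(SAMPLE_LOG)):
    print(started, duration)
```

A trailing "low" record with no matching recovery (such as record 118) is simply left unpaired by this sketch.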

Looks like re-learn was enabled:

root@db1130:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 59 %
  Absolute State of charge: 0 %
  Remaining Capacity: 94 mAh
  Full Charge Capacity: 161 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 8 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 11
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 0 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x129
  Firmware Version   : 0.6
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0.6
  Transparent Learn = 1
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: Tue Jan 28 00:40:07 2020
  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Transparent

Exit Code: 0x00
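
For checks like this one, the megacli -AdpBbuCmd output can be turned into a dictionary and inspected. The sketch below is an assumption-laden Python helper, not an existing tool: parse_bbu and bbu_warnings are hypothetical names, and treating any Auto-Learn Mode other than "Disabled" as worth flagging reflects only the intent of this task.

```python
import re

def parse_bbu(text):
    """Collect 'key: value' and 'key = value' lines; the first occurrence wins."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*([^:=]+?)\s*[:=]\s*(.*)$", line)
        if m and m.group(2):
            fields.setdefault(m.group(1), m.group(2).strip())
    return fields

def bbu_warnings(fields):
    """Flag the two conditions this task cares about."""
    warnings = []
    if fields.get("Battery State") != "Optimal":
        warnings.append("battery state is %s" % fields.get("Battery State"))
    if fields.get("Auto-Learn Mode") != "Disabled":
        warnings.append("auto-learn mode is %s" % fields.get("Auto-Learn Mode"))
    return warnings

# A few lines copied from the output above.
SAMPLE_BBU = """\
BatteryType: BBU
Battery State: Unknown
  Relative State of Charge: 59 %
Max Error = 0 %
  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Transparent
"""

print(bbu_warnings(parse_bbu(SAMPLE_BBU)))
```

Lines with an empty value (for example "Device Name:") are skipped rather than stored as empty strings.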

I have forced a re-learn cycle:

root@db1130:~# megacli -AdpBbuCmd -BbuLearn -aAll

Adapter 0: BBU Learn Succeeded.

And we got the recovery:

[07:01:21]  <+icinga-wm>	RECOVERY - MegaRAID on db1130 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring

And HW logs:

-------------------------------------------------------------------------------
Record:      119
Date/Time:   12/16/2019 05:58:02
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------

Let's see if the flapping stops after the re-learn.
We should check (and disable if enabled) the learning mode for the following hosts:

db11[21-38]
db21[03-35]

Event Timeline

Marostegui triaged this task as Medium priority. Dec 16 2019, 7:06 AM
Marostegui added a project: DBA.

Mentioned in SAL (#wikimedia-operations) [2019-12-16T07:27:19Z] <marostegui> Disable auto-learn on db[1126-1138].eqiad.wmnet T240823

Operating on db[1126-1138].eqiad.wmnet

[07:22:35] marostegui@cumin1001:~$ sudo cumin db11[26-38].eqiad.wmnet 'megacli -AdpBbuCmd  -a0 | grep "Auto-Learn"'
13 hosts will be targeted:
db[1126-1138].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(13) db[1126-1138].eqiad.wmnet
----- OUTPUT of 'megacli -AdpBbuC...rep "Auto-Learn"' -----
  Auto-Learn Mode: Transparent
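
The host expression above, db11[26-38].eqiad.wmnet, is a bracketed numeric range. As a rough illustration of why it resolves to 13 hosts, here is a small Python sketch; expand_range is a hypothetical helper that handles only a single [start-end] range and is not how cumin itself is implemented.

```python
import re

def expand_range(pattern):
    """Expand one '[start-end]' numeric range, keeping zero padding."""
    m = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if not m:
        return [pattern]
    start, end = m.group(1), m.group(2)
    width = len(start)  # preserve leading zeros, e.g. '03'
    prefix, suffix = pattern[:m.start()], pattern[m.end():]
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]

hosts = expand_range("db11[26-38].eqiad.wmnet")
print(len(hosts), hosts[0], hosts[-1])
```

The same helper applied to db21[03-35].codfw.wmnet yields db2103 through db2135, matching the 33 codfw hosts targeted later in this task.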

Disabling it:

[07:23:06] marostegui@cumin1001:~$ sudo cumin db11[26-38].eqiad.wmnet 'echo "autoLearnMode=1" > /tmp/disable_learn && sudo megacli -AdpBbuCmd -SetBbuProperties -f /tmp/disable_learn -a0'
13 hosts will be targeted:
db[1126-1138].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(13) db[1126-1138].eqiad.wmnet
----- OUTPUT of 'echo "autoLearnM...isable_learn -a0' -----

Adapter 0: Set BBU Properties Succeeded.

Exit Code: 0x00

================
PASS:  |██████████████████████████████████████████████████████████████████████████████████| 100% (13/13) [00:00<00:00, 34.63hosts/s]
FAIL:  |                                                                                           |   0% (0/13) [00:00<?, ?hosts/s]
100.0% (13/13) success ratio (>= 100.0% threshold) for command: 'echo "autoLearnM...isable_learn -a0'.
100.0% (13/13) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Checking that it is disabled:

[07:25:13] marostegui@cumin1001:~$ sudo cumin db11[26-38].eqiad.wmnet 'megacli -AdpBbuCmd  -a0 | grep "Auto-Learn"'
13 hosts will be targeted:
db[1126-1138].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(13) db[1126-1138].eqiad.wmnet
----- OUTPUT of 'megacli -AdpBbuC...rep "Auto-Learn"' -----
  Auto-Learn Mode: Disabled
================
PASS:  |██████████████████████████████████████████████████████████████████████████████████| 100% (13/13) [00:00<00:00, 32.78hosts/s]
FAIL:  |                                                                                           |   0% (0/13) [00:00<?, ?hosts/s]
100.0% (13/13) success ratio (>= 100.0% threshold) for command: 'megacli -AdpBbuC...rep "Auto-Learn"'.
100.0% (13/13) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Mentioned in SAL (#wikimedia-operations) [2019-12-16T07:38:43Z] <marostegui> Disable auto-learn on db21[03-35] T240823

Same on codfw:

[07:37:39] marostegui@cumin1001:~$ sudo cumin db21[03-35].codfw.wmnet 'echo "autoLearnMode=1" > /tmp/disable_learn && sudo megacli -AdpBbuCmd -SetBbuProperties -f /tmp/disable_learn -a0'
33 hosts will be targeted:
db[2103-2135].codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(33) db[2103-2135].codfw.wmnet
----- OUTPUT of 'echo "autoLearnM...isable_learn -a0' -----

Adapter 0: Set BBU Properties Succeeded.

Exit Code: 0x00

================
PASS:  |██████████████████████████████████████████████████████████████████████████████████| 100% (33/33) [00:00<00:00, 38.98hosts/s]
FAIL:  |                                                                                           |   0% (0/33) [00:00<?, ?hosts/s]
100.0% (33/33) success ratio (>= 100.0% threshold) for command: 'echo "autoLearnM...isable_learn -a0'.
100.0% (33/33) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
[07:38:01] marostegui@cumin1001:~$ sudo cumin db21[03-35].codfw.wmnet 'megacli -AdpBbuCmd  -a0 | grep "Auto-Learn"'
33 hosts will be targeted:
db[2103-2135].codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(33) db[2103-2135].codfw.wmnet
----- OUTPUT of 'megacli -AdpBbuC...rep "Auto-Learn"' -----
  Auto-Learn Mode: Disabled
================
PASS:  |██████████████████████████████████████████████████████████████████████████████████| 100% (33/33) [00:00<00:00, 37.42hosts/s]
FAIL:  |                                                                                           |   0% (0/33) [00:00<?, ?hosts/s]
100.0% (33/33) success ratio (>= 100.0% threshold) for command: 'megacli -AdpBbuC...rep "Auto-Learn"'.
100.0% (33/33) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

The host alerted again, but unlike in the initial report, the battery now appears to be charging (I have run the command a few times and the percentage keeps increasing):

root@db1130:~# megacli -AdpBbuCmd -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3956 mV
Current: 169 mA
Temperature: 29 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : Charging
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : Yes
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0128
Relative State of Charge: 75 %
Charger Status: In Progress
Remaining Capacity: 127 mAh
Full Charge Capacity: 171 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 75 %
  Absolute State of charge: 0 %
  Remaining Capacity: 128 mAh
  Full Charge Capacity: 171 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 10 Min.
  Estimated Time to full recharge: 23 Min.
  Cycle Count: 11
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 0 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x129
  Firmware Version   : 0.6
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0.6
  Transparent Learn = 1
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled
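
The repeated check described above (running the command a few times and watching the percentage) could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes the output of `megacli -AdpBbuCmd -a0` keeps the `Key: Value` layout pasted in this task.

```python
import re
import subprocess


def read_bbu_output() -> str:
    """Run the same command used in this task and return its text output."""
    result = subprocess.run(
        ["megacli", "-AdpBbuCmd", "-a0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_bbu_fields(output: str) -> dict:
    """Collect 'Key: Value' and 'Key = Value' lines; first occurrence wins."""
    fields = {}
    for line in output.splitlines():
        match = re.match(r"^\s*([^:=]+?)\s*[:=]\s*(.*?)\s*$", line)
        if match:
            fields.setdefault(match.group(1), match.group(2))
    return fields


def relative_charge_percent(output: str) -> int:
    """Extract the 'Relative State of Charge' value as an integer percentage."""
    value = parse_bbu_fields(output)["Relative State of Charge"]
    return int(value.rstrip("%").strip())


def is_charge_increasing(percentages: list) -> bool:
    """True if the samples never decrease and the last is above the first."""
    if len(percentages) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(percentages, percentages[1:]))
    return never_drops and percentages[-1] > percentages[0]


if __name__ == "__main__":
    # Two readings taken from this task, trimmed to the relevant lines.
    first = "Relative State of Charge: 75 %\nCharger Status: In Progress\n"
    later = "Relative State of Charge: 100 %\nCharger Status: Complete\n"
    samples = [relative_charge_percent(first), relative_charge_percent(later)]
    print(samples, is_charge_increasing(samples))
```

On the host itself, `read_bbu_output()` would replace the trimmed strings, called at whatever interval suits the check.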

I am going to wait and see whether it reaches 100%, the error clears, and it does not happen again (this might be caused by the re-learn I launched earlier this morning).

As of now, the BBU is at 100%:

root@db1130:~# megacli -AdpBbuCmd -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3935 mV
Current: 0 mA
Temperature: 29 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : Yes
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0228
Relative State of Charge: 100 %
Charger Status: Complete
Remaining Capacity: 171 mAh
Full Charge Capacity: 171 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 100 %
  Absolute State of charge: 0 %
  Remaining Capacity: 171 mAh
  Full Charge Capacity: 171 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 14 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 11
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 0 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x129
  Firmware Version   : 0.6
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0.6
  Transparent Learn = 1
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00

Mentioned in SAL (#wikimedia-operations) [2019-12-17T06:21:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1130 T240823', diff saved to https://phabricator.wikimedia.org/P9892 and previous config saved to /var/cache/conftool/dbconfig/20191217-062136-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-12-17T06:31:23Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1130 T240823', diff saved to https://phabricator.wikimedia.org/P9893 and previous config saved to /var/cache/conftool/dbconfig/20191217-063121-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-12-17T06:40:31Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1130 T240823', diff saved to https://phabricator.wikimedia.org/P9894 and previous config saved to /var/cache/conftool/dbconfig/20191217-064030-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-12-17T07:07:11Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db1130 T240823', diff saved to https://phabricator.wikimedia.org/P9895 and previous config saved to /var/cache/conftool/dbconfig/20191217-070709-marostegui.json

It looks like the BBU is discharging at 1% per day, but BBUs normally swing between 90% and 100% all the time, so this is normal so far. Let's keep monitoring it for a few more days:

root@db1130:~# megacli -AdpBbuCmd -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3931 mV
Current: 0 mA
Temperature: 29 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0028
Relative State of Charge: 96 %
Charger Status: Complete
Remaining Capacity: 158 mAh
Full Charge Capacity: 165 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 96 %
  Absolute State of charge: 0 %
  Remaining Capacity: 158 mAh
  Full Charge Capacity: 165 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 13 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 11
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 0 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x129
  Firmware Version   : 0.6
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0.6
  Transparent Learn = 1
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
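
As a cross-check on the percentages reported so far, the Relative State of Charge can be compared with Remaining Capacity divided by Full Charge Capacity. The sketch below does that arithmetic for the three readings above. The helper name is hypothetical, and how the controller rounds the value is not known from this task, so the check only confirms that reported and computed figures agree to within one percentage point. Note that Full Charge Capacity itself went from 171 mAh to 165 mAh between readings.

```python
def charge_ratio_percent(remaining_mah: float, full_mah: float) -> float:
    """Remaining capacity as a percentage of full charge capacity."""
    if full_mah <= 0:
        raise ValueError("full charge capacity must be positive")
    return 100.0 * remaining_mah / full_mah


# (remaining mAh, full charge mAh, reported relative state of charge in %)
readings = [
    (127, 171, 75),
    (171, 171, 100),
    (158, 165, 96),
]

for remaining, full, reported in readings:
    computed = charge_ratio_percent(remaining, full)
    # Agreement within one point is treated as consistent (an assumption).
    consistent = abs(computed - reported) < 1.0
    print(f"{remaining}/{full} mAh = {computed:.1f} % "
          f"(reported {reported} %, consistent: {consistent})")
```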

Mentioned in SAL (#wikimedia-operations) [2019-12-17T06:06:24Z] <marostegui> Upgrade db1130 T240823

This included a reboot too

Marostegui claimed this task.

I am going to close this as resolved, as it seems to have stabilized (for now at least):

root@db1130:~# megacli -AdpBbuCmd -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3921 mV
Current: 0 mA
Temperature: 30 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0128
Relative State of Charge: 90 %
Charger Status: Complete
Remaining Capacity: 147 mAh
Full Charge Capacity: 165 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 90 %
  Absolute State of charge: 0 %
  Remaining Capacity: 147 mAh
  Full Charge Capacity: 165 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 12 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 11
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 0 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x129
  Firmware Version   : 0.6
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0.6
  Transparent Learn = 1
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled
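
The judgement that the BBU has stabilized rests on a handful of fields in the output above. A rough sketch of such a check follows. The function name and the choice of which flags count as problems are assumptions made for illustration, not a policy used in this task.

```python
# Firmware status flags treated as problems when not "No" (an assumption).
PROBLEM_FLAGS = (
    "Battery Pack Missing",
    "Battery Replacement required",
    "Remaining Capacity Low",
    "I2c Errors Detected",
    "Pack is about to fail & should be replaced",
)


def bbu_problems(output: str) -> list:
    """Return a list of problem descriptions; an empty list means none found."""
    state = None
    problems = []
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Key: Value' line
        key, value = key.strip(), value.strip()
        if key == "Battery State":
            state = value
        elif key in PROBLEM_FLAGS and value != "No":
            problems.append(f"{key}: {value}")
    if state != "Optimal":
        # Also covers output where the state line is missing entirely.
        problems.insert(0, f"Battery State: {state}")
    return problems


if __name__ == "__main__":
    # Lines trimmed from the output above.
    sample = (
        "Battery State: Optimal\n"
        "  Battery Pack Missing                    : No\n"
        "  Battery Replacement required            : No\n"
        "  Remaining Capacity Low                  : No\n"
    )
    print(bbu_problems(sample))
```

Returning a list rather than a boolean keeps the reason for any alert visible to whoever reads it.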

Nothing new in the HW logs:

-------------------------------------------------------------------------------
Record:      121
Date/Time:   12/16/2019 11:56:32
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------