
db1105 rebooted itself
Closed, Resolved · Public

Description

[15:42:33]  <+icinga-wm>	PROBLEM - Host db1105 is DOWN: PING CRITICAL - Packet loss = 100%
[15:43:11]  <+icinga-wm>	RECOVERY - Host db1105 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
root@db1105:~# w
 13:48:06 up 5 min,  3 users,  load average: 1.28, 1.44, 0.73
Oct 18 13:33:19 db1105 kernel: [19549564.111690] perf: interrupt took too long (17572 > 17107), lowering kernel.perf_event_max_sample_rate to 11250
Oct 18 13:34:47 db1105 kernel: [19549652.133836] perf: interrupt took too long (22884 > 21965), lowering kernel.perf_event_max_sample_rate to 8500
Oct 18 13:36:44 db1105 kernel: [19549769.307864] perf: interrupt took too long (29249 > 28605), lowering kernel.perf_event_max_sample_rate to 6750
Oct 18 13:38:01 db1105 wmf-auto-restart: INFO: 2019-10-18 13:38:01,796 : No restart necessary for service rsyslog
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@Oct 18 13:42:49 db1105 rsyslogd: imuxsock: Acquired UNIX socket '/run/systemd/journal/syslog' (fd 3) from systemd.  [v8.1901.0]
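
The `^@` characters in the excerpt above are how pagers and editors such as less and vim display NUL bytes. A run of them just before the first post-boot line is consistent with an unclean shutdown, although it does not prove one by itself. A minimal sketch for locating such runs in a log file; the function name `find_nul_runs`, the `min_len` threshold and the sample bytes are illustrative assumptions, not something used in this task:

```python
import re


def find_nul_runs(data: bytes, min_len: int = 16):
    """Return (offset, length) for every run of at least min_len NUL bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


# Synthetic stand-in for the excerpt above (not the real file contents).
sample = (
    b"Oct 18 13:38:01 db1105 wmf-auto-restart: INFO: ...\n"
    + b"\x00" * 40
    + b"Oct 18 13:42:49 db1105 rsyslogd: imuxsock: ...\n"
)

for offset, length in find_nul_runs(sample):
    print(f"{length} NUL bytes at offset {offset}")
```

On a real host the bytes would come from reading the log in binary mode, for example `open(path, "rb").read()`. The last timestamp before a run and the first one after it bracket the outage window.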

Details

Related Gerrit Patches:
[operations/puppet@production] db1105: Enable notifications
[operations/puppet@production] db1105: Disable notifications

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 18 2019, 1:48 PM
Marostegui updated the task description. Oct 18 2019, 1:50 PM

Nothing recent in the HW logs; the last entry is from January 25th 2019.

Change 544198 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1105: Disable notifications

https://gerrit.wikimedia.org/r/544198

Change 544198 merged by Marostegui:
[operations/puppet@production] db1105: Disable notifications

https://gerrit.wikimedia.org/r/544198

I am going to clear the HW logs, as the newest entries are about nine months old, but I am leaving them here for posterity:

/admin1-> racadm getsel
Record:      1
Date/Time:   03/30/2017 16:19:41
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   01/25/2019 07:02:56
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   01/25/2019 07:02:56
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   01/25/2019 07:02:56
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   01/25/2019 07:02:56
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   01/25/2019 07:02:56
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   01/25/2019 07:02:56
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   01/25/2019 07:02:56
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   01/25/2019 07:02:56
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   01/25/2019 07:02:56
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      30
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      31
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      32
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      33
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      34
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   01/25/2019 07:02:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      38
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      39
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      40
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      41
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      42
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      43
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      44
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      45
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      46
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      47
Date/Time:   01/25/2019 07:02:58
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      48
Date/Time:   01/25/2019 07:02:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      49
Date/Time:   01/25/2019 07:02:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      50
Date/Time:   01/25/2019 07:02:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      51
Date/Time:   01/25/2019 07:02:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      52
Date/Time:   01/25/2019 07:02:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      53
Date/Time:   01/25/2019 07:02:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      54
Date/Time:   01/25/2019 07:02:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      55
Date/Time:   01/25/2019 07:02:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
/admin1->  racadm clrsel
The SEL was cleared successfully
/admin1->
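
With 55 records in the dump above, the single Critical entry (record 3, the CPU 1 machine check error) is easy to miss. A minimal sketch of filtering such output by severity, assuming the plain-text layout shown above; the helper name `parse_sel` is hypothetical and the sample contains only the first three records:

```python
FIELDS = {"Record", "Date/Time", "Source", "Severity", "Description"}


def parse_sel(text: str):
    """Parse `racadm getsel`-style text into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("---"):  # separator between records
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            current[key.strip()] = value.strip()
    if current:  # last record if the dump has no trailing separator
        records.append(current)
    return records


sample = """\
Record:      1
Date/Time:   03/30/2017 16:19:41
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/25/2019 07:02:55
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
"""

records = parse_sel(sample)
not_ok = [r for r in records if r["Severity"] != "Ok"]
for r in not_ok:
    print(r["Record"], r["Date/Time"], r["Description"])
```

Only the five known field names are accepted, so prompt lines such as `/admin1-> racadm getsel` are ignored rather than parsed as fields.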
Marostegui triaged this task as High priority. Oct 18 2019, 2:06 PM
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2019-10-18T14:10:06Z] <marostegui> Run compare.py on db1105 - T235877

I am running a compare.py to check the data on s1 and s2, to make sure no data got corrupted with the crash.

Marostegui added a comment. (Edited) Oct 18 2019, 2:29 PM

There is no trace of errors in the HW logs, as seen at T235877#5586910. In our experience, this kind of crash/reboot is usually caused by a storage crash that takes down the whole server.
I have seen some smartd-related logs after the boot, but I think they are normal.

I have dumped all the logs from the controller via megacli and there is nothing there either:

Time: Fri Oct 18 13:41:25 2019

Code: 0x0000002c
Class: 0
Locale: 0x20
Event Description: Time established as 10/18/19 13:41:25; (48 seconds since power on)
Event Data:
===========
Elapsed Time since power-on: 48
Time: Fri Oct 18 13:41:25 2019



seqNum: 0x00000638
Time: Fri Oct 18 13:42:27 2019

Code: 0x00000185
Class: 0
Locale: 0x20
Event Description: Host driver is loaded and operational
Event Data:
===========
None


seqNum: 0x00000639
Time: Fri Oct 18 13:52:47 2019

Code: 0x000000f2
Class: 0
Locale: 0x08
Event Description: Battery charge complete
Event Data:
===========
None

The BBU also looks fine:

root@db1105:~# megacli   -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3927 mV
Current: 0 mA
Temperature: 37 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0238
Relative State of Charge: 100 %
Charger Status: Complete
Remaining Capacity: 529 mAh
Full Charge Capacity: 529 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 100 %
  Absolute State of charge: 0 %
  Remaining Capacity: 529 mAh
  Full Charge Capacity: 529 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 42 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 5
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 460 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x113
  Firmware Version   : 0.4
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0.4
  Transparent Learn = 1
  App Data = 1

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
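
"Looks fine" above comes down to a handful of fields in that output. A minimal sketch of checking them automatically, assuming the `Key : Value` layout shown above; the `EXPECTED` table, the choice of fields and the helper name `bbu_problems` are illustrative assumptions, not an official health definition:

```python
# Fields treated as health indicators in this sketch, with their healthy values.
EXPECTED = {
    "Battery State": "Optimal",
    "Battery Pack Missing": "No",
    "Battery Replacement required": "No",
    "Remaining Capacity Low": "No",
    "isSOHGood": "Yes",
}


def bbu_problems(output: str) -> dict:
    """Return {field: observed value} for each field that is missing or unhealthy."""
    seen = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in EXPECTED:
            seen[key.strip()] = value.strip()
    # A field absent from the output is reported with the value None.
    return {k: seen.get(k) for k, want in EXPECTED.items() if seen.get(k) != want}


# Lines copied from the output above.
sample = """\
Battery State: Optimal
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
isSOHGood: Yes
"""

print(bbu_problems(sample) or "BBU looks fine")
```

A missing field is reported as a problem instead of being silently skipped, so truncated or differently formatted output does not pass as healthy.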

There is a huge CPU spike just before the crash, but that may simply be a consequence of a general slowdown or some other abnormal state right before the crash.

Let's upgrade its firmware and BIOS to make sure it is all up-to-date in case this happens again and we need to open a case with the vendor.
@Cmjohnson @Jclark-ctr can we arrange a day/time to get this updated? (not assigning it to anyone as I don't know who will pick this up)

Thanks

CC @jcrespo as I will go on vacation soon

The data comparison finished correctly (still no HW logs).
I am going to give this host some weight to help out db1099:3311 so it doesn't get super cold.
@Cmjohnson let us know which day/time works for you for the upgrade, so we can depool it again in advance.

Change 544603 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1105: Enable notifications

https://gerrit.wikimedia.org/r/544603

Change 544603 merged by Marostegui:
[operations/puppet@production] db1105: Enable notifications

https://gerrit.wikimedia.org/r/544603

Mentioned in SAL (#wikimedia-operations) [2019-10-22T11:34:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1105:3311, db1105:3312 for firmware upgrade T235877', diff saved to https://phabricator.wikimedia.org/P9428 and previous config saved to /var/cache/conftool/dbconfig/20191022-113437-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-22T11:35:09Z] <marostegui> Stop MySQL on db1105:3311, db1105:3312 for firmware upgrade - T235877

Updated all F/W on db1105:

  • RAID
  • BIOS
  • Backplane
  • iDRAC

Thank you Chris!

Mentioned in SAL (#wikimedia-operations) [2019-10-22T12:32:58Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1105:3312 and db1105:3311 after on-site maintenance T235877', diff saved to https://phabricator.wikimedia.org/P9430 and previous config saved to /var/cache/conftool/dbconfig/20191022-123257-marostegui.json

Marostegui closed this task as Resolved. Oct 22 2019, 1:06 PM

Host fully repooled in production.
Thanks Chris!