ms-be2047 spontaneous reboots
Open, Normal, Public

Description

Noticed on icinga that ms-be2047 has been rebooting without (AFAIK) an actual reboot being issued:

20:33 -icinga-wm_:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
20:36 -icinga-wm_:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 39.08 ms
22:42 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
22:44 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
03:07 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:10 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
03:17 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:19 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
03:24 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:27 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
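
To see how long each flap lasted, the PROBLEM/RECOVERY pairs above can be matched up mechanically. The following is a minimal sketch, not part of any existing tooling: the function name `outage_minutes` and the regular expression are illustrative assumptions, and the sample holds only the first four lines quoted above.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape, taken from the IRC excerpts in this task.
PATTERN = re.compile(
    r"^(?P<time>\d{2}:\d{2}) \S+ (?:PROBLEM|RECOVERY) - Host (?P<host>\S+) is (?P<state>DOWN|UP):"
)

SAMPLE = """\
20:33 -icinga-wm_:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
20:36 -icinga-wm_:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 39.08 ms
22:42 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
22:44 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
"""


def outage_minutes(lines):
    """Pair each DOWN line with the next UP line; return (down_time, minutes)."""
    outages = []
    down_at = None
    for line in lines:
        match = PATTERN.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group("time"), "%H:%M")
        if match.group("state") == "DOWN":
            down_at = stamp
        elif down_at is not None:
            delta = stamp - down_at
            if delta < timedelta(0):
                # The log carries no date, so assume recovery came after midnight.
                delta += timedelta(days=1)
            outages.append((down_at.strftime("%H:%M"), int(delta.total_seconds() // 60)))
            down_at = None
    return outages


if __name__ == "__main__":
    for started, minutes in outage_minutes(SAMPLE.splitlines()):
        print(f"down at {started} for about {minutes} min")
```

The timestamps only have minute resolution, so the durations are approximate.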

fgiunchedi triaged this task as Normal priority.
Restricted Application removed a project: Patch-For-Review. · Tue, Nov 20, 7:47 AM

ipmi-sel

11  | Nov-16-2018 | 16:40:53 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
12  | Nov-16-2018 | 16:45:48 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 26h
13  | Nov-16-2018 | 16:45:48 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
14  | Nov-16-2018 | 16:45:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
15  | Nov-16-2018 | 16:45:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data3 code = 00h
16  | Nov-16-2018 | 16:45:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
17  | Nov-16-2018 | 16:50:55 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 34h
18  | Nov-16-2018 | 16:50:55 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
19  | Nov-16-2018 | 16:50:55 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
20  | Nov-16-2018 | 16:50:55 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data3 code = 00h
21  | Nov-16-2018 | 16:50:55 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
22  | Nov-16-2018 | 17:02:16 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 20h
23  | Nov-16-2018 | 17:02:16 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
24  | Nov-16-2018 | 17:02:16 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
25  | Nov-16-2018 | 17:02:16 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data3 code = 00h
26  | Nov-16-2018 | 17:02:16 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
27  | Nov-16-2018 | 17:07:10 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 36h
28  | Nov-16-2018 | 17:07:10 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
29  | Nov-16-2018 | 17:07:10 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
30  | Nov-16-2018 | 17:07:10 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data3 code = 00h
31  | Nov-16-2018 | 17:07:10 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
32  | Nov-16-2018 | 17:13:23 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 22h
33  | Nov-16-2018 | 17:13:23 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
34  | Nov-16-2018 | 17:13:23 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
35  | Nov-16-2018 | 17:13:23 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data3 code = 00h
36  | Nov-16-2018 | 17:13:23 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
37  | Nov-19-2018 | 15:32:10 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 36h
38  | Nov-19-2018 | 15:32:10 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
39  | Nov-19-2018 | 15:32:10 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
40  | Nov-19-2018 | 15:32:10 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data3 code = 00h
41  | Nov-19-2018 | 15:32:10 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
42  | Nov-19-2018 | 22:40:20 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 16h
43  | Nov-19-2018 | 22:40:20 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
44  | Nov-19-2018 | 22:40:20 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
45  | Nov-19-2018 | 22:40:20 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data3 code = 00h
46  | Nov-19-2018 | 22:40:20 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
47  | Nov-20-2018 | 03:05:42 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 36h
48  | Nov-20-2018 | 03:05:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
49  | Nov-20-2018 | 03:05:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
50  | Nov-20-2018 | 03:05:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data3 code = 00h
51  | Nov-20-2018 | 03:05:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
52  | Nov-20-2018 | 03:15:27 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 34h
53  | Nov-20-2018 | 03:15:27 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
54  | Nov-20-2018 | 03:15:27 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
55  | Nov-20-2018 | 03:15:27 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data3 code = 00h
56  | Nov-20-2018 | 03:15:27 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
57  | Nov-20-2018 | 03:22:37 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 24h
58  | Nov-20-2018 | 03:22:37 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
59  | Nov-20-2018 | 03:22:37 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
60  | Nov-20-2018 | 03:22:37 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data3 code = 00h
61  | Nov-20-2018 | 03:22:37 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
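
The SEL output above is mostly MSR Info Log noise around each machine check. A small filter makes the actual CPU Machine Chk entries easier to scan. This is a sketch only: the helper name `machine_checks` is hypothetical, the column layout is assumed from the pipe-separated output pasted here, and the Data3 code is extracted as-is without any attempt to interpret it.

```python
import re

# A few rows copied from the ipmi-sel output in this task.
SAMPLE = """\
12  | Nov-16-2018 | 16:45:48 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 26h
13  | Nov-16-2018 | 16:45:48 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
17  | Nov-16-2018 | 16:50:55 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 34h
47  | Nov-20-2018 | 03:05:42 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 36h
"""


def machine_checks(sel_text):
    """Return (date, time, data3_code) for every CPU Machine Chk row."""
    events = []
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Expect six columns: ID, Date, Time, Name, Type, Event.
        if len(fields) != 6 or fields[3] != "CPU Machine Chk":
            continue
        match = re.search(r"Data3 code = ([0-9A-Fa-f]+)h", fields[5])
        events.append((fields[1], fields[2], match.group(1) if match else None))
    return events


if __name__ == "__main__":
    for date, time, code in machine_checks(SAMPLE):
        print(date, time, "Data3 =", code)
```

In practice the input would be the full `ipmi-sel` output rather than the shortened sample.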

@Papaul have you seen this before? The host is not in service, so you can take it down for troubleshooting at any time.

Unfortunately mcelog doesn't seem very helpful either:

root@ms-be2047:~# journalctl -t mcelog
-- Logs begin at Tue 2018-11-20 08:15:22 UTC, end at Tue 2018-11-20 09:34:59 UTC. --
Nov 20 08:15:25 ms-be2047 mcelog[1791]: Starting Machine Check Exceptions decoder: mcelog.
Nov 20 08:15:25 ms-be2047 mcelog[1938]: Running trigger `unknown-error-trigger'
Nov 20 08:15:25 ms-be2047 mcelog[1938]: warning: 16 bytes ignored in each record
Nov 20 08:15:25 ms-be2047 mcelog[1938]: consider an update
Nov 20 08:15:25 ms-be2047 mcelog[1938]: Running trigger `unknown-error-trigger'
Nov 20 08:15:25 ms-be2047 mcelog[1938]: warning: 16 bytes ignored in each record
Nov 20 08:15:25 ms-be2047 mcelog[1938]: consider an update
Nov 20 08:15:25 ms-be2047 mcelog[1938]: warning: 16 bytes ignored in each record
Nov 20 08:15:25 ms-be2047 mcelog[1938]: consider an update
Nov 20 08:15:25 ms-be2047 mcelog[1942]: CPU 3 on socket 1 received unknown error
Nov 20 08:15:25 ms-be2047 mcelog[1941]: CPU 3 on socket 1 received unknown error
Nov 20 08:15:25 ms-be2047 mcelog[1943]: Location: CPU 3 on socket 1
Nov 20 08:15:25 ms-be2047 mcelog[1944]: Location: CPU 3 on socket 1
Nov 20 08:15:25 ms-be2047 mcelog[1938]: warning: 16 bytes ignored in each record
Nov 20 08:15:25 ms-be2047 mcelog[1938]: consider an update
jijiki added a subscriber: jijiki. · Tue, Nov 20, 9:46 AM
Papaul claimed this task. · Tue, Nov 20, 3:08 PM

@fgiunchedi Here are the recommendations from Dell.
Dell support didn't find any HW error on the server while looking at the TSR log. He recommended that we clear the log and do some firmware updates:

  • BIOS
  • IDRAC
  • CPLD

After updating all three, reboot the server, enter the BIOS and leave the server in BIOS mode for some time. If the server reboots while in BIOS mode, it is a HW issue.

Case information
CPU1 - PE R740XD |Warranty ProSupport | SR 982809390 |

Papaul added a comment (edited). · Tue, Nov 20, 4:48 PM

Log cleared
BIOS, IDRAC, CPLD updated

Just after this I rebooted and got:

ms-be2047 login: [ 411.875775] mce: [Hardware Error]: CPU 17: Machine Check Exception: 5 Bank 14: b000000000020405
[ 411.884446] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff890189e5> [ 411.890874] {intel_idle+0x95/0x110}
mce: [Hardware Error]: TSC 1358b98bdf0
[ 411.897950] mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1542732282 SOCKET 1 APIC 32 microcode 200004d
[ 411.907312] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 411.917503] mce: [Hardware Error]: CPU 13: Machine Check Exception: 5 Bank 3: f200000000300179
[ 411.926083] mce: [Hardware Error]: RIP !INEXACT! 33:<0000563f08c051f0>
[ 411.932698] mce: [Hardware Error]: TSC 1358b98d70c
[ 411.937594] mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1542732282 SOCKET 1 APIC 30 microcode 200004d
[ 411.946956] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 411.957062] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 411.964007] Kernel panic - not syncing: Fatal machine check
[ 411.969636] Kernel Offset: 0x7a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
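
The two status words in the panic above (b000000000020405 on bank 14, f200000000300179 on bank 3) can be split into their flag bits by hand. The sketch below assumes the architectural IA32_MCi_STATUS layout as I understand it from Intel's documentation (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, low 16 bits MCA error code, next 16 bits model-specific code). It is not a replacement for `mcelog --ascii`, which is what the kernel message itself recommends, and the function name `decode_status` is made up for illustration.

```python
# Assumed flag positions in IA32_MCi_STATUS (architectural bits only).
FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status):
    """Split a 64-bit machine check status word into flags and error codes."""
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_code": status & 0xFFFF,
        "model_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    # Values copied from the console output in this task.
    for bank, value in ((14, 0xB000000000020405), (3, 0xF200000000300179)):
        decoded = decode_status(value)
        print(
            f"bank {bank}: flags={','.join(decoded['flags'])} "
            f"mca=0x{decoded['mca_code']:04x} model=0x{decoded['model_code']:04x}"
        )
```

Under that assumed layout, only the bank 3 value has PCC set, which would be consistent with the "Processor context corrupt" line preceding the panic.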

Volans added a subscriber: Volans. · Wed, Nov 21, 10:14 AM

ms-be2047 has been reported down by Icinga for a few minutes; unable to SSH, black screen on the console so far.

The IDRAC indicates that the system health is critical. I have contacted the Dell engineer who's working on the case.

I removed CPU1, moved CPU2 into the CPU1 slot, and booted the server with only CPU2; the same CPU problem was reported on CPU1. I requested that the parts below be sent to me:

  • Motherboard
  • H730P RAID controller card
  • SAS cable
  • RAID controller interposer board

for delivery on 11/26/18

I have replaced the system board. The system is back online. I will be monitoring it to see if the same problem occurs again. I also have on site a new H730P RAID controller card, SAS cable and RAID controller interposer board that Dell shipped to me.

Looks like it just happened again (timestamp UTC)

09:25 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
09:27 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
09:35 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
09:37 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms
root@ms-be2047:~# ipmi-sel --interpret-oem-data --output-oem-event-strings
ID  | Date        | Time     | Name             | Type                        | Event
1   | Nov-27-2018 | 17:36:49 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared
2   | Nov-28-2018 | 09:23:29 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; CPU 1 ; APIC ID 0
3   | Nov-28-2018 | 09:23:29 | MSR Info Log     | OEM Reserved                | OEM Diagnostic Data Event ; Register Offset = D04h ; Register Value = 00h
4   | Nov-28-2018 | 09:23:30 | MSR Info Log     | OEM Reserved                | OEM Diagnostic Data Event ; Register Offset = 79h ; Register Value = 01h
5   | Nov-28-2018 | 09:23:30 | MSR Info Log     | OEM Reserved                | OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 00h
6   | Nov-28-2018 | 09:23:30 | MSR Info Log     | OEM Reserved                | OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = F2h
7   | Nov-28-2018 | 09:33:08 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; CPU 1 ; APIC ID 52
8   | Nov-28-2018 | 09:33:08 | MSR Info Log     | OEM Reserved                | OEM Diagnostic Data Event ; Register Offset = D04h ; Register Value = 00h
9   | Nov-28-2018 | 09:33:08 | MSR Info Log     | OEM Reserved                | OEM Diagnostic Data Event ; Register Offset = 89h ; Register Value = 01h
10  | Nov-28-2018 | 09:33:08 | MSR Info Log     | OEM Reserved                | OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 00h
11  | Nov-28-2018 | 09:33:08 | MSR Info Log     | OEM Reserved                | OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = F2h
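
The two machine checks in this SEL listing (09:23:29 and 09:33:08) line up with the Icinga DOWN alerts at 09:25 and 09:35 quoted just before it. A quick way to check that kind of correlation is sketched below. It assumes the SEL clock is also in UTC and that the Icinga lines are from Nov 28; neither is stated explicitly in the output, and `lag_seconds` is a hypothetical helper name.

```python
from datetime import datetime


def lag_seconds(sel_stamp, icinga_hhmm, date="Nov-28-2018"):
    """Seconds between a SEL event and the Icinga DOWN alert that followed it.

    sel_stamp uses the ipmi-sel layout "Nov-28-2018 09:23:29"; the Icinga
    time only has minute resolution, so the result is approximate.
    """
    sel = datetime.strptime(sel_stamp, "%b-%d-%Y %H:%M:%S")
    alert = datetime.strptime(f"{date} {icinga_hhmm}", "%b-%d-%Y %H:%M")
    return int((alert - sel).total_seconds())


if __name__ == "__main__":
    pairs = [("Nov-28-2018 09:23:29", "09:25"), ("Nov-28-2018 09:33:08", "09:35")]
    for sel_stamp, alert in pairs:
        print(f"{sel_stamp} -> DOWN at {alert}: {lag_seconds(sel_stamp, alert)} s later")
```

Under those assumptions both alerts follow the corresponding machine check by roughly a minute and a half to two minutes.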

Please replace the remaining hardware sent.

I found the hardware below out of date.

IDRAC at 3.21.21.21

CPLD at 1.4.9

BIOS at 1.0.1

Suggested action plan:

  1. Clear System Event logs in IDRAC
  2. Update IDRAC to 3.21.23.22
  3. https://downloads.dell.com/FOLDER05181865M/1/iDRAC-with-Lifecycle-Controller_Firmware_K877V_WN64_3.21.23.22_A00.EXE
  4. Update CPLD to 1.0.6

https://downloads.dell.com/FOLDER05313024M/1/CPLD_Firmware_PC0N3_WN64_1.0.6_A00.EXE

  5. Update BIOS to 1.5.6

https://downloads.dell.com/FOLDER05268089M/1/BIOS_5KNGY_WN64_1.5.6.EXE

  6. Remove CPU1 and replace with CPU2
  7. Remove DIMMs in B slots
  8. Create another Support Assist from the IDRAC to review.
  9. Reboot system and check status
  10. If no failures are seen, put the removed CPU in the CPU 2 slot and the DIMMs in the B1 and B2 slots. Reboot and check status. Run another Support Assist if failures are still seen.
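
As a sanity check on the version numbers quoted in the action plan, the installed and target firmware versions can be compared field by field. This is only a sketch: the dictionaries simply restate the numbers from Dell's message, and `needs_update` is an illustrative helper, not a Dell or iDRAC API. Note that with the values exactly as quoted, the CPLD target (1.0.6) compares lower than the installed value (1.4.9); the sketch reports that as-is and does not try to resolve it.

```python
# Versions exactly as quoted in the message above.
INSTALLED = {"IDRAC": "3.21.21.21", "CPLD": "1.4.9", "BIOS": "1.0.1"}
TARGET = {"IDRAC": "3.21.23.22", "CPLD": "1.0.6", "BIOS": "1.5.6"}


def parse_version(text):
    """Turn a dotted version such as '3.21.21.21' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def needs_update(installed, target):
    """True when the installed version is strictly older than the target."""
    return parse_version(installed) < parse_version(target)


if __name__ == "__main__":
    for component, installed in INSTALLED.items():
        target = TARGET[component]
        verdict = "update" if needs_update(installed, target) else "check numbers"
        print(f"{component}: {installed} -> {target}: {verdict}")
```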

Did everything the Dell engineer recommended above. Waiting to proceed to step 10.

Saw a crash happen today Thu Nov 29 at 22:10Z

@colewhite Thank you I saw that

Papaul added a comment. · Mon, Dec 3, 5:15 AM

Replaced all the parts that were shipped to me by Dell (main board, RAID controller, RAID controller interposer board, SAS cable) and swapped CPU1 with CPU2; we still have the same problem on the server. I emailed Dell last Friday and am waiting for them to get back in touch with me.

Papaul added a comment. · Mon, Dec 3, 3:42 PM

Update from Dell:

Can you clear the log from the IDRAC, boot into the Lifecycle Controller and run diagnostics?

I need this to provide to my team lead for review.

Papaul added a comment. · Wed, Dec 5, 3:29 PM

After 16 hours of hardware diagnostics, the server came up with no errors. I have a call scheduled with Dell in 2 hours to discuss the next steps.

Papaul added a comment. · Wed, Dec 5, 6:20 PM

@fgiunchedi please see below.

Papaul, while you are trying a different power source, my Linux software support would like to review the OS logs to make sure we have covered all possible causes. They are requesting the MCE logs and syslogs for review if available.

Andy Johnson

Enterprise Engineer

Dell EMC| Enterprise Engineer

office + 1 800 945 3355, ext. 5135035

My work schedule is 7:00 am - 4:00 pm, Monday through Friday CST.

Mentioned in SAL (#wikimedia-operations) [2018-12-06T10:56:46Z] <volans> disable event handler on Icinga for ms-be2047 MD Raid and MegaRAID checks, it's spamming Phabricator - T209921

@Papaul @fgiunchedi Today the RAID alarm was continuously flapping and created a ton of tasks (see above) that I asked mo.brovac to close as he had access to the batch edit interface in Phabricator.
I've disabled the event handler for the 2 RAID checks in Icinga for this host. Please remember to re-enable them once fixed.

I've also set the Netbox status to FAILED. (see https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Failed )

FYI: at this time I cannot SSH in, and on the console I get the login screen, but after entering root as the user it doesn't ask for a password and gets stuck there.

Papaul added a comment. · Thu, Dec 6, 4:01 PM

Andy.Johnson@dell.com

9:40 AM (21 minutes ago)

to me, faidon

Dell Customer Communication

Here is a link to the Dell Support Live Image (SLI) Version 3.0. With this we can test the hardware outside of the OS to see if it reboots.

https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=c31j4

Installation instructions

  1. Download SLI30_A00 and burn it into a DVD or Pendrive as bootable or mount it to iDRAC virtual DVD using iDRAC virtual media option.
  2. Connect the DVD or Pendrive on physical server or if booting through iDRAC, map the iso image using iDRAC virtual media. For DRAC based machines - 9G and 10G, you can mount Support Live Image on DRAC virtual console from DRAC GUI and for iDRAC, you can mount Support Live Image on Virtual Media option present in Virtual Console.
  3. Reboot the server
  4. At boot time press F11 key for Boot options
  5. At the Boot Options screen select the device through which you are booting

a. DVD - If booting using DVD
b. Pendrive - Hard drive c: and the connected Pen Drive(BIOS Mode)
c. Pendrive: Appropriate USB port(IN UEFI MODE)
d. Virtual DVD - Virtual DVD

  6. Upon selection, the SLI boot menu will be displayed. Select the appropriate option to boot the SLI image.

Additionally, you can run a stress test on the system from the SLI.

Run command: stressapptest -s 1800 -W (-s is the duration in seconds; for a 24-hour test the number should be 86400)
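
The relationship between the test length and the -s value is just hours times 3600. A tiny sketch for building the command line is shown here; `stress_command` is a hypothetical helper, and only the `-s` and `-W` flags already quoted in Dell's message are used.

```python
def stress_command(hours):
    """Build the stressapptest argument list for a run of the given length."""
    seconds = int(hours * 3600)  # -s takes the duration in seconds
    return ["stressapptest", "-s", str(seconds), "-W"]


if __name__ == "__main__":
    print(" ".join(stress_command(0.5)))  # the 30-minute run from the message
    print(" ".join(stress_command(24)))   # the suggested 24-hour run
```

The list form can be passed straight to `subprocess.run` on the live image if Python is available there, which is not something this task confirms.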

Running Stress test on the system

Stress test on the system came out with no errors.

I talked with @fgiunchedi on IRC; the plan is to re-image the server with a fresh install and go from there.

Fresh OS installed on the system. Leaving the system up again to see.

Same error again at 22:47:

Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Critical,Tue 11 Dec 2018 22:47:04,CPU 1 machine check error detected.,
Normal,Tue 11 Dec 2018 16:47:05,Log cleared.,
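
The export above is comma-separated (severity, timestamp, message), so the critical entries can be pulled out with the standard `csv` module. A minimal sketch follows; the column meaning is assumed from the lines pasted here, and `critical_events` is an illustrative name.

```python
import csv
import io

# Lines copied from the log export in this task.
SAMPLE = """\
Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Critical,Tue 11 Dec 2018 22:47:04,CPU 1 machine check error detected.,
Normal,Tue 11 Dec 2018 16:47:05,Log cleared.,
"""


def critical_events(text):
    """Return (timestamp, message) for every row whose severity is Critical."""
    rows = csv.reader(io.StringIO(text))
    return [(row[1], row[2]) for row in rows if len(row) >= 3 and row[0] == "Critical"]


if __name__ == "__main__":
    for stamp, message in critical_events(SAMPLE):
        print(stamp, "-", message)
```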

Given that the other hosts in this batch are fine and we've replaced the parts Dell wanted to replace, what's the next step?

Mentioned in SAL (#wikimedia-operations) [2018-12-12T15:24:33Z] <godog> poweroff ms-be2044 for hardware inspection - T209921

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. · Thu, Dec 13, 9:37 AM

Dell will be shipping one new CPU by Monday.