
ms-be2047 spontaneous reboots
Closed, ResolvedPublic

Assigned To
Authored By
fgiunchedi
Nov 20 2018, 7:47 AM
Referenced Files
F28148277: New Doc 2019-02-07 10.08.44_1.pdf
Feb 7 2019, 4:27 PM
F28116675: New Doc 2018-12-17 10.53.34_2.pdf
Feb 4 2019, 4:11 PM
F27496954: 20181211_093420.jpg
Dec 11 2018, 3:37 PM
F11: profile-project.png
Dec 6 2018, 4:01 PM
F27392018: New Doc 2018-12-05 09.18.15.pdf
Dec 5 2018, 3:29 PM
F27269118: Selection_043.png
Nov 21 2018, 2:42 PM
Tokens
"The World Burns" token, awarded by fgiunchedi.

Description

Noticed on icinga that ms-be2047 has been rebooting without (AFAIK) an actual reboot being issued:

20:33 -icinga-wm_:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
20:36 -icinga-wm_:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 39.08 ms
22:42 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
22:44 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
03:07 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:10 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
03:17 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:19 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
03:24 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:27 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
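To see how long each outage lasted, the PROBLEM/RECOVERY pairs above can be matched up. This is a minimal sketch that assumes the IRC line layout shown in this description; the function name `outage_durations` and the regular expression are illustrative and not part of any Icinga tooling.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: "HH:MM <nick/channel> PROBLEM|RECOVERY - Host <name> is DOWN|UP: ..."
LINE_RE = re.compile(r"^(\d{2}:\d{2}) \S+ (PROBLEM|RECOVERY) - Host (\S+) is (DOWN|UP)")


def outage_durations(lines):
    """Pair each PROBLEM line with the next RECOVERY line; return minutes down."""
    durations = []
    down_since = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M")
        if match.group(2) == "PROBLEM":
            down_since = stamp
        elif down_since is not None:
            delta = stamp - down_since
            if delta < timedelta(0):
                # The log has no dates, so assume the outage crossed midnight.
                delta += timedelta(days=1)
            durations.append(int(delta.total_seconds() // 60))
            down_since = None
    return durations


SAMPLE = [
    "20:33 -icinga-wm_:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%",
    "20:36 -icinga-wm_:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 39.08 ms",
    "22:42 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%",
    "22:44 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms",
]

print(outage_durations(SAMPLE))
```

Outages of only two or three minutes are consistent with the host rebooting on its own rather than staying down.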

Related Objects

Event Timeline


Please replace the remaining hardware sent.

I found the firmware versions below to be out of date.

IDRAC at 3.21.21.21

CPLD at 1.4.9

BIOS at 1.0.1

Suggested action plan:

  1. Clear System Event logs in IDRAC
  1. Update IDRAC to 3.21.23.22
  1. https://downloads.dell.com/FOLDER05181865M/1/iDRAC-with-Lifecycle-Controller_Firmware_K877V_WN64_3.21.23.22_A00.EXE
  1. Update CPLD to 1.0.6

https://downloads.dell.com/FOLDER05313024M/1/CPLD_Firmware_PC0N3_WN64_1.0.6_A00.EXE

  1. Update BIOS to 1.5.6

https://downloads.dell.com/FOLDER05268089M/1/BIOS_5KNGY_WN64_1.5.6.EXE

  1. Remove CPU1 and replace with CPU2
  1. Remove DIMMS in B slots
  1. Create another Support Assist from the IDRAC to review.
  1. Reboot system and check status
  1. If no failures are seen, put the removed CPU in the CPU 2 slot and the DIMMs in the B1 and B2 slots. Reboot and check status. Run another Support Assist if failures are still seen.

Did everything the Dell engineer recommended above. Waiting to proceed to step 10.

Saw a crash happen today Thu Nov 29 at 22:10Z

Replaced all the parts that were shipped to me by Dell (main board, RAID controller, RAID controller interposer board, SAS cable) and swapped CPU1 with CPU2; we still have the same problem on the server. I emailed Dell last Friday and am waiting for them to get back in touch with me.

Update from Dell:

Can you clear the log from the IDRAC, boot into the Lifecycle Controller, and run diagnostics?

I need this to provide to my team lead for review.

After 16 hours of hardware diagnostics, the server came up with no errors. I have a call scheduled with Dell in 2 hours to discuss the next step to take.

@fgiunchedi please see below.

Papaul, while you are trying a different power source, my Linux software support would like to review the OS logs to make sure we have covered all possible causes. They are requesting the MCE logs and syslogs for review, if available.

Andy Johnson

Enterprise Engineer

Dell EMC| Enterprise Engineer

office + 1 800 945 3355, ext. 5135035

My work schedule is 7:00 am - 4:00 pm, Monday through Friday CST.

Mentioned in SAL (#wikimedia-operations) [2018-12-06T10:56:46Z] <volans> disable event handler on Icinga for ms-be2047 MD Raid and MegaRAID checks, it's spamming Phabricator - T209921

@Papaul @fgiunchedi Today the RAID alarm was continuously flapping and created a ton of tasks (see above) that I asked mo.brovac to close as he had access to the batch edit interface in Phabricator.
I've disabled the event handler for the 2 RAID checks in Icinga for this host. Please remember to re-enable them once fixed.

I've also set the Netbox status to FAILED. (see https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Failed )

FYI At this time I cannot SSH and from the console I get the login screen but after entering root as user it doesn't ask me for a password and gets stuck there.

Andy.Johnson@dell.com

9:40 AM (21 minutes ago)

to me, faidon

Dell Customer Communication

Here is a link to the Dell Support Live Image (SLI) Version 3.0. With this we can test the hardware outside of the OS to see if it reboots.

https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=c31j4

Installation instructions

  1. Download SLI30_A00 and burn it into a DVD or Pendrive as bootable or mount it to iDRAC virtual DVD using iDRAC virtual media option.
  2. Connect the DVD or Pendrive on physical server or if booting through iDRAC, map the iso image using iDRAC virtual media. For DRAC based machines - 9G and 10G, you can mount Support Live Image on DRAC virtual console from DRAC GUI and for iDRAC, you can mount Support Live Image on Virtual Media option present in Virtual Console.
  3. Reboot the server
  4. At boot time press F11 key for Boot options
  5. At the Boot Options screen select the device through which you are booting

a. DVD - If booting using DVD
b. Pendrive - Hard drive c: and the connected Pen Drive(BIOS Mode)
c. Pendrive: Appropriate USB port(IN UEFI MODE)
d. Virtual DVD - Virtual DVD

  6. Upon selection, the SLI boot menu will be displayed. Select the appropriate option to boot the SLI image.

Additionally, you can run a stress test on the system from the SLI.

Run the command ( stressapptest -s 1800 -W ). The -s value is the test duration in seconds; for a 24-hour test the number should be 86400.
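The duration arithmetic in Dell's note can be sketched as a small helper. It only builds the command line and does not run anything; the function name `stressapptest_command` is hypothetical, and the only flags used are the `-s` and `-W` from the command quoted above.

```python
import shlex


def stressapptest_command(hours):
    """Build the stressapptest invocation for a run of the given length in hours."""
    seconds = int(hours * 3600)  # -s takes the duration in seconds
    return ["stressapptest", "-s", str(seconds), "-W"]


# Half an hour matches the 1800-second example; 24 hours gives 86400 seconds.
print(shlex.join(stressapptest_command(0.5)))
print(shlex.join(stressapptest_command(24)))
```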

Running Stress test on the system

Stress test on the system came out with no errors.

20181211_093420.jpg (2×3 px, 2 MB)

I talked with @fgiunchedi on IRC and we agreed to re-image the server with a fresh install and go from there.

Fresh OS installed on the system. Leaving the system up again to see whether the problem recurs.

Same error again at 22:47:

Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Critical,Tue 11 Dec 2018 22:47:04,CPU 1 machine check error detected.,
Normal,Tue 11 Dec 2018 16:47:05,Log cleared.,

Given that the other hosts in this batch are fine and we've replaced the parts Dell wanted to replace, what's the next step?

Mentioned in SAL (#wikimedia-operations) [2018-12-12T15:24:33Z] <godog> poweroff ms-be2044 for hardware inspection - T209921

Dell will be shipping 1 New CPU by Monday.

CPU 1 has been replaced. I also cleared the log. The system is back up and I will be monitoring it once again.

The problem happened again twice after replacing CPU1:

Normal,Mon 17 Dec 2018 20:29:29,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:29:29,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:29:29,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:29:29,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 20:29:29,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 20:25:30,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:25:30,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:25:30,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:25:30,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 20:25:29,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 19:41:01,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:41:01,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:41:00,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:41:00,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 19:41:00,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 19:23:32,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:23:32,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:23:32,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:23:32,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 19:23:32,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 18:56:05,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:56:05,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:56:05,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:56:05,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 18:56:05,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 18:24:10,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:24:10,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:24:10,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:24:09,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 18:24:09,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 17:49:28,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 17:49:28,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 17:49:28,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 17:49:28,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 17:49:28,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 17:27:09,Log cleared.,
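To count the machine check events and see how far apart they are, the log export above can be parsed as CSV. This is a rough sketch that assumes the `severity,timestamp,message,` layout pasted in this comment and an English locale for the day and month names; `critical_events` and `gaps_in_seconds` are illustrative names, not iDRAC tooling.

```python
import csv
from datetime import datetime

# Assumed timestamp layout, e.g. "Mon 17 Dec 2018 17:49:28" (English day/month names).
TIME_FORMAT = "%a %d %b %Y %H:%M:%S"


def critical_events(sel_lines):
    """Return (timestamp, message) for every Critical row, oldest first."""
    events = []
    for row in csv.reader(sel_lines):
        if len(row) < 3 or row[0] != "Critical":
            continue
        events.append((datetime.strptime(row[1], TIME_FORMAT), row[2]))
    return sorted(events)


def gaps_in_seconds(events):
    """Seconds between consecutive events."""
    times = [stamp for stamp, _ in events]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


SAMPLE = [
    "Critical,Mon 17 Dec 2018 18:56:05,CPU 1 machine check error detected.,",
    "Normal,Mon 17 Dec 2018 18:24:10,An OEM diagnostic event occurred.,",
    "Critical,Mon 17 Dec 2018 18:24:09,CPU 1 machine check error detected.,",
    "Critical,Mon 17 Dec 2018 17:49:28,CPU 1 machine check error detected.,",
    "Normal,Mon 17 Dec 2018 17:27:09,Log cleared.,",
]

events = critical_events(SAMPLE)
print(len(events), gaps_in_seconds(events))
```

On the three sample rows, the gaps work out to roughly half an hour each.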

The Redundancy Policy on this system was set to Not Redundant, whereas on the other working system it was set to Redundant, so we changed the setting on this system to Redundant as well. Monitoring the system again.

The host started sending cron spam about an hour ago. All of it came from File "/usr/bin/swift-recon-cron", which raised "AttributeError: cffi library '_openssl' has no function, constant or global variable named 'sk_H509_NAME]ENTRY_value'". That sounds unrelated to this hardware issue, but it started randomly an hour ago and kept sending mail.

Since this host is broken and not in production anyway, I scheduled a downtime in Icinga for one month and Papaul shut the host down.

I pointed this out on IRC but putting it on the ticket because I can't help myself:

While there's no OpenSSL symbol sk_H509_NAME]ENTRY_value, there is a sk_X509_NAME_ENTRY_value. Also, ASCII H (0x48) and X (0x58), as well as ] (0x5D) and _ (0x5F), are each a single flipped bit from each other.
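The single-bit claim is easy to check by XOR-ing the character codes and counting the set bits. A minimal sketch; the helper names `flipped_bits` and `bit_differences` are made up for illustration.

```python
def flipped_bits(a, b):
    """Number of bits that differ between the code points of two characters."""
    return bin(ord(a) ^ ord(b)).count("1")


def bit_differences(expected, observed):
    """List (index, expected_char, observed_char, bits_flipped) for each mismatch."""
    if len(expected) != len(observed):
        raise ValueError("strings must have the same length")
    return [
        (i, e, o, flipped_bits(e, o))
        for i, (e, o) in enumerate(zip(expected, observed))
        if e != o
    ]


GOOD = "sk_X509_NAME_ENTRY_value"
SEEN = "sk_H509_NAME]ENTRY_value"

for index, e, o, bits in bit_differences(GOOD, SEEN):
    print(f"offset {index}: {e!r} (0x{ord(e):02X}) -> {o!r} (0x{ord(o):02X}), {bits} bit(s) flipped")
```

Both mismatches come out as exactly one flipped bit, which fits memory or CPU corruption far better than a software bug.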

🍋

Dell just called me. They will be shipping a new system, which will arrive by the first week of January.

Received the replacement server.

Updated Netbox with the new serial number.

Change 487873 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Update MAC address for ms-be2047

https://gerrit.wikimedia.org/r/487873

Removed old puppet cert for ms-be2047.codfw.wmnet

Change 487873 merged by Dzahn:
[operations/puppet@production] DHCP: Update MAC address for ms-be2047

https://gerrit.wikimedia.org/r/487873

@fgiunchedi I replaced the problematic server with the new one Dell shipped to me. The OS is installed and puppet first run done. I will proceed to the disk wipe on the old server on Wednesday before shipping it back to Dell. Let me know if you have any questions.

Thanks.

Thanks Papaul, I'll stress-test the host and put it in service if no problems arise.

Mentioned in SAL (#wikimedia-operations) [2019-02-05T13:55:15Z] <godog> swift codfw-prod: add ms-be2047 - T209395 T209921

Mentioned in SAL (#wikimedia-operations) [2019-02-06T09:15:42Z] <godog> swift codfw-prod: more weight for ms-be2047 - T209395 T209921

Mentioned in SAL (#wikimedia-operations) [2019-02-07T08:34:24Z] <godog> swift codfw-prod: more weight to ms-be2047 - T209395 T209921

Old server has been shipped out. Shipping information below.

Mentioned in SAL (#wikimedia-operations) [2019-02-08T10:23:38Z] <godog> swift codfw-prod: more weight to ms-be2047 - T209395 T209921

Host is in service at full weight, assigning to @Papaul for return of previous hardware

The previous hardware was already returned last Thursday (see the comment on Feb 7). We can resolve this task.