Maniphest T209921

ms-be2047 spontaneous reboots
Closed, ResolvedPublic

Assigned To

Authored By

	fgiunchedi
	Nov 20 2018, 7:47 AM

Description

Noticed on icinga that ms-be2047 has been rebooting without (AFAIK) an actual reboot being issued:

20:33 -icinga-wm_:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
20:36 -icinga-wm_:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 39.08 ms
22:42 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
22:44 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
03:07 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:10 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
03:17 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:19 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
03:24 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:27 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
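To see how long each outage lasted, the alerts above can be paired up, PROBLEM with the following RECOVERY. This is only a sketch: the regular expression and the `outages` helper are illustrative and not part of any Icinga tooling, and because the IRC lines carry only HH:MM, a time that goes backwards is assumed to mean the log crossed midnight.

```python
import re
from datetime import datetime, timedelta

# Alert lines copied from the task description.
ALERTS = """\
20:33 -icinga-wm_:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
20:36 -icinga-wm_:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 39.08 ms
22:42 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
22:44 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
03:07 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:10 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
03:17 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:19 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
03:24 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
03:27 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
"""

# Hypothetical pattern: time, IRC nick/channel token, alert type, host name.
LINE = re.compile(r"(\d{2}:\d{2}) \S+ (PROBLEM|RECOVERY) - Host (\S+) is")


def outages(text):
    """Return (time_down, minutes_down) for each PROBLEM/RECOVERY pair."""
    day = timedelta(0)   # grows by one day each time the clock wraps
    previous = None
    down_at = None
    result = []
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        clock = datetime.strptime(match.group(1), "%H:%M")
        stamp = clock + day
        if previous is not None and stamp < previous:
            # Assumption: time going backwards means we crossed midnight.
            day += timedelta(days=1)
            stamp = clock + day
        previous = stamp
        if match.group(2) == "PROBLEM":
            down_at = stamp
        elif down_at is not None:
            minutes = int((stamp - down_at).total_seconds() // 60)
            result.append((down_at.strftime("%H:%M"), minutes))
            down_at = None
    return result


for started, minutes in outages(ALERTS):
    print(f"down at {started} for about {minutes} min")
```

Each outage lasts two to three minutes, which is roughly the time a host needs to reboot.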

Details

	Subject	Repo	Branch	Lines +/-
	DHCP: Update MAC address for ms-be2047	operations/puppet	production	+1 -1


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		fgiunchedi	T209395 rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050
		Resolved		Papaul	T209921 ms-be2047 spontaneous reboots

Event Timeline


fgiunchedi merged a task: T210697: ms-be2047 rebooting itself.Nov 29 2018, 9:41 AM

fgiunchedi added a subscriber: • Marostegui.

Please replace the remaining hardware sent.

I found the hardware below out of date.

IDRAC at 3.21.21.21

CPLD at 1.4.9

BIOS at 1.0.1

Suggested action plan:

Clear System Event logs in IDRAC

Update IDRAC to 3.21.23.22

https://downloads.dell.com/FOLDER05181865M/1/iDRAC-with-Lifecycle-Controller_Firmware_K877V_WN64_3.21.23.22_A00.EXE

Update CPLD to 1.0.6

https://downloads.dell.com/FOLDER05313024M/1/CPLD_Firmware_PC0N3_WN64_1.0.6_A00.EXE

Update BIOS to 1.5.6

https://downloads.dell.com/FOLDER05268089M/1/BIOS_5KNGY_WN64_1.5.6.EXE

Remove CPU1 and replace with CPU2

Remove DIMMS in B slots

Create another Support Assist from the IDRAC to review.

Reboot system and check status

If no failures seen put the removed CPU in the CPU 2 slot and the DIMM in the B1 and B2 slot. Reboot and check status. Run another Support Assist if failures still seen.

Did everything the Dell engineer recommended above. Waiting to proceed to step 10.

Saw a crash happen today Thu Nov 29 at 22:10Z

@colewhite Thank you I saw that

Replaced all the parts that Dell shipped to me (main board, RAID controller, RAID controller interposer board, SAS cable) and swapped CPU1 with CPU2; we still have the same problem on the server. I emailed Dell last Friday and am waiting for them to get back in touch with me.

update from Dell

Can you clear the log from the IDRAC, boot into the Life Cycle controller and run diagnostics.

I need this to provide to my team lead for review.

After 16 hours of hardware diagnostics, the server came up with no errors. I have a call scheduled with Dell in 2 hours to discuss the next step to take.

New Doc 2018-12-05 09.18.15.pdf1 MBDownload

@fgiunchedi please see below.

Papaul, while you are trying a different power source, my Linux software support would like to review the OS logs to make sure we have covered all possible causes. They are requesting the MCE logs and syslogs for review if available.

Andy Johnson

Enterprise Engineer

Dell EMC| Enterprise Engineer

office + 1 800 945 3355, ext. 5135035

My work schedule is 7:00 am - 4:00 pm, Monday through Friday CST.

Mentioned in SAL (#wikimedia-operations) [2018-12-06T10:56:46Z] <volans> disable event handler on Icinga for ms-be2047 MD Raid and MegaRAID checks, it's spamming Phabricator - T209921

• mobrovac mentioned this in T211320: Degraded RAID on ms-be2047.Dec 6 2018, 11:24 AM

• mobrovac mentioned this in T211319: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211318: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211317: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211316: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211315: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211313: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211311: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211310: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211309: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211308: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211307: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211304: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211303: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211302: Degraded RAID on ms-be2047.Dec 6 2018, 11:26 AM

• mobrovac mentioned this in T211301: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211300: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211299: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211298: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211296: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211295: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211293: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211292: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211291: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211290: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211289: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211288: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211287: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211286: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211285: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211283: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211284: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211282: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211281: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211279: Degraded RAID on ms-be2047.

• mobrovac mentioned this in T211278: Degraded RAID on ms-be2047.

@Papaul @fgiunchedi Today the RAID alarm was continuously flapping and created a ton of tasks (see above) that I asked mo.brovac to close as he had access to the batch edit interface in Phabricator.
I've disabled the event handler for the 2 RAID checks in Icinga for this host. Please remember to re-enable them once fixed.

I've also set the Netbox status to FAILED. (see https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Failed )

FYI: at this time I cannot SSH in, and from the console I get the login screen, but after entering root as the user it doesn't ask me for a password and gets stuck there.

Andy.Johnson@dell.com

9:40 AM (21 minutes ago)

to me, faidon

Dell Customer Communication

Here is a link to the Dell Support Live Image (SLI) Version 3.0. with this we can test the hardware outside of the OS to see if it reboots.

https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=c31j4

Installation instructions

Download SLI30_A00 and burn it into a DVD or Pendrive as bootable or mount it to iDRAC virtual DVD using iDRAC virtual media option.
Connect the DVD or Pendrive on physical server or if booting through iDRAC, map the iso image using iDRAC virtual media. For DRAC based machines - 9G and 10G, you can mount Support Live Image on DRAC virtual console from DRAC GUI and for iDRAC, you can mount Support Live Image on Virtual Media option present in Virtual Console.
Reboot the server
At boot time press F11 key for Boot options
At the Boot Options screen select the device through which you are booting

a. DVD - If booting using DVD
b. Pendrive - Hard drive c: and the connected Pen Drive(BIOS Mode)
c. Pendrive: Appropriate USB port(IN UEFI MODE)
d. Virtual DVD - Virtual DVD

Upon selection SLI boot menu will be displayed, Select the appropriate option to boot the SLI image

Additionally, you can run a stress test on the system from the SLI.

Run the command ( stressapptest -s 1800 -W ). The -s value is the run time in seconds; for a 24-hour test the number should be 86400.
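The seconds value for -s is just hours times 3600. A small sketch of that arithmetic follows; the `stressapptest_command` helper is hypothetical, and only the -s and -W flags quoted in Dell's mail are used.

```python
def stressapptest_command(hours):
    """Build the stressapptest argument list for a run of the given length."""
    seconds = int(hours * 3600)  # -s takes the run time in seconds
    return ["stressapptest", "-s", str(seconds), "-W"]


# Half an hour matches the 1800 in Dell's example; 24 hours gives 86400.
print(" ".join(stressapptest_command(0.5)))
print(" ".join(stressapptest_command(24)))
```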

Running Stress test on the system

Stress test on the system came out with no errors.

I talked with @fgiunchedi on IRC about re-imaging the server with a fresh install and going from there.

Fresh OS installed on the system. Leaving the system up again to see whether the problem recurs.

Same error again at 22:47:

Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred.,
Critical,Tue 11 Dec 2018 22:47:04,CPU 1 machine check error detected.,
Normal,Tue 11 Dec 2018 16:47:05,Log cleared.,

Given that the other hosts in this batch are fine and we've replaced the parts Dell wanted to replace, what's the next step?

Mentioned in SAL (#wikimedia-operations) [2018-12-12T15:24:33Z] <godog> poweroff ms-be2044 for hardware inspection - T209921

MoritzMuehlenhoff subscribed.Dec 12 2018, 5:34 PM

fgiunchedi added a project: User-fgiunchedi.Dec 13 2018, 9:02 AM

fgiunchedi mentioned this in T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050.

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board.Dec 13 2018, 9:37 AM

Dell will be shipping 1 New CPU by Monday.

CPU 1 has been replaced. I also cleared the log. The system is back up and I will be monitoring it once again.

The problem happened again twice after replacing CPU1:

Normal,Mon 17 Dec 2018 20:29:29,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:29:29,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:29:29,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:29:29,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 20:29:29,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 20:25:30,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:25:30,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:25:30,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 20:25:30,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 20:25:29,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 19:41:01,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:41:01,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:41:00,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:41:00,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 19:41:00,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 19:23:32,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:23:32,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:23:32,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 19:23:32,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 19:23:32,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 18:56:05,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:56:05,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:56:05,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:56:05,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 18:56:05,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 18:24:10,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:24:10,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:24:10,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 18:24:09,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 18:24:09,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 17:49:28,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 17:49:28,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 17:49:28,An OEM diagnostic event occurred.,
Normal,Mon 17 Dec 2018 17:49:28,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 17:49:28,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 17:27:09,Log cleared.,
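To make the pattern in the event log easier to read, the Critical entries can be pulled out and the gaps between them computed. This is a rough sketch and assumes the export is plain `severity,timestamp,message,` rows as pasted above; the helper names are made up for illustration, and only a subset of the Normal rows is repeated in the sample. The `%a`/`%b` parsing assumes an English locale.

```python
import csv
from datetime import datetime

# Subset of the log pasted above (all Critical rows, a few Normal ones).
SEL_EXPORT = """\
Normal,Mon 17 Dec 2018 20:29:29,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 20:29:29,CPU 1 machine check error detected.,
Critical,Mon 17 Dec 2018 20:25:29,CPU 1 machine check error detected.,
Critical,Mon 17 Dec 2018 19:41:00,CPU 1 machine check error detected.,
Critical,Mon 17 Dec 2018 19:23:32,CPU 1 machine check error detected.,
Critical,Mon 17 Dec 2018 18:56:05,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 18:24:10,An OEM diagnostic event occurred.,
Critical,Mon 17 Dec 2018 18:24:09,CPU 1 machine check error detected.,
Critical,Mon 17 Dec 2018 17:49:28,CPU 1 machine check error detected.,
Normal,Mon 17 Dec 2018 17:27:09,Log cleared.,
"""


def critical_times(text):
    """Return the timestamps of all Critical rows, oldest first."""
    times = []
    for row in csv.reader(text.splitlines()):
        if len(row) >= 2 and row[0] == "Critical":
            times.append(datetime.strptime(row[1], "%a %d %b %Y %H:%M:%S"))
    return sorted(times)


def gaps_in_seconds(times):
    """Seconds between each pair of consecutive events."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


events = critical_times(SEL_EXPORT)
print(f"{len(events)} machine check errors after the log was cleared")
for gap in gaps_in_seconds(events):
    print(f"  next error after {gap // 60} min {gap % 60} s")
```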

The Redundancy Policy on this system was set to Not Redundant, whereas on the other working system it was set to Redundant, so we changed the setting on this system to Redundant as well. Monitoring the system again.

The host started sending cron spam about an hour ago. The mails were all from File "/usr/bin/swift-recon-cron", in which "AttributeError: cffi library '_openssl' has no function, constant or global variable named 'sk_H509_NAME]ENTRY_value'". That sounds unrelated to this hardware issue, but it just started randomly an hour ago and kept sending mail.

Since this host is broken and not in production anyway, I scheduled a downtime in Icinga for one month and Papaul shut the host down.

• Marostegui unsubscribed.Dec 20 2018, 6:48 PM

I pointed this out on IRC but putting it on the ticket because I can't help myself:

While there's no OpenSSL symbol sk_H509_NAME]ENTRY_value, there is a sk_X509_NAME_ENTRY_value. Also, ASCII H (0x48) and X (0x58), as well as ] (0x5D) and _ (0x5F), are each a single flipped bit from each other.
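The single-bit claim is easy to check by XOR-ing the character codes and counting the set bits. A minimal sketch follows; the helper names are made up for illustration.

```python
def flipped_bits(a, b):
    """Number of bits that differ between two single characters."""
    return bin(ord(a) ^ ord(b)).count("1")


def differences(seen, expected):
    """List (index, seen_char, expected_char, bits_flipped) where the strings differ."""
    return [
        (i, s, e, flipped_bits(s, e))
        for i, (s, e) in enumerate(zip(seen, expected))
        if s != e
    ]


seen = "sk_H509_NAME]ENTRY_value"      # symbol name from the cron error
expected = "sk_X509_NAME_ENTRY_value"  # the real OpenSSL symbol

for index, s, e, bits in differences(seen, expected):
    print(f"offset {index}: {s!r} (0x{ord(s):02X}) vs {e!r} (0x{ord(e):02X}), {bits} bit flipped")
```

Both mismatches differ by exactly one bit, which fits memory or CPU corruption far better than a software bug.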

🍋

Dell just called me. They will be shipping a new system, which will arrive by the first week of January.

fgiunchedi merged a task: T212439: swift-recon-cron - cffi library '_openssl' has no function, constant or global variable named 'sk_H509_NAME]ENTRY_value'.Dec 21 2018, 8:28 AM

fgiunchedi added a subscriber: • GTirloni.

fgiunchedi awarded a token.Dec 21 2018, 8:49 AM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board.Jan 26 2019, 10:42 PM

Received replacement server

New Doc 2018-12-17 10.53.34_2.pdf399 KBDownload

update Netbox with new serial number

Change 487873 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Update MAC address for ms-be2047

https://gerrit.wikimedia.org/r/487873

gerritbot added a project: Patch-For-Review.Feb 4 2019, 4:27 PM

Removed old puppet cert for ms-be2047.codfw.wmnet

Change 487873 merged by Dzahn:
[operations/puppet@production] DHCP: Update MAC address for ms-be2047

https://gerrit.wikimedia.org/r/487873

@fgiunchedi I replaced the problematic server with the new one Dell shipped to me. The OS is installed and puppet first run done. I will proceed to the disk wipe on the old server on Wednesday before shipping it back to Dell. Let me know if you have any questions.

Thanks.

In T209921#4925508, @Papaul wrote:

@fgiunchedi I replaced the problematic server with the new one Dell shipped to me. The OS is installed and puppet first run done. I will proceed to the disk wipe on the old server on Wednesday before shipping it back to Dell. Let me know if you have any questions.

Thanks.

Thanks Papaul, I'll stress-test the host and put it in service if no problems arise.

fgiunchedi moved this task from Radar to Doing on the User-fgiunchedi board.Feb 5 2019, 10:26 AM

Mentioned in SAL (#wikimedia-operations) [2019-02-05T13:55:15Z] <godog> swift codfw-prod: add ms-be2047 - T209395 T209921

• GTirloni unsubscribed.Feb 5 2019, 2:02 PM

Mentioned in SAL (#wikimedia-operations) [2019-02-06T09:15:42Z] <godog> swift codfw-prod: more weight for ms-be2047 - T209395 T209921

Mentioned in SAL (#wikimedia-operations) [2019-02-07T08:34:24Z] <godog> swift codfw-prod: more weight to ms-be2047 - T209395 T209921

Old server has been shipped out. Shipping information below.

New Doc 2019-02-07 10.08.44_1.pdf433 KBDownload

Mentioned in SAL (#wikimedia-operations) [2019-02-08T10:23:38Z] <godog> swift codfw-prod: more weight to ms-be2047 - T209395 T209921

Host is in service at full weight, assigning to @Papaul for return of previous hardware

The previous hardware was already returned last Thursday (see the comment from Feb 7). We can resolve this task.

	F28148277: New Doc 2019-02-07 10.08.44_1.pdf
	Feb 7 2019, 4:27 PM

	F28116675: New Doc 2018-12-17 10.53.34_2.pdf
	Feb 4 2019, 4:11 PM

	F27392018: New Doc 2018-12-05 09.18.15.pdf
	Dec 5 2018, 3:29 PM

	F27496954: 20181211_093420.jpg
	Dec 11 2018, 3:37 PM


	F27269118: Selection_043.png
	Nov 21 2018, 2:42 PM
