
pc2006 rebooted itself
Closed, ResolvedPublic

Description

pc2006 crashed and rebooted itself.
Investigation is needed:

09:30 < icinga-wm> PROBLEM - mysqld processes on pc2006 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
root@pc2006:~# uptime
 09:37:55 up 10 min,  1 user,  load average: 0.00, 0.03, 0.03
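The `uptime` output above gives a rough boot time. A minimal sketch of the arithmetic, assuming `uptime` truncates to whole minutes and that the clock reading is from 2018-07-29 (the date is taken from the SAL entries in this task; `boot_window` is a hypothetical helper name):

```python
from datetime import datetime, timedelta

def boot_window(observed, uptime_minutes):
    """Return (earliest, latest) possible boot times.

    Assumes the reported uptime is truncated to whole minutes, so the
    real uptime lies in [uptime_minutes, uptime_minutes + 1).
    """
    latest = observed - timedelta(minutes=uptime_minutes)
    earliest = latest - timedelta(minutes=1)
    return earliest, latest

# Values copied from the uptime line: 09:37:55, "up 10 min".
observed = datetime(2018, 7, 29, 9, 37, 55)
earliest, latest = boot_window(observed, 10)
print(earliest.time(), latest.time())
```

Under those assumptions the host came back up between roughly 09:26:55 and 09:27:55, shortly before the 09:30 Icinga alert reporting no mysqld processes.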

Event Timeline

Change 449010 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool pc2006

https://gerrit.wikimedia.org/r/449010

Change 449010 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool pc2006

https://gerrit.wikimedia.org/r/449010

Mentioned in SAL (#wikimedia-operations) [2018-07-29T09:56:16Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool pc2006 T200641 (duration: 00m 57s)

Marostegui added a project: ops-codfw.
Marostegui added a subscriber: Papaul.

Looks like it had some memory errors:

/admin1/system1/logs1/log1-> show record1

	properties
		CreationTimestamp = 20180729082203.000000-300
		ElementName = System Event Log Entry
		RecordData = Correctable memory error rate exceeded for DIMM_B7.
		RecordFormat = string Description
		RecordID = 7
	associations
	targets
	verbs
		cd
		show
		help
		version
/admin1/system1/logs1/log1-> show record2

	properties
		CreationTimestamp = 20180729080029.000000-300
		ElementName = System Event Log Entry
		RecordData = Correctable memory error rate exceeded for DIMM_B7.
		RecordFormat = string Description
		RecordID = 6

These errors are on the same module as in T139283. We should probably upgrade the BIOS and firmware if newer versions are available. @Papaul can you check?
Meanwhile I will leave the host depooled. I have upgraded the system, kernel, etc. and rebooted the host, which came back just fine.
MySQL is also started.

Marostegui triaged this task as Medium priority.Jul 30 2018, 2:03 PM

There was already a BIOS upgrade at T139714. I would contact support directly, as suggested by robh here: https://phabricator.wikimedia.org/T139283#2430289

That sounds good to me. Here is the racadm getsel output so it can be sent to support:

/admin1-> racadm getsel
Record:      1
Date/Time:   <System Boot>
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   05/19/2017 15:54:44
Source:      system
Severity:    Ok
Description: An OS graceful shut-down occurred.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   05/19/2017 15:54:44
Source:      system
Severity:    Ok
Description: OEM software event.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/19/2017 01:56:20
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   07/19/2017 01:56:25
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   07/29/2018 09:00:29
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   07/29/2018 09:22:03
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
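When preparing output like this for support, it can help to summarise which DIMMs are affected. A minimal sketch, assuming the `Key: value` record layout and dashed separators seen in the paste above; `parse_sel` and `dimm_errors` are hypothetical helper names, not part of racadm, and the sample holds only records 6 and 7 from this task:

```python
import re
from collections import Counter

# Records 6 and 7 copied from the racadm getsel output in this task.
SEL_SAMPLE = """\
Record:      6
Date/Time:   07/29/2018 09:00:29
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   07/29/2018 09:22:03
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
"""

def parse_sel(text):
    """Split SEL text into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("---"):  # dashed line closes a record
            if current:
                records.append(current)
            current = {}
            continue
        # Split on the first colon only, so times like 09:00:29 stay intact.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records

def dimm_errors(records):
    """Count SEL entries per DIMM name mentioned in the description."""
    counts = Counter()
    for rec in records:
        counts.update(re.findall(r"DIMM_\w+", rec.get("Description", "")))
    return counts

records = parse_sel(SEL_SAMPLE)
print(dimm_errors(records))
```

For the two records above, both entries point at DIMM_B7, with the second one escalated from Non-Critical to Critical.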

Mentioned in SAL (#wikimedia-operations) [2018-08-06T14:56:33Z] <marostegui> Stop MySQL for onsite maintenance - T200641

This server has pro support, as mentioned in T139283. Reply from support:

Hi Papaul,

This server (7D3H282) has pro support so technically I cannot help you with it, but if you update the BIOS to 2.1.7:

http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=V99PP&fileId=3549897043&osCode=W12R2&productCode=poweredge-r630&languageCode=en&categoryId=BI

This should fix it. All 13th-gen servers need their BIOS updated to keep them from getting memory errors.

Dell cannot replace the bad DIMM.

The server BIOS is 2.3.4 and there is a newer version out, 2.8.0. We can just update the BIOS for now.
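For reference, the three BIOS versions mentioned in this thread can be ordered with a simple field-by-field comparison. This is only a sketch, assuming the versions are plain dotted integers; `parse_version` is a hypothetical helper:

```python
def parse_version(version):
    """Turn a dotted version string such as '2.3.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

suggested = parse_version("2.1.7")  # version named in the support reply
installed = parse_version("2.3.4")  # version reported on pc2006
latest = parse_version("2.8.0")     # newer version reported as available

# Tuples compare element by element, which matches dotted-version ordering.
print(suggested < installed < latest)
```

Under that assumption the installed 2.3.4 is already newer than the 2.1.7 named by support, and 2.8.0 is newer than both.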

Thanks @Papaul - let's upgrade the BIOS then.

@Marostegui BIOS updated to 2.8.0. It is all yours.


Thanks - I have started MySQL and will leave it running during the night.
There is not much we can do with this host as per T200641#4481566, and we will at some point replace these hosts with new ones (T195878).

If everything goes fine, I will repool it tomorrow and close this task.

Mentioned in SAL (#wikimedia-operations) [2018-08-07T07:05:06Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool pc2006 T200641 (duration: 00m 49s)

I have repooled the host, so I am going to consider this resolved as there is not much else we can do. I will create a task to get the BIOS on pc2004 and pc2005 upgraded before we do the DC failover, as we can easily depool those hosts.