pc2005 crashed: CPU2 internal error
Closed, Resolved · Public

Description

Creating this task for the record:
pc2005 crashed and was unresponsive and this is what the ILO shows:

	properties
		CreationTimestamp = 20171228103236.000000-360
		ElementName = System Event Log Entry
		RecordData = CPU 2 has an internal error (IERR).
		RecordFormat = string Description
		RecordID = 2
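For reference, the `CreationTimestamp` value looks like a DMTF/CIM datetime (`yyyymmddHHMMSS.mmmmmm` followed by a signed UTC offset in minutes); that reading is an assumption, not something the log states. Under that assumption, a minimal sketch to convert it to UTC (the function name `parse_cim_datetime` is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def parse_cim_datetime(value: str) -> datetime:
    """Parse a CIM-style datetime such as '20171228103236.000000-360'.

    Assumes the layout yyyymmddHHMMSS.mmmmmm + sign + offset in minutes.
    """
    stamp, sign, offset = value[:21], value[21], value[22:]
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
    minutes = int(offset)
    if sign == "-":
        minutes = -minutes
    # Attach the offset as a fixed-offset timezone.
    return naive.replace(tzinfo=timezone(timedelta(minutes=minutes)))


event = parse_cim_datetime("20171228103236.000000-360")
print(event.isoformat())                           # local time with offset
print(event.astimezone(timezone.utc).isoformat())  # same instant in UTC
```

If the assumption holds, the entry was logged at 10:32:36 local time with a UTC-6 offset.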

After a reboot the server came up fine.

I suggest we update its firmware, if an update is needed.

This server is still under warranty, so if it happens again after an upgrade, I suggest we try to contact the vendor and get the CPU2 replaced.

@Papaul can you check if a firmware update is needed?
Thanks!

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui renamed this task from pc2005 CPU2 internal error to pc2005 crashed: CPU2 internal error. Dec 28 2017, 2:28 PM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.

I think these servers are leased. CC @RobH

Lease versus purchase makes no difference to warranty support, only to how we track the hardware. This can be processed as a normal under-warranty server. (Leasing just means we cannot replace parts with on-the-shelf ones or upgrade the system.)

Change 401744 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: "Depool" pc2005 for maintenance

https://gerrit.wikimedia.org/r/401744

Change 401744 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: "Depool" pc2005 for maintenance

https://gerrit.wikimedia.org/r/401744

Mentioned in SAL (#wikimedia-operations) [2018-01-03T15:48:30Z] <jynus> stop pc2005's database for maintenance T183750

@Papaul the pc2005 server is up, but MySQL is depooled and stopped. The host is downtimed for a day in Icinga and can be brought down at any time now.

Swapped CPU1 and CPU2.
Upgraded iDRAC from version 2.21 to 2.50.
Upgraded BIOS from version 2.1.7 to 2.6.0.

Leaving the task open for now to see whether the problem shows up on CPU1.
If the problem appears on CPU1: bad CPU (the error followed the processor).
If the problem appears on CPU2 again: bad main board.
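The rule above can be written down as a small lookup. This is only a sketch of the reasoning, assuming the processor that originally reported the IERR now sits in the CPU1 socket; the function name `diagnose` and the socket labels are hypothetical.

```python
def diagnose(socket_with_error: str) -> str:
    """Map the socket that reports the next IERR to the suspected part.

    Assumption: the processor that originally failed in CPU2 was moved
    to the CPU1 socket before this check.
    """
    verdicts = {
        "CPU1": "bad CPU",         # the error followed the processor
        "CPU2": "bad main board",  # the error stayed with the socket
    }
    if socket_with_error not in verdicts:
        raise ValueError(f"unexpected socket: {socket_with_error!r}")
    return verdicts[socket_with_error]


print(diagnose("CPU1"))
print(diagnose("CPU2"))
```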

@jcrespo server is back up

Pooling it back, as it will not be too dangerous.

Once the server is repooled we can probably close the task, as it might take weeks or months for the error to recur.
If it does, we can reopen the task and request the replacement.