mw1299 is down (jobrunner-canary, now up but depooled)
Closed, Resolved (Public)

Authored By

	Reedy
	Feb 7 2019, 11:20 PM

Description

23:18:45 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'README', 'mw1268.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2255.codfw.wmnet', 'mw2290.codfw.wmnet', 'mw2216.codfw.wmnet', 'mw1251.eqiad.wmnet', 'mw2188.codfw.wmnet', 'mw1320.eqiad.wmnet', 'mw1285.eqiad.wmnet'] on mw1299.eqiad.wmnet returned [255]: ssh: connect to host mw1299.eqiad.wmnet port 22: Connection timed out

Manuel rebooted it a while ago, but it seems to be dead again:

12:34 marostegui: Powercycle mw1299 as it is down and not responding

Event Timeline

Reedy created this task. Feb 7 2019, 11:20 PM

Depending on what's wrong with it, it might need depooling and removing from the scap host lists.

Mentioned in SAL (#wikimedia-operations) [2019-02-08T01:07:59Z] <mutante> powercycle crashed mw1299 via mgmt (garbled console output) (T215569)

20:12 < mutante> [mw1299:~] $ depool
20:12 < mutante> Depooling all services on mw1299.eqiad.wmnet

It's back up and running right now, but depooled, because this isn't the first time this has happened on this machine.

Dzahn renamed this task from "mw1299 is down" to "mw1299 is down (jobrunner-canary, now up but depooled)". Feb 8 2019, 1:31 AM

[puppetmaster1001:~] $ sudo -i confctl depool --hostname mw1299.eqiad.wmnet

eqiad/jobrunner/apache2/mw1299.eqiad.wmnet: pooled changed yes => no
eqiad/jobrunner/nginx/mw1299.eqiad.wmnet: pooled changed yes => no
eqiad/videoscaler/apache2/mw1299.eqiad.wmnet: pooled changed yes => no
eqiad/videoscaler/nginx/mw1299.eqiad.wmnet: pooled changed yes => no
WARNING:conftool.announce:conftool action : set/pooled=no; selector: name=mw1299.eqiad.wmnet

[puppetmaster1001:~] $ confctl select name=mw1299.eqiad.wmnet get
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=nginx"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=videoscaler,service=apache2"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=videoscaler,service=nginx"}
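The `confctl select ... get` output above is one JSON object per line, so the "is everything depooled?" check can be scripted. A minimal sketch, assuming the output has been captured as text; the helper name `all_depooled` is hypothetical and not part of conftool:

```python
import json

# Sample taken from the confctl output pasted above.
SAMPLE = """\
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=nginx"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=videoscaler,service=apache2"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=videoscaler,service=nginx"}
"""


def all_depooled(confctl_output: str, hostname: str) -> bool:
    """Return True only if at least one record exists for hostname
    and every such record reports pooled == "no"."""
    states = []
    for line in confctl_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if hostname in record:
            states.append(record[hostname]["pooled"])
    return bool(states) and all(state == "no" for state in states)


if __name__ == "__main__":
    print(all_depooled(SAMPLE, "mw1299.eqiad.wmnet"))
```

Returning False when no record matches is a deliberate choice here: an empty result more likely means a mistyped hostname than a safely depooled host.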

/admin1/system1/logs1/log1-> show record27

	properties
		CreationTimestamp = 20190208014959.000000-360
		ElementName = System Event Log Entry
		RecordData = CPU 1 machine check error detected.
		RecordFormat = string Description
		RecordID = 111
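
To line these log records up against the SAL entries, the `key = value` properties can be parsed mechanically. A rough sketch, assuming `CreationTimestamp` follows the CIM datetime layout (`yyyymmddHHMMSS.mmmmmm` followed by a signed UTC offset in minutes); whether the management controller's clock and offset are set correctly on this host is not something the log confirms. The function names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Sample taken from the "show record27" output pasted above.
SAMPLE = """\
/admin1/system1/logs1/log1-> show record27

	properties
		CreationTimestamp = 20190208014959.000000-360
		ElementName = System Event Log Entry
		RecordData = CPU 1 machine check error detected.
		RecordFormat = string Description
		RecordID = 111
"""


def parse_record(text: str) -> dict:
    """Collect the 'key = value' lines of one record into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # lines without '=' (prompt, 'properties') are ignored
            fields[key.strip()] = value.strip()
    return fields


def parse_cim_datetime(value: str) -> datetime:
    """Parse e.g. '20190208014959.000000-360' (assumed CIM datetime)."""
    stamp, offset_minutes = value[:21], value[21:]
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
    tz = timezone(timedelta(minutes=int(offset_minutes)))
    return naive.replace(tzinfo=tz)


if __name__ == "__main__":
    record = parse_record(SAMPLE)
    when = parse_cim_datetime(record["CreationTimestamp"])
    print(record["RecordID"], when.isoformat(), record["RecordData"])
```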

Mentioned in SAL (#wikimedia-operations) [2019-02-08T06:29:02Z] <marostegui> powercycle mw1299 - T215569

This host is under warranty until April 14, 2019 so we might want to try to debug this before it expires in case we need some replacement CPU or mainboard.

And it crashed again with the same error:

/admin1/system1/logs1/log1-> show record13

	properties
		CreationTimestamp = 20190208071154.000000-360
		ElementName = System Event Log Entry
		RecordData = CPU 1 machine check error detected.
		RecordFormat = string Description
		RecordID = 152
/admin1/system1/logs1/log1-> show record14

	properties
		CreationTimestamp = 20190208071154.000000-360
		ElementName = System Event Log Entry
		RecordData = A problem was detected related to the previous server boot.
		RecordFormat = string Description
		RecordID = 151

And it is full of:

/admin1/system1/logs1/log1-> show record1

	properties
		CreationTimestamp = 20190208071154.000000-360
		ElementName = System Event Log Entry
		RecordData = An OEM diagnostic event occurred.
		RecordFormat = string Description
		RecordID = 164

The host is stuck again (no ping, no ssh, nothing on the console but "[ 2451.381422] m", nothing new in getsel or getraclog). Forcing a reboot.

Mentioned in SAL (#wikimedia-operations) [2019-02-10T19:59:00Z] <volans|off> force rebooting mw1299, stuck again - T215569

The host has already crashed again; I'm leaving it as is for now. I've ack'ed the alerts in Icinga.

MoritzMuehlenhoff reassigned this task from RobH to Cmjohnson. Feb 11 2019, 8:14 AM

racadm sel
Record: 29
Date/Time: 02/02/2019 21:20:29
Source: system
Severity: Critical

Description: CPU 1 machine check error detected.

Ticket opened for a new CPU:

You have successfully submitted request SR986247109.

The self-dispatch was approved and the part should hopefully be here by tomorrow.

I replaced CPU1 with a new one and powered the server on. Assigning to @RobH to coordinate re-pooling and resolving.

Return shipping
USPS 9202 3946 5301 2440 9937
FEDEX 9611918 2393026 777743762

I've synced with @jijiki who is returning this to service and will comment on here.

Mentioned in SAL (#wikimedia-operations) [2019-02-13T17:25:30Z] <jijiki> Pooling mw1299 back - T215569

Server is pooled.
