mw1299 is down (jobrunner-canary, now up but depooled)
Closed, Resolved (Public)

Description

23:18:45 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'README', 'mw1268.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2255.codfw.wmnet', 'mw2290.codfw.wmnet', 'mw2216.codfw.wmnet', 'mw1251.eqiad.wmnet', 'mw2188.codfw.wmnet', 'mw1320.eqiad.wmnet', 'mw1285.eqiad.wmnet'] on mw1299.eqiad.wmnet returned [255]: ssh: connect to host mw1299.eqiad.wmnet port 22: Connection timed out

Manuel rebooted it a while ago, but it seems to be dead again:

12:34 marostegui: Powercycle mw1299 as it is down and not responding

Event Timeline

Reedy created this task. Feb 7 2019, 11:20 PM
Reedy added a comment. Feb 7 2019, 11:23 PM

Depending on what's up with it, it might need to be depooled and removed from the scap host lists.

Mentioned in SAL (#wikimedia-operations) [2019-02-08T01:07:59Z] <mutante> powercycle crashed mw1299 via mgmt (garbled console output) (T215569)

Dzahn added a subscriber: Dzahn. Feb 8 2019, 1:13 AM
20:12 < mutante> [mw1299:~] $ depool
20:12 < mutante> Depooling all services on mw1299.eqiad.wmnet
Dzahn added a comment. Feb 8 2019, 1:15 AM

It's back up and running right now, but depooled, because this isn't the first time this has happened on this machine.

Dzahn renamed this task from mw1299 is down to mw1299 is down (jobrunner-canary, now up but depooled). Feb 8 2019, 1:31 AM
Dzahn added a comment. Feb 8 2019, 1:41 AM
[puppetmaster1001:~] $ sudo -i confctl depool --hostname mw1299.eqiad.wmnet

eqiad/jobrunner/apache2/mw1299.eqiad.wmnet: pooled changed yes => no
eqiad/jobrunner/nginx/mw1299.eqiad.wmnet: pooled changed yes => no
eqiad/videoscaler/apache2/mw1299.eqiad.wmnet: pooled changed yes => no
eqiad/videoscaler/nginx/mw1299.eqiad.wmnet: pooled changed yes => no
WARNING:conftool.announce:conftool action : set/pooled=no; selector: name=mw1299.eqiad.wmnet

[puppetmaster1001:~] $ confctl select name=mw1299.eqiad.wmnet get
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=nginx"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=videoscaler,service=apache2"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=videoscaler,service=nginx"}
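
For anyone who wants to double-check a depool from a script rather than by eye, here is a minimal sketch. It assumes the `get` output is one JSON object per line, as pasted above; the helper names `pooled_states` and `fully_depooled` are made up for this example and are not part of conftool.

```python
import json

# Output lines copied from the `confctl select name=mw1299.eqiad.wmnet get` call above.
CONFCTL_OUTPUT = """\
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=nginx"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=videoscaler,service=apache2"}
{"mw1299.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=videoscaler,service=nginx"}
"""


def pooled_states(output, host):
    """Map each tags string to the pooled value reported for `host`."""
    states = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if host in record:
            states[record["tags"]] = record[host]["pooled"]
    return states


def fully_depooled(output, host):
    """True only if the host appears at least once and every entry says "no"."""
    states = pooled_states(output, host)
    return bool(states) and all(value == "no" for value in states.values())


print(fully_depooled(CONFCTL_OUTPUT, "mw1299.eqiad.wmnet"))
```

Requiring at least one matching entry avoids reporting a host as depooled when the selector simply matched nothing.
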
/admin1/system1/logs1/log1-> show record27

	properties
		CreationTimestamp = 20190208014959.000000-360
		ElementName = System Event Log Entry
		RecordData = CPU 1 machine check error detected.
		RecordFormat = string Description
		RecordID = 111
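
Side note on reading these records: the CreationTimestamp looks like the CIM/DMTF datetime layout (yyyymmddHHMMSS.mmmmmm followed by a signed UTC offset in minutes). Assuming that is what the management controller emits, a rough sketch for turning a record into something comparable with the SAL timestamps could look like this. The function names are hypothetical, and whether the controller's clock and offset are actually set correctly is a separate question.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: 14 digits, ".", 6 digits of microseconds, sign, offset in minutes.
CIM_RE = re.compile(r"^(\d{14})\.(\d{6})([+-])(\d{3})$")

# Record text copied from the log entry above.
RECORD = """\
properties
    CreationTimestamp = 20190208014959.000000-360
    ElementName = System Event Log Entry
    RecordData = CPU 1 machine check error detected.
    RecordFormat = string Description
    RecordID = 111
"""


def parse_properties(text):
    """Collect "key = value" lines into a dict; lines without "=" are ignored."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def parse_cim_datetime(value):
    """Parse a CIM-style datetime string into a timezone-aware datetime."""
    match = CIM_RE.match(value.strip())
    if match is None:
        raise ValueError(f"not a CIM datetime: {value!r}")
    stamp, micros, sign, minutes = match.groups()
    offset = timedelta(minutes=int(minutes))
    if sign == "-":
        offset = -offset
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(microsecond=int(micros))
    return naive.replace(tzinfo=timezone(offset))


props = parse_properties(RECORD)
when = parse_cim_datetime(props["CreationTimestamp"])
print(props["RecordID"], when.isoformat(), props["RecordData"])
# prints: 111 2019-02-08T01:49:59-06:00 CPU 1 machine check error detected.
```
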

Mentioned in SAL (#wikimedia-operations) [2019-02-08T06:29:02Z] <marostegui> powercycle mw1299 - T215569

Marostegui assigned this task to RobH. Feb 8 2019, 6:49 AM
Marostegui added a subscriber: RobH.

This host is under warranty until April 14, 2019 so we might want to try to debug this before it expires in case we need some replacement CPU or mainboard.

And it crashed again with the same error:

/admin1/system1/logs1/log1-> show record13

	properties
		CreationTimestamp = 20190208071154.000000-360
		ElementName = System Event Log Entry
		RecordData = CPU 1 machine check error detected.
		RecordFormat = string Description
		RecordID = 152
/admin1/system1/logs1/log1-> show record14

	properties
		CreationTimestamp = 20190208071154.000000-360
		ElementName = System Event Log Entry
		RecordData = A problem was detected related to the previous server boot.
		RecordFormat = string Description
		RecordID = 151

And it is full of:

/admin1/system1/logs1/log1-> show record1

	properties
		CreationTimestamp = 20190208071154.000000-360
		ElementName = System Event Log Entry
		RecordData = An OEM diagnostic event occurred.
		RecordFormat = string Description
		RecordID = 164
Volans added a subscriber: Volans. Feb 10 2019, 7:57 PM

The host is stuck again (no ping, no ssh, nothing on the console but "[ 2451.381422] m", nothing new in getsel or getraclog). Forcing a reboot.

Mentioned in SAL (#wikimedia-operations) [2019-02-10T19:59:00Z] <volans|off> force rebooting mw1299, stuck again - T215569

The host has already crashed again; I'm leaving it as is for now. I've ack'ed the alerts on Icinga.

racadm sel
Record: 29
Date/Time: 02/02/2019 21:20:29
Source: system
Severity: Critical

Description: CPU 1 machine check error detected.

Ticket opened for a new CPU:

You have successfully submitted request SR986247109.

The self-dispatch was approved and the part should hopefully be here by tomorrow.

Cmjohnson reassigned this task from Cmjohnson to RobH. Feb 13 2019, 5:04 PM
Cmjohnson added a subscriber: Cmjohnson.

I replaced CPU1 with a new one and powered the server on. Assigning to @RobH to coordinate re-pooling and resolving.

Return shipping
USPS 9202 3946 5301 2440 9937
FEDEX 9611918 2393026 777743762

RobH reassigned this task from RobH to jijiki. Feb 13 2019, 5:20 PM
RobH added a subscriber: jijiki.

I've synced with @jijiki who is returning this to service and will comment on here.

Mentioned in SAL (#wikimedia-operations) [2019-02-13T17:25:30Z] <jijiki> Pooling mw1299 back - T215569

jijiki closed this task as Resolved. Feb 13 2019, 5:33 PM

Server is pooled.