
mw1415 (canary appserver) is down, incl. mgmt
Closed, ResolvedPublic

Description

01:43 <+icinga-wm> PROBLEM - Host mw1415 is DOWN: PING CRITICAL - Packet loss = 100%
01:43 <+icinga-wm> PROBLEM - Host mw1415.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
ssh root@mw1415.mgmt.eqiad.wmnet
channel 0: open failed: connect failed: Connection timed out
01:51 <+logmsgbot> !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1415.eqiad.wmnet

Event Timeline

Peachey88 updated the task description.

@Cmjohnson or @Jclark-ctr This server just went down, server itself AND mgmt at the same time. So we can't add much here.

But it was only purchased in 2021, so it should still be under warranty.

It's been depooled so you can check it out anytime.

IPMI from remote also fails: Error: Unable to establish IPMI v2 / RMCP+ session

Dzahn triaged this task as Medium priority. May 9 2022, 5:15 PM

@Dzahn The server is dead; it will not power on. I attempted to get it to a basic start-up configuration (1 DIMM, 1 CPU) and it still will not power on. Historically this means a main board swap is required. I will submit a ticket to Dell.

You have successfully submitted request SR1096030919.

@Cmjohnson Alright, gotcha! Thanks for the updates and Dell request.

Dzahn changed the task status from Open to In Progress. Mon, Jun 6, 6:58 PM

Dell tech should be here tomorrow or Thursday to fix.

21:13 < mutante> !log mw1415 - scap pull, restart apache, /usr/local/sbin/restart-php7.2-fpm (INFO: The server is depooled from all services. Restarting the service directly)

Mentioned in SAL (#wikimedia-operations) [2022-06-08T21:41:10Z] <mutante> repooled mw1415 after restarting apache and php-fpm, seeing all Icinga alerts recover etc T307755 T310225

This caused T310225: setting the host to pooled=inactive does not stop monitoring from checking it, so when it came back unexpectedly it triggered new alerts for 500s on this box, which had not received scap updates in the meantime. Setting it to pooled=no instead would have meant deployers got warnings about an unreachable host for a month. The deeper issue is that there is no right status to set hosts to while they are waiting for hardware repair.
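For reference, the state changes discussed above map onto conftool's confctl CLI roughly like this (a sketch based on the selector logged by cumin in this task; exact selector fields and state semantics beyond what the thread describes are assumptions):

```shell
# Depool the host from all services. While it stays unreachable,
# deployers will see warnings about it on every deploy (per the thread,
# potentially for a month during a hardware repair).
confctl select 'dc=eqiad,name=mw1415.eqiad.wmnet' set/pooled=no

# Alternatively mark it inactive. Deployments stop targeting it, but
# monitoring keeps checking the host, so an unexpected comeback with
# stale code can fire 500 alerts, as happened here (T310225).
confctl select 'dc=eqiad,name=mw1415.eqiad.wmnet' set/pooled=inactive

# After repair: sync code on the host first (scap pull, restart apache
# and php-fpm as logged above), then repool.
confctl select 'dc=eqiad,name=mw1415.eqiad.wmnet' set/pooled=yes
```

Neither state fits the "awaiting hardware repair" case cleanly, which is the gap the comment identifies.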

Dzahn claimed this task.