mw1415 (canary appserver) is down, incl. mgmt
Closed, Resolved (Public)

Description

01:43 <+icinga-wm> PROBLEM - Host mw1415 is DOWN: PING CRITICAL - Packet loss = 100%
01:43 <+icinga-wm> PROBLEM - Host mw1415.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
ssh root@mw1415.mgmt.eqiad.wmnet
channel 0: open failed: connect failed: Connection timed out
01:51 <+logmsgbot> !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1415.eqiad.wmnet

Event Timeline

Peachey88 updated the task description.

@Cmjohnson or @Jclark-ctr This server just went down, server itself AND mgmt at the same time. So we can't add much here.

But it was only purchased in 2021, so it should be under warranty.

It's been depooled so you can check it out anytime.

IPMI from remote also fails: Error: Unable to establish IPMI v2 / RMCP+ session

Dzahn triaged this task as Medium priority. May 9 2022, 5:15 PM

@Dzahn The server is dead and will not power on. I stripped it down to a basic start-up configuration (1 DIMM, 1 CPU) and it still will not power on. Historically, a main board swap is required in this situation. I will submit a ticket to Dell.

You have successfully submitted request SR1096030919.

@Cmjohnson Alright, gotcha! Thanks for the updates and Dell request.

Dzahn changed the task status from Open to In Progress. Jun 6 2022, 6:58 PM

A Dell tech should be here tomorrow or Thursday to fix it.

21:13 < mutante> !log mw1415 - scap pull, restart apache, /usr/local/sbin/restart-php7.2-fpm (INFO: The server is depooled from all services. Restarting the service directly)

Mentioned in SAL (#wikimedia-operations) [2022-06-08T21:41:10Z] <mutante> repooled mw1415 after restarting apache and php-fpm, seeing all Icinga alerts recover etc T307755 T310225

This caused T310225. Setting the host to pooled=inactive does not stop monitoring from checking it, so when it came back unexpectedly it triggered new alerts for 500s, because the box had not received scap updates in the meantime. Setting it to pooled=no instead would have meant deployers got warnings about an unreachable host for a month. The deeper issue is that there is no right status to set hosts to while they are waiting for hardware repair.
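
To make the trade-off concrete, here is a minimal Python sketch of the two states as described in the comment above. It is only a toy model, not conftool or scap code: the `PooledState` class, its field names, and the `side_effects` helper are hypothetical, and the semantics are assumed from this comment alone (pooled=no hosts are still deploy targets, pooled=inactive hosts are skipped by deploys but still monitored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PooledState:
    """Hypothetical model of a pooled state; not a real conftool type."""
    receives_deploys: bool  # assumed: scap still targets the host
    monitored: bool         # assumed: monitoring still checks the host


# Semantics assumed from the task comment, not read from conftool itself.
STATES = {
    "no": PooledState(receives_deploys=True, monitored=True),
    "inactive": PooledState(receives_deploys=False, monitored=True),
}


def side_effects(state: str, host_reachable: bool) -> list[str]:
    """Return the problems named in the comment for a given situation.

    Only the two problems described in the task are modeled here.
    """
    s = STATES[state]
    problems = []
    if s.receives_deploys and not host_reachable:
        # Deployers see an unreachable-host warning on every deploy.
        problems.append("deploy warning: host unreachable")
    if s.monitored and not s.receives_deploys and host_reachable:
        # Host is back with stale code, and monitoring notices the errors.
        problems.append("monitoring alert: stale code may serve 500s")
    return problems


if __name__ == "__main__":
    # Host waiting for hardware repair, then coming back unexpectedly.
    for state in ("no", "inactive"):
        for reachable in (False, True):
            print(state, reachable, side_effects(state, reachable))
```

In this model neither state is free of problems across the whole repair window: pooled=no is noisy while the host is down, and pooled=inactive is noisy the moment the host returns. That is the gap the comment describes.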

Dzahn claimed this task.