hw troubleshooting: mw2336.codfw.wmnet and its mgmt are down
Closed, Resolved | Public | Request

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc): not urgent
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

mw2336.codfw.wmnet and mw2336.codfw.mgmt.wmnet are down:

14:41:15 <+icinga-wm> PROBLEM - Host mw2336 is DOWN: PING CRITICAL - Packet loss = 100%
14:43:57 <+icinga-wm> PROBLEM - Host mw2336.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
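
For reference, the two alert lines above can be read mechanically, for example to confirm that the host and its management interface went down within a few minutes of each other. This is only a minimal sketch: the line format is inferred from the two lines quoted in this task, and `parse_alert` and `ALERT_RE` are hypothetical names, not part of Icinga or any Wikimedia tooling.

```python
import re
from datetime import datetime

# Format assumed from the two icinga-wm lines quoted in this task only.
ALERT_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) <\+icinga-wm> PROBLEM - "
    r"Host (?P<host>\S+) is DOWN: PING CRITICAL - "
    r"Packet loss = (?P<loss>\d+)%$"
)


def parse_alert(line):
    """Return a dict with time, host and packet loss, or None if the line does not match."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    return {
        # Only a time of day is present in the log, so the date part is a placeholder.
        "time": datetime.strptime(match.group("time"), "%H:%M:%S"),
        "host": match.group("host"),
        "loss": int(match.group("loss")),
    }


lines = [
    "14:41:15 <+icinga-wm> PROBLEM - Host mw2336 is DOWN: PING CRITICAL - Packet loss = 100%",
    "14:43:57 <+icinga-wm> PROBLEM - Host mw2336.mgmt is DOWN: PING CRITICAL - Packet loss = 100%",
]

alerts = [parse_alert(line) for line in lines]

# Seconds between the host alert and the mgmt alert.
gap_seconds = int((alerts[1]["time"] - alerts[0]["time"]).total_seconds())
```

With both the host and its mgmt interface reporting 100% packet loss this close together, the failure points at the machine or its management controller rather than at the operating system alone.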

I have depooled the host. This appserver is redundant and not urgent.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-07-26T16:06:58Z] <legoktm> depooled mw2336.codfw.wmnet, mgmt is down too. T287394

Ladsgroup renamed this task from "hw troubleshooting: mw2336.codfw.wmnet and it's mgmt are down" to "hw troubleshooting: mw2336.codfw.wmnet and its mgmt are down". Jul 26 2021, 4:08 PM

Reset and upgraded the iDRAC. The server is back online.

Mentioned in SAL (#wikimedia-operations) [2021-07-26T17:41:58Z] <legoktm> ran scap pull and repooled mw2336.codfw.wmnet - T287394