hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet
Open, Medium, Public, Request

Description

  • cloudmetrics1002.eqiad.wmnet https://netbox.wikimedia.org/dcim/devices/183/
  • Machine is still in service. Coordinate with WMCS or ping #wikimedia-cloud-admin when ready.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.): The machine runs alerting and metrics for cloudVPS hardware. Cloudmetrics1001 is even older and is due for replacement. The machine in question has failed multiple times over the last couple of months, requiring a power cycle each time: https://phabricator.wikimedia.org/T275605
  • Describe issue and/or attach hardware failure log (refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help): Disk I/O ceases and CPU load spikes. The system eventually locks up and stops responding. Nothing has been found in the logs, either from before the lockup or from the boot that follows it.
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Event Timeline

nskaggs renamed this task from "hw troubleshooting: <type of hardware failure> for <fqdn of server>" to "hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet". Tue, May 4, 2:55 PM
nskaggs created this task.
nskaggs mentioned this in Unknown Object (Task). Tue, May 4, 3:22 PM

Change 684983 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] ceph alerts: fix hardcoded use of a single prometheus server

https://gerrit.wikimedia.org/r/684983

Change 684983 merged by Bstorm:

[operations/puppet@production] ceph alerts: fix hardcoded use of a single prometheus server

https://gerrit.wikimedia.org/r/684983

Change 684990 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudmetrics: fail over to cloudmetrics1001

https://gerrit.wikimedia.org/r/684990

Change 684990 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudmetrics: fail over to cloudmetrics1001

https://gerrit.wikimedia.org/r/684990