
hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet
Closed, Resolved | Public | Request

Description

  • cloudmetrics1002.eqiad.wmnet https://netbox.wikimedia.org/dcim/devices/183/
  • Machine still in service. Coordinate with WMCS / ping in #wikimedia-cloud-admin when ready.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.): The machine runs alerting and metrics for cloudVPS hardware. cloudmetrics1001 is even older and is due for replacement. The machine in question has failed multiple times over the last couple of months, requiring a power cycle: https://phabricator.wikimedia.org/T275605
  • Describe issue and/or attach hardware failure log (refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help): Disk I/O ceases and CPU load spikes. The system eventually locks up and stops responding. Nothing has been found in the logs from before the lockup or from the boot afterwards.
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Event Timeline

nskaggs renamed this task from "hw troubleshooting: <type of hardware failure> for <fqdn of server>" to "hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet". May 4 2021, 2:55 PM
nskaggs created this task.
nskaggs mentioned this in Unknown Object (Task). May 4 2021, 3:22 PM

Change 684983 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] ceph alerts: fix hardcoded use of a single prometheus server

https://gerrit.wikimedia.org/r/684983

Change 684983 merged by Bstorm:

[operations/puppet@production] ceph alerts: fix hardcoded use of a single prometheus server

https://gerrit.wikimedia.org/r/684983

Change 684990 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudmetrics: fail over to cloudmetrics1001

https://gerrit.wikimedia.org/r/684990

Change 684990 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudmetrics: fail over to cloudmetrics1001

https://gerrit.wikimedia.org/r/684990

Change 690329 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Do not ignore people2002 and ignore cloudmetrics1002

https://gerrit.wikimedia.org/r/690329

Change 690329 merged by Jcrespo:

[operations/puppet@production] bacula: Do not ignore people2002 and ignore cloudmetrics1002

https://gerrit.wikimedia.org/r/690329

^ I have paused monitoring of cloudmetrics1002 on bacula so that it doesn't alert unnecessarily due to stale backups. Please remember to remove it from the ignore list afterwards; otherwise backups will continue but will not be monitored.

dcaro updated the task description.

Machine out of service and marked as failed in netbox, feel free to take it out/debug/troubleshoot it :)

Mentioned in SAL (#wikimedia-cloud) [2021-06-08T23:19:32Z] <bd808> Downtimed cloudmetrics1002 in icinga until 2021-06-30 23:59:01 (T281881)

I was able to update the firmware; the host is back up now.

Mentioned in SAL (#wikimedia-cloud) [2021-06-09T17:33:07Z] <arturo> removed icinga downtime for cloudmetrics1002 -- to see if hardware is healthy (T281881)

Update: I just checked the server -- seems fine. We decided to remove the icinga downtime and see if we detect any more hardware crashes in the next few days.

Please @jcrespo enable backups on that server again.

Change 700748 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Remove sretest1002 and cloudmetrics1002 from the backup ignore list

https://gerrit.wikimedia.org/r/700748

Change 700748 merged by Jcrespo:

[operations/puppet@production] bacula: Remove sretest1002 and cloudmetrics1002 from the backup ignore list

https://gerrit.wikimedia.org/r/700748

> Please @jcrespo enable backups on that server again.

@aborrero I did this yesterday. Please note, though, that backups were never removed and were still being attempted; we only disabled its monitoring because of its failures.

thanks @jcrespo

For the record, I just marked the server in netbox as Active.