Description
- cloudmetrics1002.eqiad.wmnet https://netbox.wikimedia.org/dcim/devices/183/
- Machine still in service. Coordinate with WMCS / ping in #wikimedia-cloud-admin when ready.
- Put system into a failed state in Netbox.
- Provide urgency of request, along with justification (redundancy, dependencies, etc.): The machine runs alerting and metrics for cloudVPS hardware. cloudmetrics1001 is even older and is due for replacement. The machine in question has failed multiple times over the last couple of months, requiring a power cycle: https://phabricator.wikimedia.org/T275605
- Describe issue and/or attach hardware failure log (refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help): Disk I/O ceases and CPU load spikes. The system eventually locks up and stops responding. Nothing has been found in the logs from before the lockup or from the subsequent boot.
- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | RobH | T161750 eqiad: (1) hardware access request for dedicated labmon1002
Unknown Object (Task) | | |
Resolved | | chasemp | T165784 rack/setup/install labmon1002
Resolved | Request | Jclark-ctr | T281881 hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet
Event Timeline
Change 684983 had a related patch set uploaded (by Bstorm; author: Bstorm):
[operations/puppet@production] ceph alerts: fix hardcoded use of a single prometheus server
Change 684983 merged by Bstorm:
[operations/puppet@production] ceph alerts: fix hardcoded use of a single prometheus server
Change 684990 had a related patch set uploaded (by Bstorm; author: Bstorm):
[operations/puppet@production] cloudmetrics: fail over to cloudmetrics1001
Change 684990 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudmetrics: fail over to cloudmetrics1001
Change 690329 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] bacula: Do not ignore people2002 and ignore cloudmetrics1002
Change 690329 merged by Jcrespo:
[operations/puppet@production] bacula: Do not ignore people2002 and ignore cloudmetrics1002
^ I have paused monitoring of cloudmetrics1002 on bacula, so it doesn't alert unnecessarily due to stale backups. Please remember to remove it from the ignore list afterwards; until then, backups will continue but they will not be monitored.
Machine is out of service and marked as failed in Netbox. Feel free to take it out/debug/troubleshoot it :)
Mentioned in SAL (#wikimedia-cloud) [2021-06-08T23:19:32Z] <bd808> Downtimed cloudmetrics1002 in icinga until 2021-06-30 23:59:01 (T281881)
Mentioned in SAL (#wikimedia-cloud) [2021-06-09T17:33:07Z] <arturo> removed icinga downtime for cloudmetrics1002 -- to see if hardware is healthy (T281881)
Update: I just checked the server -- seems fine. We decided to remove the icinga downtime and see if we detect any more hardware crashes in the next few days.
Please @jcrespo enable backups on that server again.
Change 700748 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] bacula: Remove sretest1002 and cloudmetrics1002 from the backup ignore list
Change 700748 merged by Jcrespo:
[operations/puppet@production] bacula: Remove sretest1002 and cloudmetrics1002 from the backup ignore list
@aborrero I did this yesterday. Please note, though, that backups were never removed: they were still being attempted, and we only disabled their monitoring because of the host's failures.