- - cloudmetrics1002.eqiad.wmnet https://netbox.wikimedia.org/dcim/devices/183/
- - Machine still in service. Coordinate with WMCS / Ping in #wikimedia-cloud-admin when ready
- - Put system into a failed state in Netbox.
- - Provide urgency of request, along with justification (redundancy, dependencies, etc.). Machine runs alerting and metrics for cloudVPS hardware. cloudmetrics1001 is even older and is due for replacement. The machine in question has failed multiple times over the last couple of months, each time requiring a power cycle: https://phabricator.wikimedia.org/T275605
- - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help.) Disk I/O ceases and CPU load spikes. The system eventually locks up and stops responding. Nothing has been found in the logs from before the lockup or from the boot process afterwards.
- - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
|Resolved||RobH||T161750 eqiad: (1) hardware access request for dedicated labmon1002|
|Unknown Object (Task)|
|Resolved||chasemp||T165784 rack/setup/install labmon1002|
|Resolved||Request||Jclark-ctr||T281881 hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet|
^ I have paused monitoring of cloudmetrics1002 on bacula, so it doesn't alert unnecessarily due to stale backups. Please remember to remove it from the ignore list; otherwise, backups will continue but will not be monitored.