hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet
Closed, Resolved · Public · Request
Assigned To

	Jclark-ctr

Authored By

	• nskaggs
	May 4 2021, 2:55 PM

Description

- cloudmetrics1002.eqiad.wmnet https://netbox.wikimedia.org/dcim/devices/183/
- Machine still in service. Coordinate with WMCS / Ping in #wikimedia-cloud-admin when ready
- Put system into a failed state in Netbox.
- Provide urgency of request, along with justification (redundancy, dependencies, etc.): The machine runs alerting and metrics for cloudVPS hardware. cloudmetrics1001 is even older and is due for replacement. The machine in question has failed multiple times over the last couple of months, each time requiring a power cycle: https://phabricator.wikimedia.org/T275605
- Describe issue and/or attach hardware failure log (refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help): Disk I/O ceases and CPU load spikes. The system eventually locks up and stops responding. Nothing has been found in the logs, either from before the lockup or from the subsequent boot.
- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Details

Subject	Repo	Branch	Lines +/-
bacula: Remove sretest1002 and cloudmetrics1002 from the backup ignore list	operations/puppet	production	+1 -4
bacula: Do not ignore people2002 and ignore cloudmetrics1002	operations/puppet	production	+2 -1
cloudmetrics: fail over to cloudmetrics1001	operations/puppet	production	+16 -16
ceph alerts: fix hardcoded use of a single prometheus server	operations/puppet	production	+3 -3

Related Objects

Status	Subtype	Assigned	Task
Resolved		RobH	T161750 eqiad: (1) hardware access request for dedicated labmon1002
			Unknown Object (Task)
Resolved		• chasemp	T165784 rack/setup/install labmon1002
Resolved	Request	Jclark-ctr	T281881 hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet

Event Timeline

• nskaggs renamed this task from hw troubleshooting: <type of hardware failure> for <fqdn of server> to hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet. May 4 2021, 2:55 PM

• nskaggs created this task.

• nskaggs added subscribers: cloud-services-team (Hardware), cloud-services-team.

• nskaggs added a project: cloud-services-team (Hardware). May 4 2021, 3:07 PM

• nskaggs moved this task from Backlog to Hardware faults on the cloud-services-team (Hardware) board.

• nskaggs removed a subscriber: cloud-services-team (Hardware).

• nskaggs mentioned this in Unknown Object (Task). May 4 2021, 3:22 PM

Maintenance_bot added a project: SRE. May 4 2021, 3:45 PM

• Bstorm subscribed. May 4 2021, 4:02 PM

Change 684983 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] ceph alerts: fix hardcoded use of a single prometheus server

https://gerrit.wikimedia.org/r/684983

gerritbot added a project: Patch-For-Review. May 4 2021, 4:11 PM

Change 684983 merged by Bstorm:

[operations/puppet@production] ceph alerts: fix hardcoded use of a single prometheus server

https://gerrit.wikimedia.org/r/684983

Change 684990 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudmetrics: fail over to cloudmetrics1001

https://gerrit.wikimedia.org/r/684990

wiki_willy assigned this task to Jclark-ctr. May 4 2021, 7:29 PM

Aklapper removed a subscriber: cloud-services-team. May 5 2021, 8:53 AM

• Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. May 6 2021, 5:22 PM

aborrero added a parent task: T165784: rack/setup/install labmon1002. May 9 2021, 10:55 AM

aborrero merged a task: T275605: cloudmetrics1002: mysterious issue.

aborrero added subscribers: aborrero, Jclark-ctr, RhinosF1 and 4 others.

Change 684990 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudmetrics: fail over to cloudmetrics1001

https://gerrit.wikimedia.org/r/684990

Maintenance_bot removed a project: Patch-For-Review. May 9 2021, 12:10 PM

Change 690329 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Do not ignore people2002 and ignore cloudmetrics1002

https://gerrit.wikimedia.org/r/690329

gerritbot added a project: Patch-For-Review. May 13 2021, 7:28 AM

Change 690329 merged by Jcrespo:

[operations/puppet@production] bacula: Do not ignore people2002 and ignore cloudmetrics1002

https://gerrit.wikimedia.org/r/690329

^ I have paused monitoring of cloudmetrics1002 on bacula so that it doesn't alert unnecessarily due to stale backups. Please remember to remove it from the ignore list afterwards; otherwise backups will continue but will not be monitored.

Maintenance_bot removed a project: Patch-For-Review. May 13 2021, 8:10 AM

dcaro updated the task description. Jun 2 2021, 3:29 PM

dcaro updated the task description.

The machine is out of service and marked as failed in Netbox; feel free to take it out, debug it, or troubleshoot it :)

Mentioned in SAL (#wikimedia-cloud) [2021-06-08T23:19:32Z] <bd808> Downtimed cloudmetrics1002 in icinga until 2021-06-30 23:59:01 (T281881)

I was able to update the firmware; the host is back up now.

Mentioned in SAL (#wikimedia-cloud) [2021-06-09T17:33:07Z] <arturo> removed icinga downtime for cloudmetrics1002 -- to see if hardware is healthy (T281881)

Update: I just checked the server and it seems fine. We decided to remove the icinga downtime and see whether we detect any more hardware crashes in the next few days.

Please @jcrespo enable backups on that server again.

Change 700748 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Remove sretest1002 and cloudmetrics1002 from the backup ignore list

https://gerrit.wikimedia.org/r/700748

gerritbot added a project: Patch-For-Review. Jun 22 2021, 5:37 AM

Change 700748 merged by Jcrespo:

[operations/puppet@production] bacula: Remove sretest1002 and cloudmetrics1002 from the backup ignore list

https://gerrit.wikimedia.org/r/700748

Maintenance_bot removed a project: Patch-For-Review. Jun 22 2021, 10:10 AM

In T281881#7146744, @aborrero wrote:

Please @jcrespo enable backups on that server again.

@aborrero I did this yesterday. Please note, though, that backups were never removed and were still being attempted; we only disabled their monitoring because of the host's failures.

thanks @jcrespo

For the record, I just marked the server in netbox as Active.
