
cloudmetrics1002: mysterious issue
Closed, Duplicate (Public)

Description

Today some services running on cloudmetrics1002 failed. Attempts to access the server via SSH and the serial console also failed, so we had to force-reboot the system.

The logs contain no information about what happened. A hardware issue is suspected.

Event Timeline

This happened again just now. I forced a reboot via mgmt and it seems to be back. The perf graphs show the same shape as they did on Monday.

Mentioned in SAL (#wikimedia-cloud) [2021-04-29T15:11:52Z] <dcaro> hard rebooting cloudmetrics1002, got hung again (T275605)

Paged again for this issue this morning.

Mentioned in SAL (#wikimedia-operations) [2021-04-30T15:25:22Z] <bstorm> hard rebooting cloudmetrics1002 T275605

Icinga downtime set by dcaro@cumin1001 for 2:00:00 1 host(s) and their services with reason: Flaky host

cloudmetrics1002.eqiad.wmnet

Icinga downtime set by dcaro@cumin1001 for 2 days, 0:00:00 1 host(s) and their services with reason: Flaky host

cloudmetrics1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-cloud) [2021-05-04T13:19:46Z] <dcaro> rebooting cloudmetrics1002, got stuck again (T275605)

I am looking at updating the firmware first. I will have to download the latest version of the SPP off site, where the internet connection is more stable.

Mentioned in SAL (#wikimedia-operations) [2021-05-09T10:52:40Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 180 days, 0:00:00 on cloudmetrics1002.eqiad.wmnet with reason: T275605

Mentioned in SAL (#wikimedia-operations) [2021-05-09T10:52:50Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 180 days, 0:00:00 on cloudmetrics1002.eqiad.wmnet with reason: T275605

Mentioned in SAL (#wikimedia-cloud) [2021-05-09T10:53:06Z] <arturo> icinga-downtime cloudmetrics1002 for 3 months (T275605)

aborrero triaged this task as Medium priority. May 9 2021, 10:53 AM
aborrero moved this task from Backlog to Hardware faults on the cloud-services-team (Hardware) board.

Change 687585 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] grafana-labs: point to cloudmetrics1001

https://gerrit.wikimedia.org/r/687585

Change 687585 merged by Arturo Borrero Gonzalez:

[operations/dns@master] grafana-labs: point to cloudmetrics1001

https://gerrit.wikimedia.org/r/687585

Change 688213 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] prometheus-labmon: point to cloudmetrics1001

https://gerrit.wikimedia.org/r/688213

Change 688213 merged by Arturo Borrero Gonzalez:

[operations/dns@master] prometheus-labmon: point to cloudmetrics1001

https://gerrit.wikimedia.org/r/688213

@Jclark-ctr @wiki_willy were we able to upgrade this firmware and see if that helped?