
cloudmetrics1004 potential hardware problem
Closed, Resolved, Public

Description

Today 2022-01-21 the cloudmetrics1004 server experienced some kind of hardware problem between 06:00 UTC and 08:00 UTC.

The SSH shell was unavailable and eventually @Andrew had to force-reboot the server from the mgmt console.

Nothing special in /var/log/syslog or similar, nor in the host metrics.

image.png (232 KB)

This may be related to T297814: cloudmetrics1003 seizes up under load

Event Timeline

This resembles the more-frequent issues that we've seen on 1003 (T297814) -- it's not exactly a crash, the system just gets so slow that things start to time out.

Assigning this to @Cmjohnson. However, I also reached out to @MoritzMuehlenhoff to take a look at this and T297814 later next week. Since there aren't any hardware errors in the logs, I asked Moritz to see if he could run any other diagnostic checks to pinpoint whether there really is a hardware issue, or maybe even determine whether the hardware specs being used are appropriate for what's being run on these servers at full load.

Thanks,
Willy


So, given that we've seen this on both cloudmetrics1003 and cloudmetrics1004, it doesn't seem to be a case of a single server being broken. There are two major failure scenarios here:

  1. We're triggering some error in one of the userspace daemons (Carbon or whatever) which makes the load skyrocket to the point that the system becomes unusable and sshd fails to open new connections
  2. We're triggering a kernel bug which has the same effect

How often does it happen? Is there a way to reliably reproduce it?

To narrow down 1. we could add a systemd timer which writes the output of "top -b -n1" to a file every 15 seconds or so. If the server goes down, we should be able to identify whether the number of processes rose dramatically or whether a single process stands out load-wise.
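A rough sketch of what that could look like (unit names, log path, and the exact interval below are illustrative only, not an agreed-upon change), using a oneshot service plus a systemd timer:

```
# Hypothetical debugging aid, not the actual deployed config.
cat > /etc/systemd/system/top-snapshot.service <<'EOF'
[Unit]
Description=Append a top snapshot for load debugging

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'date >> /var/log/top-snapshots.log; top -b -n1 >> /var/log/top-snapshots.log'
EOF

cat > /etc/systemd/system/top-snapshot.timer <<'EOF'
[Unit]
Description=Snapshot top output every 15 seconds

[Timer]
OnBootSec=15s
OnUnitActiveSec=15s
AccuracySec=1s

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now top-snapshot.timer
```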

One option to exclude 2. is to run cloudmetrics* with a more recent kernel, either by running Bullseye with Linux 5.10 (the main graphite* hosts are already running Bullseye) or by temporarily using the 5.10 kernel from buster-backports.
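For the backports route, the usual Debian procedure applies (the sketch below assumes buster-backports is already configured in the APT sources on these hosts, which may not be the case):

```
# Hypothetical sketch: install the 5.10 kernel from buster-backports and reboot into it.
apt update
apt install -t buster-backports linux-image-amd64
reboot
```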

I can easily reproduce the issue on cloudmetrics1003 by putting the host into service, e.g. with https://gerrit.wikimedia.org/r/c/operations/puppet/+/747667

On 1003 the freeze inevitably arrives within 24 hours. On 1004 I've only seen the issue occur once, ever. 1004 is still the active server and mostly holding up fine.

I haven't been able to reproduce the issue artificially (e.g. by simulating IO load), but I also haven't put a ton of time into attempting this.
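The kind of artificial IO load meant here would be something along the lines of stress-ng (the tool and parameters below are illustrative only, not what was actually tried):

```
# Hypothetical load-generator run; parameters are illustrative only.
apt install stress-ng
stress-ng --hdd 4 --hdd-bytes 4G --io 4 --timeout 10m --metrics-brief
```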

Change 756722 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Make cloudmetrics the primary again

https://gerrit.wikimedia.org/r/756722

Change 756722 merged by Andrew Bogott:

[operations/puppet@production] Make cloudmetrics the primary again

https://gerrit.wikimedia.org/r/756722

Change 756957 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints

https://gerrit.wikimedia.org/r/756957

Change 756958 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: monitoring: make cloudmetrics1001 the primary

https://gerrit.wikimedia.org/r/756958

Change 756958 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: monitoring: make cloudmetrics1001 the primary

https://gerrit.wikimedia.org/r/756958

Change 756957 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints

https://gerrit.wikimedia.org/r/756957

Mentioned in SAL (#wikimedia-cloud) [2022-01-25T10:49:46Z] <arturo> made cloudmetrics1001/1002 primary/backup respectively (T299744, T297814, T300011)

Change 757101 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] Revert "wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints"

https://gerrit.wikimedia.org/r/757101

Change 757101 merged by Andrew Bogott:

[operations/dns@master] Revert "wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints"

https://gerrit.wikimedia.org/r/757101

Resolving task since the kernel upgrade seems to have fixed this. Thanks, Willy