Add Icinga check for CPU frequency on Dell R320
Open, Medium, Public

Description

Our Dell R320 servers have a hardware bug that sometimes makes them drop to a CPU frequency below 200 MHz. This is hard to notice except through the indirect effects of the system being slow, so we should have a separate Icinga check for as long as we're using this server type:

[18:33] <bblack> simple shell command to check for cpu0 speed < 800Mhz: test $(grep MHz /proc/cpuinfo |head -1|cut -d: -f2|cut -d\. -f1) -gt 800 || echo fail
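
A rough sketch of how that one-liner might be wrapped into a Nagios/Icinga-style plugin with conventional exit codes. This is not the plugin from the changes on this task; the function name `check_cpu_freq`, its arguments, and the 800 MHz default are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: check_cpu_freq, its arguments and the 800 MHz default are
# illustrative assumptions, not the plugin that was actually deployed.

# Usage: check_cpu_freq [MIN_MHZ] [CPUINFO_FILE]
# Prints one status line; returns 0 (OK), 2 (CRITICAL) or 3 (UNKNOWN).
check_cpu_freq() {
    min_mhz="${1:-800}"
    cpuinfo="${2:-/proc/cpuinfo}"

    # Take the first "cpu MHz" line and keep only the integer part,
    # e.g. "cpu MHz : 1199.945" becomes 1199.
    mhz=$(awk -F: '/^cpu MHz/ { gsub(/[ \t]/, "", $2); sub(/\..*/, "", $2); print $2; exit }' "$cpuinfo" 2>/dev/null)

    # Anything that is not a plain integer means the value could not be read.
    case "$mhz" in
        ''|*[!0-9]*)
            echo "UNKNOWN: could not read CPU frequency from $cpuinfo"
            return 3
            ;;
    esac

    if [ "$mhz" -lt "$min_mhz" ]; then
        echo "CRITICAL: cpu0 at ${mhz} MHz, below ${min_mhz} MHz"
        return 2
    fi
    echo "OK: cpu0 at ${mhz} MHz"
    return 0
}
```

A wrapper script would call `check_cpu_freq 800` and exit with the function's return code, so Icinga can map it to a service state.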

Related tasks of earlier incidents: T163158 and T147905

Event Timeline

Dzahn updated the task description.

Change 348966 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] Icinga: add simple plugin to check CPU frequency

https://gerrit.wikimedia.org/r/348966

Change 348966 merged by Dzahn:
[operations/puppet@production] Icinga: add simple plugin to check CPU frequency

https://gerrit.wikimedia.org/r/348966

Change 348976 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] base: add icinga check for CPU frequency on Dell R320

https://gerrit.wikimedia.org/r/348976

Change 348976 merged by Dzahn:
[operations/puppet@production] base: add icinga check for CPU frequency on Dell R320

https://gerrit.wikimedia.org/r/348976

In T332764 we are looking at migrating checks from nagios to icinga and, at the same time, trying to understand whether they are still valid or could be improved. When considering this check, I wonder if it's still of value. I looked at logstash and alerts for as long as logstash has logs. If this is still valid then:

  • Is the value of 600 still sound, or can we calculate that value dynamically based on some other data, ideally something exported by Prometheus?
  • Is it worthwhile having this for all servers and not just the R320?
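
One possible way to derive the threshold dynamically, sketched under assumptions: the kernel's cpufreq sysfs interface reports each CPU's hardware minimum in `cpuinfo_min_freq` (in kHz), so a check could alert when the observed frequency falls some margin below that. The function name `dynamic_min_mhz` and the 90% margin are hypothetical, and the sketch assumes a cpufreq driver is loaded on the host.

```shell
#!/bin/sh
# Sketch only: dynamic_min_mhz and the 90% default margin are illustrative
# assumptions. cpuinfo_min_freq is the kernel cpufreq sysfs file (kHz).

# Usage: dynamic_min_mhz [PERCENT] [CPUFREQ_DIR]
# Prints PERCENT% of the hardware minimum frequency, in whole MHz.
# Returns 1 if the value cannot be read.
dynamic_min_mhz() {
    percent="${1:-90}"
    dir="${2:-/sys/devices/system/cpu/cpu0/cpufreq}"

    [ -r "$dir/cpuinfo_min_freq" ] || return 1
    khz=$(cat "$dir/cpuinfo_min_freq")

    # Reject anything that is not a plain integer.
    case "$khz" in
        ''|*[!0-9]*) return 1 ;;
    esac

    # kHz -> MHz, scaled by the margin (integer arithmetic, rounds down).
    echo $(( khz * percent / 100 / 1000 ))
}
```

With a reported minimum of 1200000 kHz and the default margin, this prints 1080, so a host running well below its advertised floor would trip the check regardless of hardware model.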
jbond triaged this task as Medium priority.
Dzahn removed Dzahn as the assignee of this task. Mar 24 2023, 6:45 PM
Dzahn subscribed.

So, a few notes:

  • We no longer have any Dell R320s in production -- netbox reports 0 instances. https://netbox.wikimedia.org/dcim/device-types/37/
  • I manually queried Thanos for any instances of CPU frequency <= 200 MHz in the past month -- see P51432 for my work. I found only a handful of occurrences -- seriously, about a dozen 30-second intervals across a handful of machines -- and they were of two types:
    • jobrunner machines in codfw, which isn't running jobrunner jobs right now
    • one of the stat machines in eqiad, which are often very idle
  • If we were to run into this issue again, we are graphing CPU frequency on the Host Overview dashboard, so the problem wouldn't be totally opaque.
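
For anyone wanting to repeat that kind of lookup, a minimal sketch of building such a query follows. It does not reproduce P51432: the metric name `node_cpu_scaling_frequency_hertz` is assumed to be what node_exporter exposes on these hosts, `freq_query` is a hypothetical helper, and `THANOS_URL` is a placeholder.

```shell
#!/bin/sh
# Sketch only: freq_query is a hypothetical helper and the metric name is an
# assumption about what node_exporter exports; this is not P51432.

# Usage: freq_query [MAX_MHZ]
# Prints a PromQL expression matching CPUs at or below MAX_MHZ.
freq_query() {
    max_mhz="${1:-200}"
    # The metric is in hertz, so convert MHz to Hz.
    printf 'node_cpu_scaling_frequency_hertz <= %d\n' "$(( max_mhz * 1000000 ))"
}

# Example against a Prometheus-compatible HTTP API (THANOS_URL is a placeholder):
#   curl -sG "$THANOS_URL/api/v1/query" --data-urlencode "query=$(freq_query 200)"
```

An instant query like this only shows hosts currently below the limit; looking back over a month would need a range query or a `min_over_time` wrapper around the same selector.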

Given all this I think we should just consider this task obsolete and remove any existing alerting we have.