Add Icinga check for CPU frequency on Dell R320
Open, Medium, Public

Assigned To

None

Authored By

	• MoritzMuehlenhoff
	Apr 18 2017, 4:42 PM

Description

Our Dell R320 servers have a hardware bug that sometimes makes them drop to a CPU frequency below 200 MHz. This is hard to notice except by observing indirect effects of the system being slow, so we should have a separate Icinga check for as long as we are using this server type:

[18:33] <bblack> simple shell command to check for cpu0 speed < 800Mhz: test $(grep MHz /proc/cpuinfo |head -1|cut -d: -f2|cut -d\. -f1) -gt 800 || echo fail
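The one-liner above could be wrapped into an Icinga-style plugin roughly as follows. This is a minimal sketch, not the plugin merged in change 348966: the function names `first_cpu_mhz` and `check_cpu_freq` are made up for illustration, the threshold is passed as an argument rather than fixed, and it assumes an x86 style `/proc/cpuinfo` that contains `cpu MHz` lines.

```shell
#!/bin/sh
# Sketch only; function names and argument order are assumptions.

# Print the integer MHz value of the first "cpu MHz" line in a
# cpuinfo-style file (defaults to /proc/cpuinfo). Prints nothing
# if the file is missing or has no such line.
first_cpu_mhz() {
    awk -F: '/^cpu MHz/ {
        v = $2
        sub(/\..*$/, "", v)      # drop the fractional part
        gsub(/[^0-9]/, "", v)    # drop surrounding whitespace
        print v
        exit
    }' "${1:-/proc/cpuinfo}" 2>/dev/null
}

# Compare the frequency against a minimum (in MHz) and return the
# conventional plugin exit codes: 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
check_cpu_freq() {
    min_mhz=$1
    cpuinfo=${2:-/proc/cpuinfo}
    mhz=$(first_cpu_mhz "$cpuinfo")
    if [ -z "$mhz" ]; then
        echo "UNKNOWN: no 'cpu MHz' line found in $cpuinfo"
        return 3
    fi
    if [ "$mhz" -lt "$min_mhz" ]; then
        echo "CRITICAL: cpu0 at ${mhz} MHz (minimum ${min_mhz} MHz)"
        return 2
    fi
    echo "OK: cpu0 at ${mhz} MHz (minimum ${min_mhz} MHz)"
    return 0
}
```

Calling `check_cpu_freq 800` would approximate the one-liner, except that a reading of exactly 800 MHz counts as OK here. Returning UNKNOWN rather than OK when the file cannot be parsed avoids silently passing on hosts where `/proc/cpuinfo` lacks the expected line.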

Related tasks of earlier incidents: T163158 and T147905

Details

	Subject	Repo	Branch	Lines +/-
	base: add icinga check for CPU frequency on Dell R320	operations/puppet	production	+16 -0
	Icinga: add simple plugin to check CPU frequency	operations/puppet	production	+23 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Dzahn	T162850 CPU throttling on DELL PowerEdge R320
		Open		None	T163220 Add Icinga check for CPU frequency on Dell R320

Event Timeline

• MoritzMuehlenhoff created this task. Apr 18 2017, 4:42 PM

Restricted Application added a subscriber: Aklapper. Apr 18 2017, 4:42 PM

Dzahn claimed this task. Apr 18 2017, 7:38 PM

Dzahn updated the task description.

Dzahn added a parent task: T162850: CPU throttling on DELL PowerEdge R320. Apr 18 2017, 7:41 PM

Change 348966 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] Icinga: add simple plugin to check CPU frequency

https://gerrit.wikimedia.org/r/348966

gerritbot added a project: Patch-For-Review. Apr 19 2017, 5:09 PM

Change 348966 merged by Dzahn:
[operations/puppet@production] Icinga: add simple plugin to check CPU frequency

https://gerrit.wikimedia.org/r/348966

Change 348976 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] base: add icinga check for CPU frequency on Dell R320

https://gerrit.wikimedia.org/r/348976

Change 348976 merged by Dzahn:
[operations/puppet@production] base: add icinga check for CPU frequency on Dell R320

https://gerrit.wikimedia.org/r/348976

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=CPU+Freq

Screenshot from 2017-04-19 18:43:42.png (631×1 px, 133 KB)

In T332764 we are looking at migrating checks from Icinga to Alertmanager and, at the same time, trying to understand whether they are still valid or could be improved. Considering this check, I wonder whether it is still of value. I looked at Logstash and alerts for as long as Logstash has logs. If this is still valid, then:

Is the value of 600 still sound, or can we calculate that value dynamically based on some other data, ideally something exported by Prometheus?
Is it worthwhile having this for all servers and not just the R320?
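One possible way to derive the threshold dynamically, sketched under several assumptions: instead of a fixed MHz value, compare the current frequency with the minimum frequency the hardware itself advertises through the Linux cpufreq sysfs files `scaling_cur_freq` and `cpuinfo_min_freq` (both in kHz). The helper name `freq_above_floor`, the percentage parameter, and the choice of reference value are illustrative and are not part of the existing check. The sketch also assumes the cpufreq directory exists on the host.

```shell
#!/bin/sh
# Sketch only; the helper names and the "percent of minimum" rule are assumptions.

# Succeed if the argument is a non-empty string of digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Alert when the current frequency falls below pct percent of the
# advertised hardware minimum. Exit codes: 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
freq_above_floor() {
    pct=$1
    dir=${2:-/sys/devices/system/cpu/cpu0/cpufreq}
    cur=$(cat "$dir/scaling_cur_freq" 2>/dev/null)
    min=$(cat "$dir/cpuinfo_min_freq" 2>/dev/null)
    if ! is_uint "$cur" || ! is_uint "$min"; then
        echo "UNKNOWN: cpufreq data not available in $dir"
        return 3
    fi
    # Integer form of cur/min < pct/100, avoiding floating point.
    if [ $((cur * 100)) -lt $((min * pct)) ]; then
        echo "CRITICAL: ${cur} kHz is below ${pct}% of the ${min} kHz minimum"
        return 2
    fi
    echo "OK: ${cur} kHz is at least ${pct}% of the ${min} kHz minimum"
    return 0
}
```

Because the reference value comes from each host, the same call (for example `freq_above_floor 50`) could run on any server model, which also bears on the second question. Whether `scaling_cur_freq` actually reflects the kind of throttling described in this task would need to be verified on affected hardware.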

jbond reopened this task as Open. Mar 24 2023, 5:49 PM

jbond triaged this task as Medium priority.

jbond mentioned this in T332764: Port base host checks from Icinga to Alertmanager.

Maintenance_bot removed a project: Patch-For-Review. Mar 24 2023, 6:10 PM

Dzahn removed Dzahn as the assignee of this task. Mar 24 2023, 6:45 PM

Dzahn subscribed.

lmata added a project: Observability-Metrics. May 2 2023, 1:23 PM

lmata moved this task from Inbox to Radar on the observability board. Jul 18 2023, 8:59 PM

So, a few notes:

We no longer have any Dell R320s in production -- netbox reports 0 instances. https://netbox.wikimedia.org/dcim/device-types/37/
I manually queried Thanos for any instances of CPU frequency <= 200 MHz in the past month -- see P51432 for my work. I found only a handful of occurrences -- seriously, about a dozen 30-second intervals across a handful of machines -- and they were of two types:
- jobrunner machines in codfw, which isn't running jobrunner jobs right now
- one of the stat machines in eqiad, which are often very idle
If we were to run into this issue again, we are graphing CPU frequency on the Host Overview dashboard, so the problem would not be totally opaque.

Given all this I think we should just consider this task obsolete and remove any existing alerting we have.
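For anyone who wants to repeat a check like the Thanos query mentioned above, a rough sketch follows. It is not the content of P51432. It assumes the frequency is exported by node_exporter as `node_cpu_scaling_frequency_hertz` and that the query endpoint speaks the standard Prometheus HTTP API; the function names and the base URL argument are placeholders.

```shell
#!/bin/sh
# Sketch only; function names and the endpoint are assumptions.

# Build a PromQL expression selecting series whose lowest sample in the
# window was at or below the given frequency (MHz, converted to Hz).
low_freq_query() {
    mhz=${1:-200}
    window=${2:-30d}
    printf 'min_over_time(node_cpu_scaling_frequency_hertz[%s]) <= %s\n' \
        "$window" "$((mhz * 1000000))"
}

# Send the expression to a Prometheus-compatible /api/v1/query endpoint.
# $1 is the base URL of the query service (placeholder, supply your own).
run_low_freq_query() {
    base_url=$1
    curl -sG --data-urlencode "query=$(low_freq_query "$2" "$3")" \
        "$base_url/api/v1/query"
}
```

`low_freq_query` with no arguments yields an expression for 200 MHz over 30 days; an instant query like this only shows which CPUs dipped, not when, so a range query or the dashboard would still be needed to see the individual intervals.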

	F7653575: Screenshot from 2017-04-19 18:43:42.png
	Apr 20 2017, 1:45 AM
