A while ago Hashar observed that the CPU graphs on https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning?orgId=1 were off by a factor of two, and he created the additional 'CPU x 2 - 1 day moving median' graph. His argument (summarized in https://phabricator.wikimedia.org/T179378#4144265) is that our monitors mistake hyperthreaded hosts for having twice as many physical CPUs, and so understate load. I've been doing my best to ignore this possibility, but I recently moved a VM off of labvirt1006 (which, according to the 'normal' graph, was not overloaded, but according to the 2x graph was) and the user of that VM immediately reported that its performance improved. So, my questions: Is @hashar right that our CPU metrics are wrong for all hosts with hyperthreading enabled? And, if so, can we fix that somewhere deeper in the infrastructure so we don't need a hacked '2x' graph to detect actual problems?
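To illustrate the double-counting claim: with hyperthreading enabled, /proc/cpuinfo lists every hardware thread as its own "processor" entry, so any monitor that just counts those entries sees twice the number of physical cores. This is only a sketch with made-up /proc/cpuinfo content, not our actual monitoring code; distinguishing the two requires correlating the "physical id" and "core id" fields.

```python
# Hypothetical /proc/cpuinfo excerpt: 1 socket, 2 cores, 2 threads/core.
SAMPLE_CPUINFO = """\
processor\t: 0
physical id\t: 0
core id\t: 0

processor\t: 1
physical id\t: 0
core id\t: 1

processor\t: 2
physical id\t: 0
core id\t: 0

processor\t: 3
physical id\t: 0
core id\t: 1
"""

def count_cpus(cpuinfo):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo text."""
    logical = 0
    cores = set()
    phys = None
    for line in cpuinfo.splitlines():
        if line.startswith("processor"):
            logical += 1  # each hardware thread gets its own entry
        elif line.startswith("physical id"):
            phys = line.split(":")[1].strip()
        elif line.startswith("core id"):
            # a (socket, core) pair identifies one physical core
            cores.add((phys, line.split(":")[1].strip()))
    return logical, len(cores)

logical, physical = count_cpus(SAMPLE_CPUINFO)
print(logical, physical)  # 4 logical threads, but only 2 physical cores
```

A naive monitor that divides load by `logical` instead of `physical` would report half the real per-core load, which matches the behavior the '2x' graph compensates for.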
As part of SRE clinic duty, I'm reviewing all unassigned, needs-triage tasks in Operations and determining whether any are critical or are normal priority.
This task appears to be normal priority, and I have set it as such. If anyone on this task disagrees, please comment and correct it. Anything with a high priority or above typically requires response ahead of other items, so please ensure you have supporting documentation for why those priorities should be used.
In a Prometheus world, CPU utilization is calculated from the number of seconds each CPU has spent in each mode, taken from the counters in /proc/stat. For example, https://grafana.wikimedia.org/dashboard/db/host-overview uses that for its CPU utilization panel, divided by the number of cores to normalize the graph to 100%. There's more background at https://www.robustperception.io/understanding-machine-cpu-usage/. AFAICS the graphs in labs-capacity-planning are using graphite/diamond as their source; were you looking to port the dashboard to Prometheus instead?
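For reference, the calculation works like this: each CPU's /proc/stat counters are cumulative time spent in each mode, so utilization over an interval is 1 minus the idle share of the delta, averaged over all CPUs so the result tops out at 100% regardless of how many logical CPUs the kernel reports. The snapshot numbers below are invented for illustration; this is a sketch of the approach, not the node_exporter implementation.

```python
# Two snapshots of per-CPU /proc/stat-style counters (ticks per mode).
# Values are hypothetical.
def utilization(before, after):
    """before/after: {cpu: {mode: ticks}} snapshots; returns 0.0-1.0."""
    per_cpu = []
    for cpu in before:
        deltas = {m: after[cpu][m] - before[cpu][m] for m in before[cpu]}
        total = sum(deltas.values())
        # busy fraction for this CPU over the interval
        per_cpu.append(1 - deltas["idle"] / total)
    # averaging over CPUs normalizes the graph to at most 100%
    return sum(per_cpu) / len(per_cpu)

before = {
    "cpu0": {"user": 100, "system": 50, "idle": 850},
    "cpu1": {"user": 200, "system": 100, "idle": 700},
}
after = {
    "cpu0": {"user": 150, "system": 75, "idle": 875},   # 75% busy
    "cpu1": {"user": 225, "system": 100, "idle": 775},  # 25% busy
}
print(utilization(before, after))  # 0.5, i.e. 50% of total capacity
```

In PromQL the same idea is roughly `1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))`, which is the style of query the robustperception article describes. Because it averages over whatever CPUs the kernel exposes, it is consistent but still counts hyperthreads as full CPUs.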