Prometheus vs. CPU usage vs. hyperthreading
Closed, Declined; Public

Description

A while ago Hashar observed that the CPU graphs on https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning?orgId=1 were off by a factor of two, and he created the additional 'CPU x 2 - 1 day moving median' graph. His argument (summarized in https://phabricator.wikimedia.org/T179378#4144265) is that our monitors treat hosts with hyperthreading as having twice as many physical CPUs as they really have, and so understate load. I've been doing my best to ignore this possibility, but I recently moved a VM off of labvirt1006 (which was not overloaded according to the 'normal' graph but was overloaded according to the 2x graph), and the user of that VM immediately reported that its performance had improved. So, my questions: Is @hashar right that our CPU metrics are wrong for all hosts with hyperthreading enabled? And, if so, can we fix that somewhere deeper in the infrastructure so that we don't need a hacked '2x' graph to detect actual problems?

Event Timeline

RobH triaged this task as Medium priority. May 3 2018, 4:47 PM
RobH subscribed.

As part of SRE clinic duty, I'm reviewing all unassigned, needs-triage tasks in SRE to check whether any are critical or whether they are normal priority.

This task appears to be normal priority, and I have set it as such. If anyone on this task disagrees, please comment and correct it. Anything with high priority or above typically requires a response ahead of other items, so please make sure you have supporting documentation for why those priorities should be used.

Thanks!

In a Prometheus world, CPU utilization is calculated from the number of seconds each CPU has spent in each mode, taken from the numbers in /proc/stat. For example, https://grafana.wikimedia.org/dashboard/db/host-overview uses that for its CPU utilization graph, dividing by the number of cores to normalize the graph to 100%. There's more information at https://www.robustperception.io/understanding-machine-cpu-usage/. AFAICS the graphs in labs-capacity-planning use graphite/diamond as their source. Were you looking to port the dashboard to Prometheus instead?
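
To make the calculation described above concrete, here is a minimal sketch in Python. It is not the query used by any Wikimedia dashboard. The sample layout, the function name `utilization`, and the optional `n_cpus` override are all hypothetical, and "busy" is assumed to mean "not in idle mode". The override is only there to show how the figure changes if the same busy time is divided by a physical core count instead of the number of CPUs the kernel reports.

```python
from typing import Dict, Optional

# Hypothetical sample layout: {cpu_name: {mode: cumulative_seconds}},
# mirroring the per-CPU, per-mode counters exposed in /proc/stat.
Sample = Dict[str, Dict[str, float]]


def utilization(before: Sample, after: Sample, interval_s: float,
                n_cpus: Optional[int] = None) -> float:
    """Return the busy fraction between two samples.

    Assumption: every second not spent in "idle" mode counts as busy.
    By default the result is normalized by the number of CPUs present
    in the sample (logical CPUs, so hyperthreads count individually).
    Passing n_cpus divides the same busy time by a different CPU count.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Seconds spent idle during the interval, summed over all CPUs.
    idle = sum(after[cpu]["idle"] - before[cpu]["idle"] for cpu in after)
    # Total CPU-seconds available is interval * number of sampled CPUs.
    busy = interval_s * len(after) - idle
    divisor = len(after) if n_cpus is None else n_cpus
    return busy / (interval_s * divisor)


# Illustrative numbers only: 4 logical CPUs, each idle for 5 of 10 seconds.
before = {f"cpu{i}": {"idle": 100.0} for i in range(4)}
after = {f"cpu{i}": {"idle": 105.0} for i in range(4)}

print(utilization(before, after, 10.0))            # normalized by 4 logical CPUs
print(utilization(before, after, 10.0, n_cpus=2))  # same busy time over 2 cores
```

With these made-up numbers the first call reports 50% and the second 100%, which is the factor-of-two gap between the 'normal' and '2x' graphs. The sketch only shows the arithmetic; it does not settle which normalization better reflects real load on a hyperthreaded host.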

Vvjjkkii renamed this task from Prometheus vs. CPU usage vs. hyperthreading to n2daaaaaaa. Jul 1 2018, 1:14 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Samwilson renamed this task from n2daaaaaaa to Prometheus vs. CPU usage vs. hyperthreading. Jul 1 2018, 7:17 AM
Samwilson lowered the priority of this task from High to Medium.
Samwilson updated the task description.
Samwilson added a subscriber: Aklapper.

Boldly resolving; it seems things are working as intended.