A while ago Hashar observed that the CPU graphs on https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning?orgId=1 were off by a factor of two, and he created the additional 'CPU x 2 - 1 day moving median' graph. His argument (summarized in https://phabricator.wikimedia.org/T179378#4144265) is that our monitors mistake hyperthreaded hosts for having twice as many physical CPUs, and so understate load. I've been doing my best to ignore this possibility, but I recently moved a VM off of labvirt1006 (which, according to the 'normal' graph, was not overloaded, but according to the 2x graph was) and the user of that VM immediately reported that its performance improved. So, my questions: Is @hashar right that our CPU metrics are wrong for all hosts with hyperthreading enabled? And, if so, can we fix that somewhere deeper in the infrastructure so we don't need a hacked '2x' graph to detect actual problems?
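To illustrate the double-counting claim: with hyperthreading enabled, /proc/cpuinfo lists every hardware thread as its own "processor" entry, so any monitor that just counts those entries sees twice the number of physical cores. This is only a sketch with made-up /proc/cpuinfo content, not our actual monitoring code; distinguishing the two requires correlating the "physical id" and "core id" fields.

```python
# Hypothetical /proc/cpuinfo excerpt: 1 socket, 2 cores, 2 threads/core.
SAMPLE_CPUINFO = """\
processor\t: 0
physical id\t: 0
core id\t: 0

processor\t: 1
physical id\t: 0
core id\t: 1

processor\t: 2
physical id\t: 0
core id\t: 0

processor\t: 3
physical id\t: 0
core id\t: 1
"""

def count_cpus(cpuinfo):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo text."""
    logical = 0
    cores = set()
    phys = None
    for line in cpuinfo.splitlines():
        if line.startswith("processor"):
            logical += 1  # each hardware thread gets its own entry
        elif line.startswith("physical id"):
            phys = line.split(":")[1].strip()
        elif line.startswith("core id"):
            # a (socket, core) pair identifies one physical core
            cores.add((phys, line.split(":")[1].strip()))
    return logical, len(cores)

logical, physical = count_cpus(SAMPLE_CPUINFO)
print(logical, physical)  # 4 logical threads, but only 2 physical cores
```

A naive monitor that divides load by `logical` instead of `physical` would report half the real per-core load, which matches the behavior the '2x' graph compensates for.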
As part of SRE clinic duty, I'm reviewing all unassigned, needs-triage tasks in Operations and determining whether any are critical or are normal priority.
This task appears to be normal priority, and I have set it as such. If anyone on this task disagrees, please comment and correct it. Anything with a high priority or above typically requires response ahead of other items, so please ensure you have supporting documentation for why those priorities should be used.
In a Prometheus world, CPU utilization is calculated from the number of seconds each CPU has spent in each mode, taken from the counters in /proc/stat. For example, https://grafana.wikimedia.org/dashboard/db/host-overview uses that for its CPU utilization panel, divided by the number of cores to normalize the graph to 100%. There's more background at https://www.robustperception.io/understanding-machine-cpu-usage/. AFAICS the graphs in labs-capacity-planning are using graphite/diamond as their source; were you looking to port the dashboard to Prometheus instead?
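For reference, the calculation works like this: each CPU's /proc/stat counters are cumulative time spent in each mode, so utilization over an interval is 1 minus the idle share of the delta, averaged over all CPUs so the result tops out at 100% regardless of how many logical CPUs the kernel reports. The snapshot numbers below are invented for illustration; this is a sketch of the approach, not the node_exporter implementation.

```python
# Two snapshots of per-CPU /proc/stat-style counters (ticks per mode).
# Values are hypothetical.
def utilization(before, after):
    """before/after: {cpu: {mode: ticks}} snapshots; returns 0.0-1.0."""
    per_cpu = []
    for cpu in before:
        deltas = {m: after[cpu][m] - before[cpu][m] for m in before[cpu]}
        total = sum(deltas.values())
        # busy fraction for this CPU over the interval
        per_cpu.append(1 - deltas["idle"] / total)
    # averaging over CPUs normalizes the graph to at most 100%
    return sum(per_cpu) / len(per_cpu)

before = {
    "cpu0": {"user": 100, "system": 50, "idle": 850},
    "cpu1": {"user": 200, "system": 100, "idle": 700},
}
after = {
    "cpu0": {"user": 150, "system": 75, "idle": 875},   # 75% busy
    "cpu1": {"user": 225, "system": 100, "idle": 775},  # 25% busy
}
print(utilization(before, after))  # 0.5, i.e. 50% of total capacity
```

In PromQL the same idea is roughly `1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))`, which is the style of query the robustperception article describes. Because it averages over whatever CPUs the kernel exposes, it is consistent but still counts hyperthreads as full CPUs.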