A while ago Hashar observed that the CPU graphs on https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning?orgId=1 were off by a factor of two, and he added the extra 'CPU x 2 - 1 day moving median' graph. His argument (summarized in https://phabricator.wikimedia.org/T179378#4144265) is that our monitors count each hyperthread as a physical CPU, so hyperthreaded hosts appear to have twice as many cores as they really do and their load is understated.
I had been doing my best to ignore this possibility, but I recently moved a VM off of labvirt1006 (which was not overloaded according to the 'normal' graph but was overloaded according to the 2x graph), and the user of that VM immediately reported that its performance had improved.
So, my questions: Is @hashar right that our CPU metrics are wrong for all hosts with hyperthreading enabled? And if so, can we fix that somewhere deeper in the infrastructure so we don't need a hacked '2x' graph to detect actual problems?
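For reference, one way a collector could tell logical processors apart from physical cores is by parsing /proc/cpuinfo: each hyperthread gets its own "processor" entry, but siblings share the same (physical id, core id) pair. The sketch below is illustrative only (the sample text is a made-up two-thread, one-core excerpt, not output from any of our hosts), not a description of how our current monitoring actually works:

```python
def count_cpus(cpuinfo: str):
    """Return (logical_processors, physical_cores) from /proc/cpuinfo text.

    Logical processors are counted by "processor" entries; physical cores
    are the unique (physical id, core id) pairs, so hyperthread siblings
    collapse into one core.
    """
    logical = 0
    cores = set()
    phys_id = None
    for line in cpuinfo.splitlines():
        if ":" not in line:
            continue
        key, _, val = line.partition(":")
        key, val = key.strip(), val.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            phys_id = val
        elif key == "core id":
            # "core id" follows "physical id" within each processor block
            cores.add((phys_id, val))
    return logical, len(cores)


# Hypothetical excerpt: two hyperthreads on the same physical core.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0
processor : 1
physical id : 0
core id : 0
"""

print(count_cpus(SAMPLE))  # (2, 1): 2 logical processors, 1 physical core
```

If a metrics pipeline normalizes load by the first number when the second is what matters for sustained CPU-bound work, it will report half the true utilization, which matches the 2x discrepancy described above.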