/proc/stat has a steal metric defined as:
steal
Stolen time, which is the time spent in other operating systems when running in a virtualized environment
As I understand it, this is the amount of time the hypervisor spent running other instances' tasks instead of this instance's own work. A high value would be a good indication that the underlying compute host is CPU starved.
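For reference, a minimal sketch of how a steal percentage can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. This is not necessarily how the Grafana table computes its value; the function names (`parse_cpu_line`, `steal_percent`, `sample_steal`) and the one-second interval are assumptions made for illustration. The field order follows proc(5).

```python
import time

# Field order on the aggregate "cpu" line of /proc/stat, per proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels expose fewer columns; zip() simply stops early.
    return dict(zip(FIELDS, values))


def steal_percent(before, after):
    """Steal share (in %) of all ticks elapsed between two samples."""
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight fields go into the total.
    counted = FIELDS[:8]
    total = sum(after.get(f, 0) - before.get(f, 0) for f in counted)
    if total <= 0:
        return 0.0
    stolen = after.get("steal", 0) - before.get("steal", 0)
    return 100.0 * stolen / total


def sample_steal(interval=1.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart (hypothetical helper)."""
    def read():
        with open(path) as stat_file:
            return parse_cpu_line(stat_file.readline())

    before = read()
    time.sleep(interval)
    return steal_percent(before, read())
```

Running `sample_steal()` inside an instance should give a figure comparable to the Steal % column in the tables below, although the dashboard may average over a different window.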
I crafted a Grafana table that shows the 10 instances with the highest current steal value (table). On March 22nd most of them were on labvirt1004 (see also T161006), which was CPU starved.
Host | Instance | Steal % |
---|---|---|
labvirt1004 | cdh3-5.analytics.eqiad.wmflabs | 18% |
labvirt1004 | citoid-jessie-test.services.eqiad.wmflabs | 16% |
labvirt1004 | ws-web.wikistream.eqiad.wmflabs | 14% |
labvirt1004 | tools-k8s-master-01.tools.eqiad.wmflabs | 13% |
labvirt1010 | novaproxy-01.project-proxy.eqiad.wmflabs | 13% |
labvirt1004 | labs-dynamicproxy-test.openstack.eqiad.wmflabs | 13% |
labvirt1004 | deployment-zotero01.deployment-prep.eqiad.wmflabs | 13% |
labvirt1004 | language-cx3.language.eqiad.wmflabs | 12% |
labvirt1008 | wikimetrics-test.wikimetrics.eqiad.wmflabs | 11% |
labvirt1008 | deployment-changeprop.deployment-prep.eqiad.wmflabs | 11% |
An example as of May 2nd 2017:
Instance | Steal % |
---|---|
deployment-cache-text04.deployment-prep.eqiad.wmflabs | 27% |
tools-flannel-etcd-02.tools.eqiad.wmflabs | 23% |
videodev.video.eqiad.wmflabs | 22% |
deployment-aqs02.deployment-prep.eqiad.wmflabs | 21% |
deployment-ms-be04.deployment-prep.eqiad.wmflabs | 20% |
netdata-2.netdata.eqiad.wmflabs | 18% |
xmlrcs.huggle.eqiad.wmflabs | 18% |
togetherjs.visualeditor.eqiad.wmflabs | 16% |
deployment-sentry01.deployment-prep.eqiad.wmflabs | 16% |
vitalsigns-01.dashiki.eqiad.wmflabs | 15% |