
Investigate instances with high "steal" CPU
Closed, Resolved, Public

Description

/proc/stat has a steal metric defined as:

steal
Stolen time, which is the time spent in other operating systems when running in a virtualized environment

As I understand it, this is the amount of time the hypervisor spent running other instances' tasks instead of this instance's own work :/ A high value would be a good indication that the underlying compute host is CPU starved.
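For illustration, here is a minimal Python sketch of how a steal percentage can be derived from two samples of the aggregate `cpu` line in /proc/stat. It assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal, then the guest fields). The function names are hypothetical and not part of any tool mentioned in this task.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of integer counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    counters = [int(value) for value in fields[1:]]
    if len(counters) < 8:
        raise ValueError("kernel does not report a steal counter")
    return counters


def steal_percent(before, after):
    """Return steal time as a percentage of all CPU time elapsed between two samples.

    Only the first eight counters are summed (user .. steal); guest time is
    already accounted for in user/nice, so adding it would double count.
    """
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total  # index 7 is the steal counter


def read_cpu_counters(path="/proc/stat"):
    """Read the current aggregate CPU counters from /proc/stat."""
    with open(path) as stat_file:
        for line in stat_file:
            if line.startswith("cpu "):
                return parse_cpu_line(line)
    raise RuntimeError("no aggregate 'cpu' line found")
```

To use it on an instance, call `read_cpu_counters()` twice a few seconds apart and pass both results to `steal_percent()`. The counters are cumulative since boot, so only the difference between two samples is meaningful.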

I crafted a Grafana table that shows the top 10 instances by current steal value. On March 22nd most of them were on labvirt1004 (see also T161006), which was CPU starved.

Host         Instance                                            Steal %
labvirt1004  cdh3-5.analytics.eqiad.wmflabs                      18%
labvirt1004  citoid-jessie-test.services.eqiad.wmflabs           16%
labvirt1004  ws-web.wikistream.eqiad.wmflabs                     14%
labvirt1004  tools-k8s-master-01.tools.eqiad.wmflabs             13%
labvirt1010  novaproxy-01.project-proxy.eqiad.wmflabs            13%
labvirt1004  labs-dynamicproxy-test.openstack.eqiad.wmflabs      13%
labvirt1004  deployment-zotero01.deployment-prep.eqiad.wmflabs   13%
labvirt1004  language-cx3.language.eqiad.wmflabs                 12%
labvirt1008  wikimetrics-test.wikimetrics.eqiad.wmflabs          11%
labvirt1008  deployment-changeprop.deployment-prep.eqiad.wmflabs 11%

An example as of May 2nd 2017:

Instance                                                 Steal %
deployment-cache-text04.deployment-prep.eqiad.wmflabs    27%
tools-flannel-etcd-02.tools.eqiad.wmflabs                23%
videodev.video.eqiad.wmflabs                             22%
deployment-aqs02.deployment-prep.eqiad.wmflabs           21%
deployment-ms-be04.deployment-prep.eqiad.wmflabs         20%
netdata-2.netdata.eqiad.wmflabs                          18%
xmlrcs.huggle.eqiad.wmflabs                              18%
togetherjs.visualeditor.eqiad.wmflabs                    16%
deployment-sentry01.deployment-prep.eqiad.wmflabs        16%
vitalsigns-01.dashiki.eqiad.wmflabs                      15%

Event Timeline

bd808 triaged this task as Medium priority. Mar 26 2017, 7:38 PM
bd808 moved this task from Triage to Backlog on the Cloud-Services board.

The steal CPU definitely reflects CPU overuse on a labvirt machine (this was seen via T165753 as well). Maybe it could be used to build a metric about Labs performance.

hashar claimed this task.

I had filed this task to understand what high steal CPU is. It is outside the control of the instances. There is nothing actionable here :]