
Prometheus graph incorrectly sums CPU user and CPU guest
Open, Medium, Public


When looking at the Prometheus CPU graph for a machine running guests, the graph is misleading. CPU guest is summed with CPU user, although user already contains guest. That causes the guest metric to be taken into account twice.

An example is an OpenStack compute node (but Ganeti hosts probably have the same issue):

prometheus_cpu.png (284×586 px, 83 KB)

The proc(5) documentation for /proc/stat does not mention guest being included in user, but the entry for /proc/[pid]/stat does:

*cutime* ... This includes guest time, cguest_time (time spent running a virtual CPU).

From Linux:

void account_guest_time(struct task_struct *p, u64 cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;

	/* Add guest time to process. */
	p->utime += cputime;
	account_group_user_time(p, cputime);
	p->gtime += cputime;

	/* Add guest time to cpustat. */
	if (task_nice(p) > 0) {
		cpustat[CPUTIME_NICE] += cputime;
		cpustat[CPUTIME_GUEST_NICE] += cputime;
	} else {
		cpustat[CPUTIME_USER] += cputime;
		cpustat[CPUTIME_GUEST] += cputime;
	}
}

I.e. guest is added to user, and guest_nice is added to nice.
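To make the double counting concrete, here is a minimal Python model of the accounting in the kernel function quoted above. It is only a sketch: the tick values are made up, and `account_user_time` is a simplified, hypothetical stand-in for how plain user-space time would be recorded.

```python
from collections import Counter

def account_user_time(cpustat, cputime, nice=0):
    """Simplified stand-in: plain user-space time hits one counter only."""
    cpustat["nice" if nice > 0 else "user"] += cputime

def account_guest_time(cpustat, cputime, nice=0):
    """Mirrors the quoted kernel logic: each guest tick hits two counters."""
    if nice > 0:
        cpustat["nice"] += cputime
        cpustat["guest_nice"] += cputime
    else:
        cpustat["user"] += cputime
        cpustat["guest"] += cputime

cpustat = Counter()
account_user_time(cpustat, 30)            # 30 ticks of ordinary user work
account_guest_time(cpustat, 50)           # 50 ticks running a vCPU
account_guest_time(cpustat, 20, nice=5)   # 20 ticks running a niced vCPU

elapsed = 30 + 50 + 20                    # ticks that actually passed
stacked = sum(cpustat.values())           # what a stacked graph adds up
overshoot = stacked - elapsed             # equals guest + guest_nice

print(dict(cpustat), elapsed, stacked, overshoot)
```

In this model, stacking every counter overshoots the elapsed time by exactly guest + guest_nice, which is the extra area a stacked graph would show.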

Not sure what kind of magic needs to happen in the CPU Graph metrics. Maybe user and nice can be tweaked as:

  • user = user - guest
  • nice = nice - guest_nice
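The adjustment proposed above can be sketched in Python against a `cpu` line from /proc/stat. This is an illustration only, not the dashboard's actual query: the sample numbers are invented, the helper names are hypothetical, and the field order is the one documented in proc(5).

```python
# Field order of a "cpu" line in /proc/stat, per proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

def parse_cpu_line(line):
    """Map the numeric columns of a 'cpu ...' line to field names."""
    _name, *values = line.split()
    return dict(zip(FIELDS, map(int, values)))

def without_guest(stat):
    """Return a copy where user and nice no longer include guest time."""
    adjusted = dict(stat)
    # Older kernels may not report guest fields; treat them as 0.
    adjusted["user"] = stat["user"] - stat.get("guest", 0)
    adjusted["nice"] = stat["nice"] - stat.get("guest_nice", 0)
    return adjusted

# Made-up sample values, in clock ticks.
sample = "cpu 8000 300 1200 50000 100 0 40 0 5000 200"
raw = parse_cpu_line(sample)
adjusted = without_guest(raw)

print(adjusted["user"], adjusted["nice"])
```

With the adjusted values, stacking all ten fields adds up to the same total as the first eight raw fields, so guest time is counted once.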

Event Timeline

hashar triaged this task as Medium priority. Jun 16 2017, 11:05 AM

The prometheus-machine-stats dashboard has been flagged for deletion via T178690: Better organization for SRE grafana dashboards. I guess the replacement is and it has the same issue: user and guest_user are stacked (and I am assuming nice and nice_user are affected).

This can be seen on one of the Ganeti hosts, for example ganeti1005.