
Prometheus graph incorrectly sums CPU user and CPU guest
Open, Medium, Public

Description

When looking at the Prometheus CPU graph for a machine running guests, the graph is misleading. CPU guest is summed with CPU user, although user already contains guest. That causes the guest metric to be counted twice.

An example is an OpenStack compute node (but Ganeti hosts probably have the same issue):
https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?var-server=labvirt1004&var-datasource=eqiad%20prometheus%2Fops

The proc(5) documentation for /proc/stat does not mention that guest is included in user, but the entry for /proc/[pid]/stat does:

*cutime* ... This includes guest time, cguest_time (time spent running a virtual CPU).

From Linux:

kernel/sched/cputime.c
void account_guest_time(struct task_struct *p, u64 cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;

	/* Add guest time to process. */
	p->utime += cputime;
	account_group_user_time(p, cputime);
	p->gtime += cputime;

	/* Add guest time to cpustat. */
	if (task_nice(p) > 0) {
		cpustat[CPUTIME_NICE] += cputime;
		cpustat[CPUTIME_GUEST_NICE] += cputime;
	} else {
		cpustat[CPUTIME_USER] += cputime;
		cpustat[CPUTIME_GUEST] += cputime;
	}
}

I.e. guest time is added to both user and guest, and niced guest time is added to both nice and guest_nice.
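To make the double counting concrete, here is a minimal Python sketch that mirrors the accounting in the kernel function quoted above. It is a toy model only: the dictionary keys, the `account_guest_time` helper and the tick values are illustrative assumptions, not a kernel or Prometheus API.

```python
def account_guest_time(cpustat, cputime, nice=0):
    """Toy model of the kernel accounting: guest time lands in two counters."""
    if nice > 0:
        cpustat["nice"] += cputime
        cpustat["guest_nice"] += cputime
    else:
        cpustat["user"] += cputime
        cpustat["guest"] += cputime


# Hypothetical workload: only vCPU threads run on this CPU.
cpustat = {"user": 0, "nice": 0, "guest": 0, "guest_nice": 0}
account_guest_time(cpustat, 100)         # a vCPU thread runs for 100 ticks
account_guest_time(cpustat, 40, nice=5)  # a niced vCPU thread runs for 40 ticks

# Summing every mode counts the guest ticks twice.
naive_total = sum(cpustat.values())
# user and nice alone already cover all the CPU time consumed.
actual_total = cpustat["user"] + cpustat["nice"]

print(naive_total, actual_total)
```

In this model the naive sum is twice the time actually consumed, which is the effect seen on the graph when guest is stacked on top of user.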

I am not sure what needs to change in the CPU graph metrics. Maybe user and nice can be adjusted as:

  • user = user - guest
  • nice = nice - guest_nice
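The adjustment proposed above could be sketched as follows, assuming the counters are read from a `cpu` line of /proc/stat in the field order documented in proc(5). The helper names `parse_cpu_line` and `exclude_guest` and the sample values are made up for illustration; this is not how the dashboard or the node exporter is implemented.

```python
# Field order of a "cpu" line in /proc/stat, per proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Map the numeric columns of a /proc/stat cpu line to field names."""
    parts = line.split()
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # zip() stops early on older kernels that lack the guest columns.
    return dict(zip(FIELDS, values))


def exclude_guest(stat):
    """Return a copy where user and nice no longer contain guest time."""
    adjusted = dict(stat)
    adjusted["user"] = stat["user"] - stat.get("guest", 0)
    adjusted["nice"] = stat["nice"] - stat.get("guest_nice", 0)
    return adjusted


# Invented sample values, not taken from a real host.
sample = "cpu  1000 50 300 5000 20 0 10 0 400 30"
raw = parse_cpu_line(sample)
adjusted = exclude_guest(raw)

print(sum(raw.values()), sum(adjusted.values()))
```

After the subtraction, the modes can be stacked without counting guest time twice: the adjusted total equals the sum of the raw non-guest fields.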

Event Timeline

hashar created this task. Mar 23 2017, 5:02 PM
Restricted Application added a subscriber: Aklapper. Mar 23 2017, 5:02 PM
hashar triaged this task as Medium priority. Jun 16 2017, 11:05 AM

This is still valid. https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats sums all CPU metrics, but guest (at least) is already part of user.

fgiunchedi moved this task from Inbox to Backlog on the observability board. Mon, Jul 20, 2:08 PM