Convince nova-scheduler to pay attention to CPU metrics
Closed, Resolved · Public

Description

There's some reason to think that our metrics have been lying to us about CPU usage on labvirt hosts, as per

http://www.blueshiftblog.com/?p=3822

If that's true then a few of our virt nodes are probably CPU starved. For now it should be possible to shuffle around some exec nodes and get CPU usage under 50% for all hosts.

Andrew created this task. Mar 21 2017, 4:16 PM
Restricted Application added a project: Labs. Mar 21 2017, 4:16 PM
Restricted Application added a subscriber: Aklapper.
hashar added a subscriber: hashar. Mar 21 2017, 4:26 PM

I created a crude graph that plots, for each labvirt node, the CPU usage multiplied by 2:

https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning?panelId=91&fullscreen&from=now-7d&to=now

diamond gets the CPU utilization from /proc/stat and divides it by the number of CPUs exposed, which is double the physical core count when using HT. So 100% usage of the physical cores ends up showing as 50%.
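To make the arithmetic concrete, here is a minimal sketch of that calculation. It is not diamond's collector code: the function names and the sample counter values are made up for illustration, and the host is assumed to have 8 physical cores exposed as 16 logical CPUs. The "* 2" correction is crude, since HT siblings do add some throughput, so treat the adjusted figure as an upper bound.

```python
def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest time is already included in user/nice, so only the first 8 are summed.
    total = sum(values[:8])
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    return total - idle, total


def utilization_percent(before, after):
    """Utilization between two samples, averaged over all logical CPUs."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    return 100.0 * (busy1 - busy0) / (total1 - total0)


def core_adjusted_percent(reported_percent, threads_per_core=2):
    """Crude correction matching the 'CPU usage * 2' graph, capped at 100%."""
    return min(100.0, reported_percent * threads_per_core)


# Hypothetical samples: 16 logical CPUs over 100 jiffies each (1600 in total),
# of which 800 were busy, i.e. one busy thread per physical core.
before = "cpu 1000 0 0 9000 0 0 0 0 0 0"
after = "cpu 1800 0 0 9800 0 0 0 0 0 0"

reported = utilization_percent(before, after)
print(reported, core_adjusted_percent(reported))
```

With these samples the reported figure is 50% even though every physical core has a busy thread, which is the effect described above.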

Maybe Nova has a scheduler option that takes CPU usage into account, but I haven't found anything like that :(

greg added a subscriber: greg. Mar 21 2017, 8:21 PM

labvirt1004 has had a load bump since March 7th

and that fits nicely with a shift of CPU from labvirt1001 to labvirt1004:

I thought it could have been due to labvirt1001 / labvirt1002 being removed from the scheduler pool on March 3rd ( d0ade4b0eaf977f53ad556fb6b6ace5576ca5c16 ). That is a possibility, but the timing does not fit.

The load on labvirt1004 doubled on March 7 between 23:00 and midnight UTC (note that the graph multiplies the values by 2).

I lack data from the OpenStack side, but one theory is that a lot of Nodepool instances end up being scheduled on the same host, maybe because it is the one with the fewest vCPUs allocated or is otherwise the favorite candidate. For instances with little load that makes sense, but the CI instances typically consume a lot of CPU when in use.

We would need to dig into the scheduler logs, if there is any such thing.

It is actually possible to explicitly tell the scheduler to not put multiple nodepool instances on the same labvirt. That would work if the total number of nodepool instances is always < the number of labvirts, which I'm not sure is true.

If it /is/ true and you wanted to try it, you'd just add an additional argument to the VM creation command:

--hint group=a24491db-b671-4f3b-b70d-7c987c620b2b

(That's an 'anti-affinity' group that I just now created.)

I'd be interested in giving this a try, but it should probably be at a time when we're both around and watching.

I guess that prevents the scheduler from selecting a compute node that already has an instance in that anti-affinity group, doesn't it?

We have a pool of 25 instances, which exceeds the 14 labvirts we have :(
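A toy model of why a strict anti-affinity group cannot hold the whole pool. This is only a counting sketch, not Nova's filter implementation; the host and instance names are placeholders and the function is hypothetical.

```python
def place_with_anti_affinity(hosts, instance_count):
    """Place instances so that no host holds two members of the group.

    Returns (placements, rejected): a dict of instance -> host, and the
    list of instances for which no eligible host was left.
    """
    free_hosts = list(hosts)
    placements = {}
    rejected = []
    for i in range(instance_count):
        name = f"nodepool-{i:02d}"  # placeholder instance name
        if free_hosts:
            placements[name] = free_hosts.pop(0)
        else:
            rejected.append(name)
    return placements, rejected


# 14 hosts and a pool of 25 instances, as in the comments above.
hosts = [f"labvirt10{n:02d}" for n in range(1, 15)]  # placeholder names
placed, rejected = place_with_anti_affinity(hosts, 25)
print(len(placed), "placed,", len(rejected), "rejected")
```

Once each of the 14 hosts holds one member of the group, the remaining 11 requests have nowhere to go.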

Nodepool can pass some meta properties to the instance; we used that to pass show=true a while ago. But I guess that is unrelated to the nova boot hint :/

Change 344051 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet] Nova scheduler: Prefer virthosts with lower CPU usage

https://gerrit.wikimedia.org/r/344051

To summarize the wild guesses I made to Andrew over IRC:

The scheduler possibly weighs the hosts, the default being all_weighers, which seems to spread RAM usage (e.g. a compute node with low memory usage will most probably be chosen). There are other weights playing a role as well.

scheduler_host_subset_size = 1 (default), whose documentation says: "A value of 1 chooses the first host returned by the weighing functions."

So if the weigher always ends up emitting the same host in first position, that host will always be picked. Assuming CPU load is not taken into account and the host's weight does not change much when 8-9 Nodepool instances are added to it, its CPU usage will eventually skyrocket :(
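The guess above can be illustrated with a small simulation. This is a simplified model and not Nova's scheduler code: hosts are ranked by free RAM only, and the host names, RAM figures and the 4096 MB instance size are all hypothetical.

```python
import random


def rank_by_free_ram(free_ram_mb):
    """Order hosts from most to least free RAM (stand-in for a RAM-spreading weigher)."""
    return sorted(free_ram_mb, key=free_ram_mb.get, reverse=True)


def schedule(free_ram_mb, instance_ram_mb, count, subset_size=1, rng=None):
    """Place `count` instances, picking at random among the top `subset_size` hosts."""
    rng = rng or random.Random(0)
    free = dict(free_ram_mb)  # work on a copy, leave the caller's data alone
    placements = []
    for _ in range(count):
        candidates = [h for h in rank_by_free_ram(free) if free[h] >= instance_ram_mb]
        if not candidates:
            raise RuntimeError("no host has enough free RAM")
        chosen = rng.choice(candidates[:subset_size])
        free[chosen] -= instance_ram_mb
        placements.append(chosen)
    return placements


# Hypothetical free RAM per host, in MB.
hosts = {"virt-a": 200_000, "virt-b": 120_000, "virt-c": 110_000}

# subset_size=1: nine small instances barely dent virt-a's lead, so it wins every time.
concentrated = schedule(hosts, 4096, 9, subset_size=1)
# subset_size=2: each placement is drawn from the two best-ranked hosts.
spread = schedule(hosts, 4096, 9, subset_size=2)
print(concentrated)
print(spread)
```

In this model, nine 4 GB instances take about 36 GB from a host that leads the next one by 80 GB of free RAM, so the ranking never changes and every instance lands on the same host unless the subset size is raised.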

IoOpsFilter tentatively filters out hosts having high IO. Per Andrew the name is misleading: it is IO in the sense of OpenStack operations such as building an instance (cough nodepool), snapshotting, etc. Possibly it could be used to filter out compute nodes that already have X instances being built.
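A rough sketch of that idea, assuming the filter simply compares a per-host count of in-flight operations against a threshold (the X above). The function, host names and counts are hypothetical; this is not Nova's IoOpsFilter code, and the parameter name only mirrors my understanding of Nova's max_io_ops_per_host option.

```python
def io_ops_filter(num_io_ops_by_host, max_io_ops_per_host):
    """Keep only hosts with fewer in-flight operations than the threshold."""
    return [
        host
        for host, ops in num_io_ops_by_host.items()
        if ops < max_io_ops_per_host
    ]


# Hypothetical counts of instances currently being built or snapshotted per host.
in_flight = {"virt-a": 5, "virt-b": 1, "virt-c": 3}
eligible = io_ops_filter(in_flight, max_io_ops_per_host=3)
print(eligible)
```

With a threshold of 3, only the host with a single build in progress stays eligible, so a burst of Nodepool builds would be pushed away from hosts that are already busy building.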

@Andrew smarter option: influence the scheduler based on the actual CPU usage metrics:

cpu.frequency - Current CPU frequency
cpu.user.time - CPU user mode time
cpu.kernel.time - CPU kernel time
cpu.idle.time - CPU idle time
cpu.iowait.time - CPU I/O wait time
cpu.user.percent - CPU user mode percentage
cpu.kernel.percent - CPU kernel percentage
cpu.idle.percent - CPU idle percentage
cpu.iowait.percent - CPU I/O wait percentage
cpu.percent - Generic CPU utilization

"if a host has high cpu.idle.time then it's a good candidate! But if it has high iowait.time then it's bad."

That would potentially place instances solely based on which host has the least CPU consumption. If a host happens to have full RAM, it would be discarded by the RAMFilter. The actual IO is not taken into account, but iowait/system can probably be added to the mix.
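As a sketch of how such a metrics-based choice could work: each host's weight is a sum of metric values multiplied by ratios, computed after a RAM check has removed full hosts. Everything here is an assumption for illustration. The "name=ratio, name=ratio" string is loosely modeled on my understanding of Nova's metrics weight setting, the ratios are not the ones from the Gerrit changes, and the host data is invented.

```python
def parse_weight_setting(setting):
    """Parse 'name=ratio, name=ratio' into a dict of metric name -> float ratio."""
    ratios = {}
    for item in setting.split(","):
        name, _, ratio = item.strip().partition("=")
        ratios[name.strip()] = float(ratio)
    return ratios


def metrics_weight(host_metrics, ratios):
    """Weighted sum of the metrics: positive ratios attract, negative ones repel."""
    return sum(ratio * host_metrics[name] for name, ratio in ratios.items())


def pick_host(hosts, ratios, needed_ram_mb):
    """Drop hosts without enough free RAM, then return the highest-weighted one."""
    candidates = {
        name: metrics
        for name, metrics in hosts.items()
        if metrics["free_ram_mb"] >= needed_ram_mb
    }
    if not candidates:
        raise RuntimeError("no host passed the RAM check")
    return max(candidates, key=lambda name: metrics_weight(candidates[name], ratios))


# High idle is good, high iowait is bad (illustrative ratios).
setting = "cpu.idle.percent=1.0, cpu.iowait.percent=-1.0"
ratios = parse_weight_setting(setting)

# Hypothetical hosts; percentages are expressed as fractions.
hosts = {
    "virt-a": {"cpu.idle.percent": 0.10, "cpu.iowait.percent": 0.05, "free_ram_mb": 64000},
    "virt-b": {"cpu.idle.percent": 0.70, "cpu.iowait.percent": 0.02, "free_ram_mb": 2000},
    "virt-c": {"cpu.idle.percent": 0.60, "cpu.iowait.percent": 0.01, "free_ram_mb": 32000},
}
best = pick_host(hosts, ratios, needed_ram_mb=4096)
print(best)
```

Here the idlest host (virt-b) is dropped by the RAM check because it is nearly full, and the next idlest host wins over the one with the most free RAM.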

And a paper that happens to mention our case, https://01.org/sites/default/files/utilization_based_scheduing_in_openstack_compute_nova_1.docx , shows that the default weighers are all about spreading RAM usage and do not take I/O or CPU usage into account :]

At 10:41 UTC I spawned an instance integration-c1.integration (24fe397e-7bd3-4c12-bde3-3e211c5f2671) with 32GB of RAM. It has been scheduled on labvirt1004, which might cause the load to shift to another labvirt.

It seems the Linux kernel in a guest is smart enough to find out that instruction execution is being delayed by other instances on the same host (cpu steal). Filed subtask T161118 to investigate that. 8 of the top 10 instances are running on labvirt1004.

The scheduler now spreads the Nodepool instances across multiple compute nodes, and CI was responsive again yesterday. It looks like the hack of artificially consuming 32GB of RAM on labvirt1004 did the trick.

Change 344689 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] Nova.conf compute_monitors=virt_driver

https://gerrit.wikimedia.org/r/344689

Andrew triaged this task as "Normal" priority. Mar 24 2017, 9:04 PM
Andrew changed the title from "Rebalance tools exec nodes with an eye towards CPU usage" to "Convince nova-scheduler to pay attention to CPU metrics". Mar 24 2017, 9:10 PM
bd808 moved this task from Triage to In Progress on the Labs board. Mar 26 2017, 7:40 PM

Change 344689 merged by Andrew Bogott:
[operations/puppet@production] Nova scheduler: Use relative cpu percentages when scheduling.

https://gerrit.wikimedia.org/r/344689

Change 344051 abandoned by Andrew Bogott:
Nova scheduler: Prefer virthosts with lower CPU usage

Reason:
Dropping in favor of https://gerrit.wikimedia.org/r/#/c/344689/

https://gerrit.wikimedia.org/r/344051

Mentioned in SAL (#wikimedia-releng) [2017-03-29T15:18:16Z] <hashar> Delete a 32GB instance integration-ci - T161006

Change 345381 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] nova scheduler: scheduler_host_subset_size = 2

https://gerrit.wikimedia.org/r/345381

Change 345381 merged by Andrew Bogott:
[operations/puppet@production] nova scheduler: scheduler_host_subset_size = 2

https://gerrit.wikimedia.org/r/345381

Andrew closed this task as "Resolved". Mar 30 2017, 6:24 PM

Seems to be working.

Thanks a ton Andrew!

antoine-approve

Mentioned in SAL (#wikimedia-releng) [2017-04-14T12:29:21Z] <hashar> Delete integration-c1 instance (32GB RAM) on labvirt1004. It was used as a workaround for T161006