
labtestvirt2003: test different power management / CPU setups for faster kvm
Closed, ResolvedPublic

Description

Some cloudvirt machines have very slow CPUs, for example cloudvirt1004, cloudvirt1005 and cloudvirt1012. Their CPUs are not very recent, but that alone does not explain why they would be two to three times slower in raw CPU power.

Upon investigation, this seems to affect HP ProLiant machines, and @hashar suspects it could be due to the HP power management system; see e.g. https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c03031625&docLocale=en_US

labtestvirt2003.codfw.wmnet is a test machine and could be used to measure CPU performance. The tests to conduct would be to run the bash one-liner below directly on the machine, and eventually under KVM. Then check the HP BIOS settings for power management, try a different profile and rerun the benchmark.

The server is a ProLiant DL360 Gen9 with Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz.

Some tools @hashar used to run single-threaded math; the first one should be sufficient:

  • time $(i=1; while (( i < 2000000 )); do (( i ++ )); done)
  • sysbench --test=cpu run
  • stress-ng --cpu 1 --cpu-ops=4000

That CPU is found on a lot of MediaWiki application servers. On mw1307.eqiad.wmnet the shell one-liner takes 7.5 to 8 seconds.
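Since a single run of the one-liner varies a bit from one invocation to the next, it can help to repeat it and report the mean. The following is only a minimal sketch, assuming bash and awk are available; `bench_loop` and `avg_runs` are hypothetical helper names and not something used in this task.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original benchmark commands.

# Same busy loop as the one-liner; the iteration count is a parameter.
bench_loop() {
    local limit=${1:-2000000} i=1
    while (( i < limit )); do (( i++ )); done
}

# Run bench_loop RUNS times and print the mean wall-clock time in seconds.
avg_runs() {
    local runs=${1:-3} limit=${2:-2000000} total=0 t n
    local TIMEFORMAT=%R   # make the `time` keyword print only real seconds
    for (( n = 0; n < runs; n++ )); do
        # `time` reports on stderr, so redirect it into the substitution.
        t=$( { time bench_loop "$limit"; } 2>&1 )
        total=$(awk -v a="$total" -v b="$t" 'BEGIN { print a + b }')
    done
    awk -v t="$total" -v r="$runs" 'BEGIN { printf "%.2f\n", t / r }'
}
```

Calling `avg_runs 5` would run the default 2,000,000-iteration loop five times and print the average, which makes bare-metal and KVM figures a little easier to compare.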

Event Timeline

aborrero triaged this task as Medium priority. Jun 5 2019, 9:27 AM
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Some initial baseline data:

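| host | type | time test avg | stress-ng | sysbench |
| --- | --- | --- | --- | --- |
| labtestvirt2003 | baremetal | 6.76s | 26m 4.38s | 10.07s |
| labtestvirt2003 | qemu/kvm | 6.85s | 26m 8.83s | 10.14s |
| cloudvirt1004 | baremetal | 10.80s | n/a | n/a |
| cloudvirt1004 | qemu/kvm [1] | 11.34s | 40 mins, 13.80 secs | 15.04 |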

(Note that labtestvirt2003 is completely idle, while cloudvirt1004 is under sustained load with ~15 running virtual machines. Since we're not pinning system resources, the additional noise on the hypervisors will affect the test results.)

Once I get access to labtestvirt2003's IPMI/iLO I'll collect some performance metrics for the different power profiles.

stress-ng --cpu 1 --cpu-ops=400000 <- it turns out that 400k is way too many operations; I am not sure why I indicated that. Anyway, let's skip that command; the others are enough to estimate the raw CPU power.

I have run the time benchmark again on a few hosts. Instances on cloudvirt1005 are no longer affected, but the ones on cloudvirt1008 / cloudvirt1012 are. So that might "just" be CPU saturation or some contention when too many VMs are running. What puzzles me is that on the parent task T223971, the cloudvirt1005 that was showing slow CPU apparently had low load/CPU usage :-\

Results of different power regulator settings on labtestvirt2003.codfw.wmnet.

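| type | regulator | time test | sysbench |
| --- | --- | --- | --- |
| baremetal | min | 16.41 | 24.22 |
| kvm | min | 16.42 | 24.29 |
| baremetal | max | 6.73 | 10.00 |
| kvm | max | 6.80 | 10.01 |
| baremetal | *dynamic | 6.80 | 10.14 |
| kvm | *dynamic | 6.87 | 10.17 |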

( * dynamic is the default value.)

The high performance max profile is only slightly better than the default dynamic.
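To put numbers on that comparison, the ratios can be computed from the figures in the results above. This is just a small sketch; `ratio` is a hypothetical helper name, and the inputs are the baremetal values copied from the results.

```shell
# Hypothetical helper: print how many times slower the first value is
# than the second, rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 16.41 6.73    # time test, min vs max      -> 2.44
ratio 24.22 10.00   # sysbench, min vs max       -> 2.42
ratio 6.80 6.73     # time test, dynamic vs max  -> 1.01
```

So the min regulator setting is roughly 2.4 times slower than max, while dynamic is within about 1% of max on the time test.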

Closing this task. The default (dynamic) power regulator settings are not impacting the virtual machine performance.

Sorry, I have lost track of this task and the other one. At least we have some raw metrics that definitely show that setting the regulator to minimum causes the CPU to be way slower. There is another task about auditing the kernel CPU governor, T225713, which might be related.
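For the kernel side, a quick way to see which governor each CPU is using is to read the Linux cpufreq files under sysfs. The snippet below is a rough sketch and not taken from T225713; `governor_summary` is a hypothetical helper name, and its optional argument (an alternative sysfs root) is only there to make it testable. On hosts or virtual machines without a cpufreq driver the files may be absent, in which case it prints nothing.

```shell
# Hypothetical helper: count how many CPUs use each cpufreq governor.
# Optional argument: alternative root directory (defaults to the real sysfs path).
governor_summary() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip silently when the glob did not match or the file is unreadable.
        [ -r "$f" ] && cat "$f"
    done | sort | uniq -c
}

governor_summary
```

Each output line is a count followed by a governor name, so a host with mixed governors across its CPUs shows up as more than one line.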

Thank you @JHedden !