
Old cloudvirt (with Intel Xeon) are half the speed of newer ones (Intel Sky Lake)
Closed, Declined · Public

Description

The job mediawiki-core-code-coverage-docker usually takes between 2h and 2h30, based on https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-docker/buildTimeTrend

On integration-slave-docker-1054 it usually takes less than two hours. The job times out after four hours on 1055 and 1056 :(

After some investigation below, the slow instances turn out to be on cloudvirt1005 / cloudvirt1006, which have an Intel Xeon E3-12xx v2 (Ivy Bridge) at 2.7 GHz. The fastest builds are on instances backed by a Skylake processor at 2.3 GHz.

I can imagine a newer architecture offers improvements, but for a single-threaded CPU-bound task doing simple maths, I would expect the Ivy Bridge Xeon at 2.7 GHz to be faster than the Skylake one at 2.3 GHz.

I tried a couple small CPU benchmarks which I ran on instances:

  • Compute 10k prime numbers using 64-bit integers with the sysbench package, running: sysbench --test=cpu run
  • Do some basic math with stress-ng, running: stress-ng --cpu 1 --cpu-ops=400000
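For convenience, the two commands above can be wrapped in a small script that skips any tool that is not installed and reports wall-clock seconds. This is just a sketch I put together around the exact commands quoted above, not an official benchmark harness:

```shell
#!/bin/sh
# Run both micro-benchmarks back to back and report wall-clock seconds.
# Tools that are not installed are skipped, so this can be pasted anywhere.
for cmd in 'sysbench --test=cpu run' 'stress-ng --cpu 1 --cpu-ops=400000'; do
  tool=${cmd%% *}
  if command -v "$tool" >/dev/null 2>&1; then
    echo "== $cmd =="
    start=$(date +%s)
    $cmd || echo "($tool exited non-zero)"
    echo "== took $(( $(date +%s) - start ))s"
  else
    echo "skipping: $tool is not installed"
  fi
done
```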

Results:

| Model | Intel Xeon E3-12xx v2 (Ivy Bridge) | Intel Core Processor (Skylake) |
| cpu MHz | 2.7 GHz | 2.3 GHz |
| bogoMips | 5390 | 4590 |
| sysbench duration | ~16 seconds | ~9 seconds |
| stress-ng | ~18 seconds | ~10 seconds |

Rearranged, with max turbo:

| Host | Model | Base speed | Max turbo | bogoMips | sysbench |
| cloudvirt1006 | Xeon E5-2697 v2 | 2,700 MHz | 3,500 MHz | 5,386 | 15.13s |
| cloudvirt1025 | Xeon Gold 6140 | 2,300 MHz | 3,700 MHz | 4,590 | 9.24s |
| cobalt | Xeon E5-2623 v3 | 3,000 MHz | 3,500 MHz | 6,000 | 9.30s |
| contint1001 | Xeon E5-2640 v3 | 2,600 MHz | 3,400 MHz | 5,200 | 9.37s |
| @hashar | i7-8550U | 1,800 MHz | 4,000 MHz | 4,000 | 7.43s |
| @hashar #2 | i5-4250U | 1,300 MHz | 2,600 MHz | 3,800 | 11.8s |

Note how, despite the bogoMips and CPU speed being higher on the Xeon Ivy Bridge, it performs at half the speed.

I really don't get why the Intel Xeon is so slow :-/ Maybe it is an oddity due to KVM or a BIOS / hardware configuration issue. One would have to run the same benchmarks on the real servers for comparison.

Event Timeline

https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-docker/4262/ started at May 21, 2019 3:00:00 AM and took 4 hours

It ran on integration-slave-docker-1056 which is on cloudvirt1005.eqiad.wmnet

Via https://grafana-labs.wikimedia.org/dashboard/db/cloud-vps-project-board , I can see the instance had 12.5% CPU usage for the duration of the job, which corresponds to a single CPU being used at 100% for the whole duration of the job. There is no steal CPU indicated.
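The 12.5% figure is consistent with one vCPU pegged at 100% on an 8-vCPU instance (the flavor size of 8 vCPUs is an assumption on my part):

```shell
# One busy core out of N vCPUs shows up as 100/N percent overall usage.
# 8 vCPUs is assumed here; adjust to the actual instance flavor.
awk 'BEGIN { printf "%.1f%%\n", 100 / 8 }'
```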

There is no indication of CPU/io saturation on cloudvirt1005.


https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-docker/4263/ started at May 21, 2019 7:10:43 AM and took two and a half hours.

It ran on integration-slave-docker-1043 which is on cloudvirt1022.eqiad.wmnet

Everything seems fine there as well, but it is magically faster :-(

Which leaves the CPU as the thing that would be slower on 1055 and 1056?

hashar@integration-cumin:~$ sudo cumin --trace --force 'name:docker' 'egrep "(model)" /proc/cpuinfo|sort|uniq'
===== NODE GROUP =====                                                                                                                                                                                               
(8) integration-slave-docker-[1021,1048-1054].integration.eqiad.wmflabs                                                                                                                                              
model           : 94                                                                                                                                                                                                 
model name      : Intel Core Processor (Skylake)
===== NODE GROUP =====                                                                                                                                                                                               
(2) integration-slave-docker-[1055-1056].integration.eqiad.wmflabs                                                                                                                                                   
model           : 58                                                                                                                                                                                                 
model name      : Intel Xeon E3-12xx v2 (Ivy Bridge)
===== NODE GROUP =====                                                                                                                                                                                               
(4) integration-slave-docker-[1034,1040-1041,1043].integration.eqiad.wmflabs                                                                                                                                         
model           : 61                                                                                                                                                                                                 
model name      : Intel Core Processor (Broadwell)

And looking at the build times:

| model | model name | time | cloudvirt |
| 94 | Intel Core Processor (Skylake) | 1h50m | 1023 1025 1026 1027 1028 1029 |
| 58 | Intel Xeon E3-12xx v2 (Ivy Bridge) | 4h00 (build times out) | 1005 1006 |
| 61 | Intel Core Processor (Broadwell) | 2h30m | 1016 1017 1022 |

A few weeks ago, I deleted integration-slave-docker-1037 since it was notoriously slow when running the job wmf-quibble-vendor-mysql-hhvm-docker (T222023). But I do not know on which cloudvirt it happened to be scheduled :-(

Looking at https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/buildTimeTrend the build takes roughly 13 - 18 minutes. But those scheduled on 1055 or 1056 take 25-30 minutes.

Mentioned in SAL (#wikimedia-releng) [2019-05-21T11:23:18Z] <hashar> Depooling integration-slave-docker-1055 and integration-slave-docker-1056 : CPU is too slow # T223971

I tried a small CPU benchmark using the sysbench package, running sysbench --test=cpu run (it computes 10k prime numbers using 64-bit integers), and stress-ng with 400k operations (stress-ng --cpu 1 --cpu-ops=400000).

| Model | Intel Xeon E3-12xx v2 (Ivy Bridge) | Intel Core Processor (Skylake) |
| cpu MHz | 2.7 GHz | 2.3 GHz |
| bogoMips | 5390 | 4590 |
| sysbench duration | ~16 seconds | ~9 seconds |
| stress-ng | ~18 seconds | ~10 seconds |

I really don't get why the Intel Xeon is so slow :-/ Maybe it is an oddity due to kvm/qemu. One would have to run the same benchmarks on the real servers instead.

hashar renamed this task from Investigate slow down of mediawiki-core-code-coverage-docker on some Jenkins instances to Old cloudvirt (with Intel Xeon) are twice slower than new ones (Intel Sky Lake). May 21 2019, 12:59 PM
hashar updated the task description.

Today I have noticed that instances scheduled on our oldest cloudvirt machines are most probably slower than expected. They use Intel Xeon E3-12xx v2 (Ivy Bridge) and I suspect there is either a BIOS option missing or KVM is not finely tuned.

When an instance is scheduled on a recent cloudvirt machine, tasks end up running twice as fast, which doesn't make sense since those CPUs are not twice as fast.

Years ago, when I first migrated Jenkins jobs to WMCS, I did notice the jobs took noticeably longer to run. I just assumed at the time it was some openstack/kvm/whatever overhead. So potentially the CPU slowdown has been there forever.

I guess what would help is to run the same benchmark directly on the cloudvirt hosts:

apt -y install sysbench
sysbench --test=cpu run | grep 'total time'

And report the results of a few runs. Maybe the Intel Xeon really is way slower than a Sky Lake processor, but that would really surprise me. Candidate cloudvirt/CPU pairs would be:

| Cloud virt | CPU | Instance |
| cloudvirt1006.eqiad.wmnet | Intel Xeon | integration-slave-docker-1055 |
| cloudvirt1025.eqiad.wmnet | Sky Lake | integration-slave-docker-1054 |

If running the benchmark command on the host is faster on cloudvirt1006, that would probably mean we have a configuration issue in kvm/libvirt etc. Else it would mean the Intel Xeon is slower despite having more bogoMips ?!

I am also interested in the exact specs of the processor. They are not fully exposed to the guest VM.

@hashar what is your hoped for outcome here? I'm not sure I understand why you are confused that 5 year old servers (cloudvirt1006) are slower than servers purchased in the last year (cloudvirt1025).

I wonder if there should be two separate sets of flavours, one for each type of host. Probably wouldn't want an instance set up on one type migrated to the other. It sounds like right now if you see docs/examples that say a particular flavour should be used (perhaps on the basis of VCPUs), it's useless due to it actually coming down to the luck of what host you get scheduled on?

> I wonder if there should be two separate sets of flavours, one for each type of host. Probably wouldn't want an instance set up on one type migrated to the other. It sounds like right now if you see docs/examples that say a particular flavour should be used (perhaps on the basis of VCPUs), it's useless due to it actually coming down to the luck of what host you get scheduled on?

A while back Andrew added some logic to OpenStack so that newly created instances are created on the least loaded compute nodes.

OpenStack seems to have a way to partition hosts into aggregates, each aggregate carrying a specific property (eg: ssd=true). A new flavor can then be created that will only be scheduled on an aggregate carrying that property (eg: an m1.large-ssd flavor). https://docs.openstack.org/nova/rocky/admin/configuration/schedulers.html#host-aggregates But that is arguably a lot of configuration tweaking and would put more burden on dispatching the VMs across servers.
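A rough sketch of what that host-aggregate setup would look like with the standard openstack CLI (the aggregate name, the fastcpu property, and the flavor name/sizes below are invented for illustration):

```shell
# Create an aggregate for fast hosts and tag it with a custom property.
openstack aggregate create --property fastcpu=true fast-cpu
openstack aggregate add host fast-cpu cloudvirt1025

# Create a flavor that the AggregateInstanceExtraSpecsFilter will only
# schedule on hosts belonging to an aggregate with that property.
openstack flavor create --vcpus 4 --ram 8192 --disk 80 m1.large-fastcpu
openstack flavor set \
  --property aggregate_instance_extra_specs:fastcpu=true m1.large-fastcpu
```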

> @hashar what is your hoped for outcome here? I'm not sure I understand why you are confused that 5 year old servers (cloudvirt1006) are slower than servers purchased in the last year (cloudvirt1025).

My issue is that when doing simple maths (purely CPU bound) on a single thread, I would expect the old CPU to be faster since it runs at 2.7GHz while the newer one runs at 2.3GHz. That is a very naive way to look at it, since it ignores a lot of how CPUs behave nowadays compared to 30 years ago. One sure thing though: for a single-threaded load, I would not expect the older CPU to be half the speed.
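To put numbers on that expectation (clock speeds and sysbench times taken from the table earlier in this task):

```shell
# If performance tracked clock speed alone, the 2.7 GHz Ivy Bridge should
# be ~17% faster than the 2.3 GHz Skylake; sysbench shows it ~78% slower.
awk 'BEGIN {
  printf "clock ratio (old/new): %.2f\n", 2.7 / 2.3
  printf "sysbench time ratio (old/new): %.2f\n", 16 / 9
}'
```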

The sysbench CPU test is very straightforward; it iterates from 3 to 10000 and does:

for (c = 3; c < max_prime; c++)
{
  t = sqrt(c);
  for (l = 2; l <= t; l++)
    if (c % l == 0)
      break;
  if (l > t)
    n++;
}
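The same trial-division loop, reproduced in awk for a quick local sanity check (a stand-in of mine, not sysbench itself, so the timings are not comparable):

```shell
# Count primes in [3, max) by trial division, mirroring the C loop above.
count_primes() {
  awk -v max="$1" 'BEGIN {
    n = 0
    for (c = 3; c < max; c++) {
      t = int(sqrt(c))
      for (l = 2; l <= t; l++)
        if (c % l == 0)
          break
      if (l > t)
        n++
    }
    print n
  }'
}
count_primes 10000
```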

T223971#5200409 shows the above code takes 16 seconds on the old CPU versus 9 seconds on a newer one, hence my utter confusion about the raw CPU performance.

If I look up Intel Xeon E5-2697 v2 @ 2.70GHz (old) versus Intel Xeon Gold 6140 @ 2.30GHz (new) in a benchmark database, they end up with roughly the same single thread rating: https://www.cpubenchmark.net/compare/Intel-Xeon-E5-2697-v2-vs-Intel-Xeon-Gold-6140/2009vs3132 (respectively scores of 1732 and 1736).

Thus my expected outcome would be for sysbench --test=cpu run to perform roughly the same on both CPUs, not show a two-fold difference.

Would it be possible to run the benchmark directly on the servers to rule out kvm/qemu? A oneliner would be:

apt -y install sysbench && sysbench --test=cpu run && apt -y purge sysbench

On cloudvirt1006.eqiad.wmnet and cloudvirt1025.eqiad.wmnet - or, if you feel brave, on all cloudvirt hosts by using cumin :-]
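Something like this could run it fleet-wide through cumin (an untested sketch of mine; the host query and the DRY_RUN guard are made up for illustration):

```shell
# Run the sysbench one-liner on a set of hosts via cumin.
# DRY_RUN=1 only prints the command, so this sketch is safe to try first.
bench_hosts() {
  query="$1"
  cmd="apt -y install sysbench && sysbench --test=cpu run && apt -y purge sysbench"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "sudo cumin '$query' \"$cmd\""
  else
    sudo cumin "$query" "$cmd"
  fi
}

DRY_RUN=1 bench_hosts 'cloudvirt1*'
```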

If the benchmark is way faster on cloudvirt1006.eqiad.wmnet than on the instance it hosts (integration-slave-docker-1055), that would indicate a potential issue with kvm/qemu/libvirt etc. Else I would blame some BIOS settings, but at that point I would be willing to give up.

Here is the result:

root@cloudvirt1006:~# sysbench --test=cpu run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 10000


Test execution summary:
    total time:                          15.1276s
    total number of events:              10000
    total time taken by event execution: 15.1261
    per-request statistics:
         min:                                  1.19ms
         avg:                                  1.51ms
         max:                                  8.33ms
         approx.  95 percentile:               2.44ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   15.1261/0.00
root@cloudvirt1025:~# sysbench --test=cpu run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 10000


Test execution summary:
    total time:                          9.2490s
    total number of events:              10000
    total time taken by event execution: 9.2481
    per-request statistics:
         min:                                  0.78ms
         avg:                                  0.92ms
         max:                                  5.63ms
         approx.  95 percentile:               1.01ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   9.2481/0.00

Note that kernel versions are a bit different:

root@cloudvirt1025:~# uname -a
Linux cloudvirt1025 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
root@cloudvirt1006:/home/aborrero# uname -a
Linux cloudvirt1006 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64 GNU/Linux

Thanks! So the benchmarks on the real hardware look similar to what I get on the instances. I guess that rules out kvm/qemu/labvirt etc.

Ran it on a couple of production hosts and two of my machines at home:

| Host | Model | Base speed | Max turbo | bogoMips | sysbench |
| cloudvirt1006 | Xeon E5-2697 v2 | 2,700 MHz | 3,500 MHz | 5,386 | 15.13s |
| cloudvirt1025 | Xeon Gold 6140 | 2,300 MHz | 3,700 MHz | 4,590 | 9.24s |
| cobalt | Xeon E5-2623 v3 | 3,000 MHz | 3,500 MHz | 6,000 | 9.30s |
| contint1001 | Xeon E5-2640 v3 | 2,600 MHz | 3,400 MHz | 5,200 | 9.37s |
| @hashar | i7-8550U | 1,800 MHz | 4,000 MHz | 4,000 | 7.43s |
| @hashar #2 | i5-4250U | 1,300 MHz | 2,600 MHz | 3,800 | 11.8s |

Eventually I found a MediaWiki application server with a CPU older than the one on cloudvirt1006: mw2139.codfw.wmnet has a Xeon CPU E5-2450 0 based on Sandy Bridge.

Comparison:

| Host | mw2139 | cloudvirt1006 |
| CPU | E5-2450 | E5-2697 v2 |
| Speed | 2.10 GHz | 2.70 GHz |
| Turbo | 2.9 GHz | 3.5 GHz |
| Time | 10s | 17s |

On the benchmark page, the single thread rating for the mw2139 CPU is 1074 while the cloudvirt1006 one scores 1732 (higher is better).

contint1001 has a more recent Xeon E5-2640 v3 and runs the busy loop in 8.5s

So I am not sure what is happening on the old cloudvirt servers, but given their CPU they should perform better than the old mw2139.codfw.wmnet, not worse.
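Putting numbers on that mismatch (rating figures from the benchmark database quoted above, times from the comparison table):

```shell
# cpubenchmark.net single-thread ratings predict cloudvirt1006 ~1.6x faster
# than mw2139, yet the measured busy-loop time is ~1.7x slower.
awk 'BEGIN {
  printf "predicted speedup (1732/1074): %.2f\n", 1732 / 1074
  printf "observed slowdown (17s/10s):   %.2f\n", 17 / 10
}'
```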

Mentioned in SAL (#wikimedia-releng) [2019-06-03T15:57:44Z] <hashar> Deleting integration-slave-docker-1055 and integration-slave-docker-1056 . CPU is way too slow T223971

I have deleted the affected instances and created two new ones, hoping for them to be scheduled on cloudvirts not cursed with slow CPUs. No luck: I got integration-slave-docker-1058 on cloudvirt1004 and integration-slave-docker-1059 on cloudvirt1005. Both are slow :-\

Could I get them moved to some faster cloudvirt servers please? Seems they would fit on cloudvirt1012 and later ids.

The old cloudvirts are apparently HP ProLiant DL380p Gen8 (https://wikitech.wikimedia.org/wiki/HP_DL380p). HP has a built-in power management system and I have found a few reports that it might cause the CPU to be slower than expected:

https://v-strange.de/index.php/19-hp-hardware/200-hp-power-management-better-switch-it-off
https://helgeklein.com/blog/2013/05/the-effects-of-power-savings-mode-on-vcpu-performance/

The HP doc on power management ( https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c03031625&docLocale=en_US ) recommends setting it to "high performance".

For Gen9, our install doc states to disable the power management. In the BIOS:

  • Select Service Options
  • Set Processor Power Monitoring and choose Disabled
  • Press enter; ignore the warning message regarding modification by pressing enter again. Select Disabled and press enter again.

https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_DL3N0_Gen9#Setting_proper_power_option

So, very naively, I am wondering if it could just be a matter of tweaking/disabling the HP power management in the BIOS and leaving its management to the OS.

Mentioned in SAL (#wikimedia-cloud) [2019-06-04T08:56:52Z] <arturo> reallocating integration-slave-docker-1059 and integration-slave-docker-1058 to cloudvirt1012 (T223971)

So on cloudvirt1012 the instances show up in cpuinfo as: Intel Core Processor (Haswell, no TSX). They are now even slower than they were on cloudvirt1004 or cloudvirt1005.

> Mentioned in SAL (#wikimedia-cloud) [2019-06-04T08:56:52Z] <arturo> reallocating integration-slave-docker-1059 and integration-slave-docker-1058 to cloudvirt1012 (T223971)

Sorry, I messed up. I would need those instances on another, later cloudvirt since cloudvirt1012 has slow CPUs as well :-\

hashar triaged this task as High priority. Jun 5 2019, 7:32 AM

Maybe it can be reproduced on the test machine labtestvirt2003.codfw.wmnet, which is an HP as well, although it is Gen9 (ProLiant DL360 Gen9). It has an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz. Being a test machine, I would guess it is easier to try different power management settings there to confirm or refute my theory.

Also raising priority because that prevents us from reinstalling instances :\

Mentioned in SAL (#wikimedia-cloud) [2019-06-05T08:56:52Z] <arturo> move integration-slave-docker-1059 and integration-slave-docker-1058 to cloudvirt1028 (T223971)

Could this difference be caused by the mitigations for those design flaws called Spectre & Meltdown?

Result of testing BIOS settings on labtestvirt2003.codfw.wmnet, which is a ProLiant DL360 Gen9 with an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz. Running:

  • time $(i=1; while (( i < 2000000 )); do (( i ++ )); done)
  • sysbench --test=cpu run
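The first command works because bash times the command substitution as part of the (empty) pipeline; a portable equivalent of that busy loop, sketched for clarity, is:

```shell
# Busy-loop ~2 million increments and report elapsed wall-clock seconds;
# a portable rewrite of the `time $( ... )` one-liner above.
start=$(date +%s)
i=1
while [ "$i" -lt 2000000 ]; do i=$((i + 1)); done
echo "busy loop took $(( $(date +%s) - start ))s (i=$i)"
```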

Results of different power regulator settings on labtestvirt2003.codfw.wmnet.

| environment | regulator | time test | sysbench |
| baremetal | min | 16.41 | 24.22 |
| kvm | min | 16.42 | 24.29 |
| baremetal | max | 6.73 | 10.00 |
| kvm | max | 6.80 | 10.01 |
| baremetal | dynamic* | 6.80 | 10.14 |
| kvm | dynamic* | 6.87 | 10.17 |

( * dynamic is the default value.)

The high performance max profile is only slightly better than the default dynamic.

There is no meaningful difference between running on bare metal or inside kvm. Changing the BIOS CPU regulator to minimum does dramatically affect performance though.
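The size of that effect, computed from the bare-metal rows of the table above:

```shell
# Speedup from switching the power regulator from "min" to "max".
awk 'BEGIN {
  printf "time test: %.2fx\n", 16.41 / 6.73
  printf "sysbench:  %.2fx\n", 24.22 / 10.00
}'
```

That ~2.4x gap is in the same ballpark as the difference observed between the old and new cloudvirts.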

I have also noticed T225713: CPU scaling governor audit which is about auditing the CPU governor on our machines and might shed some light on this trouble. I would probably write some script to collect metrics over a long run and see whether we can identify a pattern.

And from the list of benchmarks, the cloudvirt1006 machine performs way worse than my old/cheap Intel NUC at home:

| Host | Model | Base speed | Max turbo | bogoMips | sysbench |
| cloudvirt1006 | Xeon E5-2697 v2 | 2,700 MHz | 3,500 MHz | 5,386 | 15.13s |
| @hashar #2 | i5-4250U | 1,300 MHz | 2,600 MHz | 3,800 | 11.8s |

https://www.cpu-monkey.com/en/compare_cpu-intel_xeon_e5_2697_v2-86-vs-intel_core_i5_4250u-3 . Though cloudvirt1006 is on Ivy Bridge and my machine is on Haswell, it is really unclear why it would be THAT much slower :-\ Guess we can wait for the outcome of the CPU scaling audit from T225713.

Isn't sysbench that tool for measuring high load on MySQL? I bought my box in May 2009 and sysbench (version 1.0.11) says "total time: 10.0012s" and "execution time (avg/stddev): 9.9984" (2 Quad-Core AMD Opteron(tm) Processor 2382 at 2613 MHz, 5230.44 bogomips).

Seems that I have no need for new hardware …

re: T225713: CPU scaling governor audit — what was uncovered is that, at least on HP boxes, when "power control" is set to anything other than "os control", pcc-cpufreq is loaded as the scaling driver; however that driver doesn't really scale with >4 CPUs. This is currently the case on cloudvirt1006:

cloudvirt1006:~$ cat /sys/devices/system/cpu/cpufreq/policy0/scaling_driver 
pcc-cpufreq
cloudvirt1006:~$ cat /sys/devices/system/cpu/cpufreq/policy0/scaling_governor 
ondemand

Whereas when "os control" is set, the intel_pstate scaling driver is loaded and things should be significantly better. In the other task we're not at the "change bios settings" stage yet, but we'll get there soon; at any rate, let's coordinate on this!
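The per-host check above can be generalised into a quick audit of every cpufreq policy (a small loop I sketched; it exits cleanly on hosts or containers without cpufreq sysfs):

```shell
# List the scaling driver and governor for every cpufreq policy.
for p in /sys/devices/system/cpu/cpufreq/policy*; do
  [ -d "$p" ] || continue
  printf '%s: %s / %s\n' "${p##*/}" \
    "$(cat "$p/scaling_driver")" "$(cat "$p/scaling_governor")"
done
```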

Mentioned in SAL (#wikimedia-releng) [2019-09-25T19:36:27Z] <hashar> Deleting integration-agent-docker-1007 it is too slow ( T223971 )

Mentioned in SAL (#wikimedia-releng) [2019-09-27T20:01:13Z] <hashar> Marking integration-agent-docker-1015 offline due to cloudvirt1004 being wayyyyy too slow T223971

Bstorm lowered the priority of this task from High to Medium. Feb 25 2020, 5:25 PM
Reedy renamed this task from Old cloudvirt (with Intel Xeon) are twice slower than new ones (Intel Sky Lake) to Old cloudvirt (with Intel Xeon) are half the speed of newer ones (Intel Sky Lake). Feb 25 2020, 5:26 PM

This is likely to be fixed by the introduction of a number of things. One is the change to the performance governor, but in testing @JHedden found that in many cases resource contention is a stronger determinant of such an issue. We are also going to be able to do better load balancing soon with Ceph.

> This is likely to be fixed with the introduction of a number of things. One is to change to the performance governor, but in testing @JHedden found that in many cases resource contention is a stronger determinant of such an issue. We also are going to be able to do better load balancing soon with ceph.

I understand contention can be an issue. Though in this case there are strong indications that the issue is with the underlying hardware and/or T225713.

The raw CPU performance is worse than that of my old Intel NUC at home, even though those old cloudvirts have processors that largely outperform my machine. At least on paper.

We're in the process of replacing most of these hosts. Everything is slowed down by COVID but at least we have some orders in.

That most probably comes from the CPU scaling BIOS setting described at T225713. And since the hosts are being replaced, there is no incentive to get this fixed, so declining.