
Kernel CPU mitigations on WMCS Stretch instance cause a 20% slowdown
Closed, Declined, Public

Description

Fork from T236675.

I have noticed a 20% increase in execution time when running a Docker container on a Stretch instance compared to a Jessie instance.

The slowdown disappears on Stretch when booting with the kernel option mitigations=off


  • The benchmark runs a Python script (jenkins-job-builder) for integration/config
  • The instances are on the same underlying machine (cloudvirt1028)
  • There is barely any load on the server or on the instances at the time of the test
  • Jessie has kernel 4.9.110-3+deb9u5~deb8u1
  • Stretch has kernel 4.9.189-3+deb9u1
  • I have ruled out nslcd vs sssd
  • It is not related to the Python version; it is the same in both containers
  • It is not related to glibc / libpthread, based on a comparison of perf reports
  • It is not related to the Docker version

My first intuition was the mitigations for Spectre, Meltdown and other CPU issues, but I dismissed that early and wasted time comparing other parameters.

Eventually, I rebooted a machine with the kernel boot option mitigations=off, based on the information at https://wiki.ubuntu.com/SecurityTeam/KnowledgeBase/MDS

Result on three instances all on cloudvirt1028:

Host                                  OS       Kernel   Kernel option    Duration
integration-agent-jessie-docker-1001  Jessie   4.9.110  defaults         55s (good)
integration-agent-jessie-docker-1001  Jessie   4.9.189  defaults         55s (good)
integration-agent-jessie-docker-1001  Jessie   4.9.189  mitigations=off  54s (good)
integration-agent-1008-docker         Stretch  4.9.189  defaults         1m6s
integration-agent-1005-docker         Stretch  4.9.189  defaults         1m7s
integration-agent-1005-docker         Stretch  4.9.189  mitigations=off  55s (good)
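
The 20% figure in the title can be checked against the durations above. The following is a minimal sketch; the helper names to_seconds and slowdown_percent are made up for illustration and are not part of the original benchmark.

```shell
# Convert a duration such as "1m6s" or "55s" to a number of seconds.
to_seconds() {
    echo "$1" | awk '{
        m = 0; s = $0
        if (index(s, "m") > 0) { split(s, p, "m"); m = p[1]; s = p[2] }
        sub(/s$/, "", s)
        print m * 60 + s
    }'
}

# Percentage increase of a slow duration ($2) relative to a baseline ($1),
# rounded to the nearest integer.
slowdown_percent() {
    base=$(to_seconds "$1")
    slow=$(to_seconds "$2")
    awk -v b="$base" -v s="$slow" 'BEGIN { printf "%.0f\n", (s - b) * 100 / b }'
}
```

With the values from the table, slowdown_percent 55s 1m6s prints 20 and slowdown_percent 55s 1m7s prints 22.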

Event Timeline

hashar created this task. Oct 30 2019, 9:39 AM
Restricted Application added a subscriber: Aklapper. Oct 30 2019, 9:40 AM
jbond added a subscriber: jbond. Oct 30 2019, 9:58 AM
hashar updated the task description. Oct 30 2019, 10:04 AM

With help from SRE people (especially Moritz), I ran spectre-meltdown-checker 0.42 from Stretch on both instances:

$ diff -U0 jessie stretch
--- jessie	2019-10-30 11:35:43.043507046 +0100
+++ stretch	2019-10-30 11:35:37.411478275 +0100
@@ -4 +4 @@
-Kernel is Linux 4.9.0-0.bpo.11-amd64 #1 SMP Debian 4.9.189-3+deb9u1~deb8u1 (2019-09-30) x86_64
+Kernel is Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20) x86_64
@@ -87 +87 @@
-* SSB mitigation currently active for selected processes: NO (no process found using SSB mitigation through prctl)
+* SSB mitigation currently active for selected processes: YES (systemd-journald systemd-logind systemd-timesyncd systemd-udevd)

Both report kernel 4.9.189-3+deb9u1, but Stretch has the SSB mitigation enabled for the systemd processes.
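
The per-process state the checker reports can also be read directly from the Speculation_Store_Bypass field of /proc/<pid>/status. The sketch below is an assumption about how one might list the affected processes by hand; the function name and the optional proc root argument (there only so the function can be tried on sample data) are hypothetical.

```shell
# List the names of processes whose SSB mitigation is enabled per thread
# ("thread mitigated" or "thread force mitigated" in /proc/<pid>/status).
# $1: proc root, defaults to /proc
ssb_mitigated_processes() {
    proc_root=${1:-/proc}
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # A process may exit while we read it, so errors are ignored.
        awk -F':[ \t]*' '
            $1 == "Name" { name = $2 }
            $1 == "Speculation_Store_Bypass" && $2 ~ /^thread (force )?mitigated$/ { print name }
        ' "$status" 2>/dev/null
    done | sort -u
}
```

Running ssb_mitigated_processes on each instance would give a list comparable to the one in the diff above. Note that the Name field is the kernel's truncated command name, so long names appear shortened.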

hashar added a comment. Edited Oct 30 2019, 12:13 PM

TLDR: The slowdown on Stretch goes away when setting spec_store_bypass_disable=off

Which is reported by spectre-meltdown-checker as:

CVE-2018-3639 aka 'Variant 4, speculative store bypass'
* Mitigated according to the /sys interface:  NO  (Vulnerable)
* Kernel supports disabling speculative store bypass (SSB):  YES  (found in /proc/self/status)
* SSB mitigation is enabled and active:  NO 
> STATUS:  VULNERABLE  (your CPU and kernel both support SSBD but the mitigation is not active)
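
The "/sys interface" mentioned in the checker output is the set of files under /sys/devices/system/cpu/vulnerabilities. As a rough sketch, assuming a kernel that exposes that directory, the state of every mitigation can be printed without the checker; the function name and the optional directory argument are hypothetical.

```shell
# Print "<vulnerability>: <status>" for each entry the kernel exposes.
# $1: directory to read, defaults to the sysfs path
show_cpu_vulnerabilities() {
    vuln_dir=${1:-/sys/devices/system/cpu/vulnerabilities}
    for f in "$vuln_dir"/*; do
        [ -f "$f" ] || continue
        printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

The spec_store_bypass entry is the one that corresponds to CVE-2018-3639.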

No idea why Jessie is not affected by the slowdown.


The kernel boot option mitigations=off sets several flags.

mitigations=off
        Disable all optional CPU mitigations.  This
        improves system performance, but it may also
        expose users to several CPU vulnerabilities.
        Equivalent to: nopti [X86]
                       nospectre_v1 [X86]
                       nospectre_v2 [X86]
                       spectre_v2_user=off [X86]
                       spec_store_bypass_disable=off [X86]
                       l1tf=off [X86]
                       mds=off [X86]

I set them one at a time in /etc/default/grub, then ran update-grub and rebooted. Ran the test, rinse and repeat.
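
One iteration of that loop could look like the sketch below. It assumes the option goes into GRUB_CMDLINE_LINUX_DEFAULT (the comment does not say which variable was edited), and add_kernel_param is a hypothetical helper name.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
# $1: parameter, for example spec_store_bypass_disable=off
# $2: grub defaults file, defaults to /etc/default/grub
add_kernel_param() {
    param=$1
    grub_file=${2:-/etc/default/grub}
    # Nothing to do if the parameter is already on the command line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$grub_file" | grep -qF -- "$param"; then
        return 0
    fi
    tmp=$(mktemp)
    # Rewrite the line with the parameter added before the closing quote,
    # then copy the result back so the file keeps its owner and mode.
    sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" "$grub_file" > "$tmp" &&
        cat "$tmp" > "$grub_file"
    rm -f "$tmp"
}
```

After calling it as root, for example add_kernel_param spec_store_bypass_disable=off, the remaining steps are update-grub and a reboot as described above. The previous option has to be removed by hand before testing the next one.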

Done on integration-agent-docker-1005 with kernel 4.9.189-3+deb9u1. Result:

Duration  Boot option                     Notes
1m8s      nopti
1m8s      nospectre_v1 nospectre_v2       v1 does not show in ps of init
1m8s      spectre_v2_user=off             no-op, does not show in ps of init
55s       spec_store_bypass_disable=off   CVE-2018-3639: KO
1m8s      l1tf=off
1m6s      mds=off

Moritz, are you aware of any difference between the Jessie and Stretch kernels?

On Jessie spec_store_bypass_disable on or off does not change anything.
On Stretch spec_store_bypass_disable=off makes it way faster.

There's no difference, the 4.9.189-3+deb9u1 and 4.9.189-3+deb9u1~deb8u1 are identical feature-wise.

hashar added a comment. Edited Oct 30 2019, 3:51 PM

That is what I have guessed looking at the source and /boot/config* files :-\

I just cannot figure out why spec_store_bypass_disable=off barely makes a difference on Jessie while it has a large impact on Stretch.

Could it be related to the gcc version used to build the kernel? Reviewing the Debian changelog, I noticed the Jessie kernel is not compiled with gcc 6, since that is not available in Jessie, and thus uses gcc 4.9.

From /proc/version

Debian   Kernel package            Gcc                         Gcc package
Jessie   4.9.189-3+deb9u1~deb8u1   gcc version 4.9.2           Debian 4.9.2-10+deb8u2
Stretch  4.9.189-3+deb9u1          gcc version 6.3.0 20170516  Debian 6.3.0-18+deb9u1
Stretch  4.19.67-2+deb10u1~bpo9+1  gcc version 6.3.0 20170516  Debian 6.3.0-18+deb9u1

(4.19 kernel on Stretch is affected as well)

Obviously the Jessie kernel is built with the Jessie GCC, but there's no toolchain change which would explain the difference. You can try installing the Jessie kernel on Stretch and vice versa for further tests, though.

Thanks!

So on a Stretch instance I grabbed the Jessie kernel and installed it:

wget http://security.debian.org/debian-security/pool/updates/main/l/linux-4.9/linux-image-4.9.0-0.bpo.11-amd64_4.9.189-3+deb9u1~deb8u1_amd64.deb
dpkg -i linux-image-4.9.0-0.bpo.11-amd64_4.9.189-3+deb9u1~deb8u1_amd64.deb

I looked at the menuentry lines in /boot/grub/grub.cfg to figure out which one to reboot on. It ended up being entry 4 of submenu 1, so:

sudo grub-reboot '1>4'
cat|sudo reboot
stretch$ uname -a
Linux integration-agent-docker-1005 4.9.0-0.bpo.11-amd64 #1 SMP Debian 4.9.189-3+deb9u1~deb8u1 (2019-09-30) x86_64 GNU/Linux

stretch$ cat /proc/version 
Linux version 4.9.0-0.bpo.11-amd64 (debian-kernel@lists.debian.org) (gcc version 4.9.2 (Debian 4.9.2-10+deb8u2) ) #1 SMP Debian 4.9.189-3+deb9u1~deb8u1 (2019-09-30)

So I have a gcc 4.9 compiled kernel on Stretch \o/
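
Finding the right selector for grub-reboot by reading grub.cfg is easy to get wrong, so here is a small sketch that prints one selector per entry. It assumes the layout update-grub generates on Debian, where top-level menuentry and submenu lines start at column 0 and the entries inside a submenu are indented; list_grub_entries is a hypothetical helper name.

```shell
# Print the grub-reboot selector and title of every entry in a grub.cfg.
# $1: path to grub.cfg, defaults to /boot/grub/grub.cfg
list_grub_entries() {
    awk '
        # The title is the text between the first pair of single quotes.
        function title(line,   parts) { split(line, parts, "\047"); return parts[2] }
        BEGIN               { top = 0 }
        /^menuentry /       { print top ": " title($0); top++ }
        /^submenu /         { sub_idx = top; child = 0
                              print top ": " title($0) " (submenu)"; top++ }
        /^[ \t]+menuentry / { print sub_idx ">" child ": " title($0); child++ }
    ' "${1:-/boot/grub/grub.cfg}"
}
```

A line starting with 1>4 in the output names the entry that grub-reboot '1>4' selects for the next boot.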

sudo spectre-meltdown-checker states everything is fine.

Ran my test again and it still takes 1m8s

Passed spec_store_bypass_disable=off, then ran sudo grub-reboot '1>4' && reboot. Confirmed the kernel uses 4.9 and spectre-meltdown-checker reports an issue.

Ran test again and it is down to 57s.

The Jessie kernel does not change anything compared to the Stretch ones (4.9/4.19). They all show the slowness with spec_store_bypass_disable=on, which disappears when it is set to off.

I guess that rules out the kernel / gcc version. The issue must be somewhere higher up in the stack, bah :-\

hashar closed this task as Declined. Oct 30 2019, 5:10 PM

I am dismissing this one for now. Setting spec_store_bypass_disable=off does make the Docker containers faster, but without a container the Stretch instances are still slower than Jessie regardless of the setting.

38s Jessie instance, jessie kernel
39s Jessie instance, jessie kernel, bypass=off
44s Stretch instance, jessie kernel, bypass=off
45s Stretch instance, jessie kernel
44s Stretch instance, stretch kernel
43s Stretch instance, stretch kernel (1008)

hashar added a comment. Edited Oct 30 2019, 8:50 PM

It is not the kernel, but Docker.

Somehow spec_store_bypass_disable=on and Docker 18.09.7~3-0~debian-stretch do not play well together.

Restricted Application added a project: acl*security. Oct 31 2019, 9:19 AM

Made it public, there is nothing secret.