
Investigate Docker slowness between 18.06.2 and 18.09.7
Closed, Resolved · Public

Description

Update April 7th 2020: the performance regression seems to come from runc using seccomp (https://docs.docker.com/engine/security/seccomp/).

The CI job mwcore-phpunit-coverage-master generates PHPUnit code coverage for mediawiki/core. It runs twice per day at 3am and 3pm UTC.

Since we have migrated to WMCS Stretch instances, the job routinely times out once the 5 hour limit is reached. The build is done using a Docker container which has not otherwise changed, so something seems off on the Stretch VMs compared to the Jessie ones.

The root cause is the Docker upgrade that happened at the same time:

- 18.06.2~ce~3-0~debian
+ 5:18.09.7~3-0~debian-stretch

It has been rolled back on October 30th 2019.

T236881 determined that Docker 18.09.7 is slowed down when the kernel CPU mitigation spec_store_bypass_disable=on is present. Turning it off restores performance. 18.06.2 is not affected.
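
A quick way to check whether that mitigation is active on a given instance (a sketch; /proc/cmdline and the sysfs vulnerabilities file are standard kernel interfaces, nothing fleet-specific is assumed):

# Kernel command line: spec_store_bypass_disable=on would show up here
cat /proc/cmdline
# Current state of the Speculative Store Bypass mitigation
cat /sys/devices/system/cpu/vulnerabilities/spec_store_bypass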

Parent task T232706 has various timings for that specific job. Example for builds that happened on the same host (cloudvirt1028); the job is consistently faster on the Jessie instance.

On a Jessie instance it usually takes less than 3 hours:

Success  #245  2 hr 58 min  integration-agent-jessie-docker-1001
Success  #242  2 hr 43 min  integration-agent-jessie-docker-1001
Success  #234  2 hr 40 min  integration-agent-jessie-docker-1001
Success  #223  2 hr 50 min  integration-agent-jessie-docker-1001
Success  #210  2 hr 48 min  integration-agent-jessie-docker-1001
Success  #209  4 hr 3 min   integration-agent-jessie-docker-1001

On a Stretch instance it consistently takes more than 3 hours and 30 minutes:

Success  #255  3 hr 57 min  integration-agent-docker-1008
Success  #252  4 hr 4 min   integration-agent-docker-1005
Success  #250  4 hr 11 min  integration-agent-docker-1005
Success  #246  3 hr 55 min  integration-agent-docker-1005
Success  #239  3 hr 45 min  integration-agent-docker-1005
Success  #233  3 hr 39 min  integration-agent-docker-1008
Success  #225  4 hr 1 min   integration-agent-docker-1008
Success  #214  4 hr 14 min  integration-agent-docker-1005

Repro

To reproduce it, I run the CI releng/tox container against the integration/config repository. It is intensive enough to show the difference yet fast enough for quick benchmarks.

A) set it up:

install -d -m777 cache src
sudo docker run --rm -it \
    --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master -e ZUUL_REF=master \
    docker-registry.wikimedia.org/releng/tox:0.4.0 --notest

B) run:

time sudo docker run --rm -it \
    --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master -e ZUUL_REF=master \
    -e TOX_TESTENV_PASSENV='*' \
    --entrypoint=tox \
    docker-registry.wikimedia.org/releng/tox:0.4.0
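
C) optionally, compare against a run with seccomp filtering disabled (a sketch based on the findings below; it is the same command as B with --security-opt seccomp=unconfined added):

time sudo docker run --rm -it --security-opt seccomp=unconfined \
    --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master -e ZUUL_REF=master \
    -e TOX_TESTENV_PASSENV='*' \
    --entrypoint=tox \
    docker-registry.wikimedia.org/releng/tox:0.4.0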

Event Timeline

Another job that shows the difference is the job running tox for integration/config.git https://integration.wikimedia.org/ci/job/integration-config-tox-docker/

It takes ~90 seconds on Stretch instances but ~70 seconds on Jessie, i.e. ~30% slower.
The MediaWiki coverage job takes roughly 4 hours on Stretch but 2h50 on Jessie, i.e. ~40% slower.

We should be able to benchmark using the following sequence:

Prepare cache for pip using tox --notest

install -d -m777 cache src
docker run --rm -it \
    --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master -e ZUUL_REF=master \
    docker-registry.wikimedia.org/releng/tox:0.4.0 --notest

Time tox:

time docker run --rm -it \
    --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master -e ZUUL_REF=master \
    docker-registry.wikimedia.org/releng/tox:0.4.0

@hashar you could try setting these hiera keys in your project to switch your Stretch instances to using sssd for NSS LDAP operations, which we found to be faster in Toolforge:

sudo_flavor: sudo
profile::ldap::client::labs::client_stack: sssd

These are the default for Buster instances, but still opt-in for Stretch. The NSS integration might make a difference if anything in the tests is doing a lot of stat() calls or other lookups that would fetch /etc/{passwd,group} information from the LDAP backend.
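
A quick way to see which NSS backend an instance is actually using (a sketch; both are standard checks, nothing project-specific is assumed):

# Which backends resolve users and groups (look for sss vs ldap)
grep -E '^(passwd|group):' /etc/nsswitch.conf
# Which LDAP client daemon is running, if any
systemctl status sssd nslcd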

Ah, thank you @bd808. Although the operations are done in a Docker container, maybe they still end up going through nslcd. Will give that a try and compare.

Also I thought I had the same Docker version on all instances, but I was wrong:

Jessie   18.06.2-ce
Stretch  18.09.7

Which might make a difference.
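
To confirm which versions an instance actually runs (a sketch using standard Docker and dpkg commands):

# Version reported by the Docker daemon
sudo docker version --format '{{.Server.Version}}'
# Installed Docker-related packages
dpkg -l 'docker*' 'containerd*' | grep '^ii'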

Mentioned in SAL (#wikimedia-releng) [2019-10-29T16:31:09Z] <hashar> integration-agent-docker-1005 change sudo and ldap client (to sssd) # T236675

I have switched integration-agent-docker-1005 to sssd and will be able to check whether it performs differently than integration-agent-docker-1008 (Stretch, same Docker, same cloudvirt).

sssd does not make any noticeable difference: integration-agent-docker-1005 and integration-agent-docker-1008 perform the same.

I then wanted to rule out Docker entirely since Jessie and Stretch instances have different versions. I have used:

git clone https://gerrit.wikimedia.org/r/integration/config.git
cd config
tox -e jenkins-jobs -- test jjb -o foo
OS       Duration
Stretch  66 s
Jessie   55 s
hashar changed the task status from Open to Stalled. Oct 30 2019, 9:39 AM
hashar triaged this task as Medium priority.

I have found the issue. Let's follow up in the private task T236881.

Mentioned in SAL (#wikimedia-releng) [2019-10-30T20:44:18Z] <hashar> integration-agent-docker-1005 downgrade docker to docker-ce 18.06.2~ce~3-0~debian , the one we had on Jessie # T236675

hashar changed the task status from Stalled to Open. Oct 30 2019, 8:54 PM

On the Stretch instance, I have manually installed Docker 18.06.2 using the package we had on Jessie. It is not affected by the slowdown.

So the issue lies somewhere between that version and the Docker 18.09.7~3-0 package we are using on Stretch.

Conclusion: roll back the fleet of Stretch instances to 18.06.2 and investigate afterwards.

Change 547313 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: downgrade docker on Stretch to match Jessie

https://gerrit.wikimedia.org/r/547313

hashar renamed this task from Investigate Stretch VM being slower than Jessie VM to Investigate Docker slowness between 18.06.2 and 18.09.7. Oct 31 2019, 9:20 AM
hashar updated the task description.

I guess I can do some profiling using perf record for both cases and hope that perf diff shows something standing out. Otherwise I would attempt to bisect the affected Docker versions. 18.06 -> 18.09 is a major change though: 18.06 had the Docker daemon spawning the containers, while 18.09 delegates that to containerd.io, which is managed by systemd.
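
A rough sketch of that profiling approach (the recording window and the output file names are arbitrary choices):

# While the benchmark runs under Docker 18.06, record a system-wide profile
sudo perf record -a -g -o perf-18.06.data -- sleep 120
# Repeat with 18.09 installed, then compare the two recordings
sudo perf record -a -g -o perf-18.09.data -- sleep 120
sudo perf diff perf-18.06.data perf-18.09.data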

Versions:

Docker    containerd
19.03.3   1.2.10
19.03     1.2.6
18.09.4   1.2.5
18.09.1   1.2.2

containerd changelog https://github.com/containerd/containerd/releases

https://github.com/containerd/containerd/releases/tag/v1.2.10
https://github.com/containerd/containerd/releases/tag/v1.2.9
https://github.com/containerd/containerd/releases/tag/v1.2.8
https://github.com/containerd/containerd/releases/tag/v1.2.7
https://github.com/containerd/containerd/releases/tag/v1.2.6
https://github.com/containerd/containerd/releases/tag/v1.2.5
https://github.com/containerd/containerd/releases/tag/v1.2.4
https://github.com/containerd/containerd/releases/tag/v1.2.3
https://github.com/containerd/containerd/releases/tag/v1.2.2
https://github.com/containerd/containerd/releases/tag/v1.2.1
https://github.com/containerd/containerd/releases/tag/v1.2.0

Or that is somewhere in runc ... :-\ That is not going to be fun.

Packages for Stretch ( stretch-wikimedia/thirdparty/ci ) have been updated for Jenkins ( T239586#5711373 ) and that also updated docker-ce and containerd.io:

docker-ce      5:19.03.5
containerd.io  1.2.10-3

Maybe that fixes the performance issue we had with 5:18.09.7~3-0~debian-stretch.

Change 547313 merged by Jcrespo:
[operations/puppet@production] contint: downgrade docker on Stretch to match Jessie

https://gerrit.wikimedia.org/r/547313

On integration-agent-docker-1005, which has docker-ce 18.09.0~3-0~debian-stretch, I have:

  • moved runc to runc-back
  • copied the docker-runc binary from 18.06.2 as runc-new and symlinked runc to it (roughly as sketched below)
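
A sketch of that swap (the /usr/bin location and the source path of the 18.06.2 binary are assumptions):

cd /usr/bin
sudo mv runc runc-back
sudo cp /path/to/docker-18.06.2/docker-runc runc-new   # hypothetical source path
sudo ln -sf runc-new runc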

I ran a container and it complained about lack of seccomp support (https://en.wikipedia.org/wiki/Seccomp):

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "seccomp: config provided but seccomp not supported": unknown.

I have passed --privileged to docker run to skip setting up seccomp entirely and work around that error. And sure enough, my benchmark is suddenly fast again!

runc version 1.0.0-rc5+dev.docker-18.06
commit: a592beb5bc4c4092b1b1bac971afed27687340c5
spec: 1.0.0

I have changed the symlink back to the runc from Docker 18.09.0:

runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev

Ran with --privileged and my benchmark is fast again.

So whatever seccomp does happens to slow down the containers dramatically, at least on Stretch.
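
A way to double-check whether seccomp filtering is actually applied inside a container (a sketch using a stock Debian image; Seccomp: 2 means filter mode, 0 means disabled):

# With the default profile, the field should read 2
sudo docker run --rm debian:stretch grep Seccomp /proc/self/status
# With seccomp disabled, it should read 0
sudo docker run --rm --security-opt seccomp=unconfined debian:stretch grep Seccomp /proc/self/status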

On my Buster system at home the difference is even more dramatic:

docker 18.09.7~3
containerd.io 1.2.10-3

tests  total
23.115s  26.410s

With --privileged or --security-opt seccomp=unconfined:

tests  total
13.001s  15.557s

Docker's default seccomp profile has a whitelist of syscalls: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json

I tried a seccomp profile which allows all syscalls by default:

allow.json
{
	"defaultAction": "SCMP_ACT_ALLOW",
	"architectures": [
		"SCMP_ARCH_X86_64",
		"SCMP_ARCH_X86",
		"SCMP_ARCH_X32"
	],
	"syscalls": [
	]
}

It can then be used with: docker run --security-opt seccomp=allow.json. That is still performance-crippled locally (kernel 5.4.0, libseccomp2 2.3.3-4). Tried with both the latest upstream Docker and the one from Buster.

Change 587304 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Disable seccomp in Docker

https://gerrit.wikimedia.org/r/587304

I asked on the internal ops list about how seccomp can introduce such a performance penalty.

Mentioned in SAL (#wikimedia-releng) [2020-04-08T16:00:40Z] <James_F> Updating wikipeg-npm-php-docker with seccomp over-ride for final T236675 testing

Mentioned in SAL (#wikimedia-releng) [2020-04-08T16:01:24Z] <James_F> Updating composer-package-php72-docker with seccomp over-ride for final T236675 testing

Mentioned in SAL (#wikimedia-releng) [2020-04-08T16:04:27Z] <James_F> Updating all docker JJB jobs with seccomp over-ride for T236675

Change 587304 merged by jenkins-bot:
[integration/config@master] Disable seccomp in Docker

https://gerrit.wikimedia.org/r/587304

Change 593806 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[operations/puppet@production] contint: On stretch, use the docker we have

https://gerrit.wikimedia.org/r/593806

Change 593806 merged by RLazarus:
[operations/puppet@production] contint: On stretch, use the docker we have

https://gerrit.wikimedia.org/r/593806