
Investigate Docker slowness between 18.06.2 and 18.09.7
Closed, Resolved · Public

Description

Update April 7th 2020: the performance regression seems to come from runc using seccomp (https://docs.docker.com/engine/security/seccomp/).

The CI job mwcore-phpunit-coverage-master generates PHPUnit code coverage for mediawiki/core. It runs twice per day at 3am and 3pm UTC.

Since we have migrated to WMCS Stretch instances, the job routinely times out once the 5 hour limit is reached. The build is done using a Docker container which has not otherwise changed, so something seems off on the Stretch VMs compared to the Jessie ones.

The root cause is the Docker upgrade that happened at the same time:

- 18.06.2~ce~3-0~debian
+ 5:18.09.7~3-0~debian-stretch

It has been rolled back on October 30th 2019.

T236881 determined that Docker 18.09.7 is slowed down when the kernel CPU mitigation spec_store_bypass_disable=on is present. Turning it off restores performance. 18.06.2 is not affected.
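
A quick way to check whether that mitigation is active on a given instance (a sketch; /proc/cmdline and the sysfs vulnerabilities file are standard kernel interfaces, nothing fleet-specific is assumed):

# Kernel command line: spec_store_bypass_disable=on would show up here
cat /proc/cmdline
# Current state of the Speculative Store Bypass mitigation
cat /sys/devices/system/cpu/vulnerabilities/spec_store_bypass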

Parent task T232706 has various timings for that specific job. Example for builds that happened on the same host (cloudvirt1028); the job is consistently faster on the Jessie instance.

On a Jessie instance it usually takes less than 3 hours:

Success  #245  2 hr 58 min  integration-agent-jessie-docker-1001
Success  #242  2 hr 43 min  integration-agent-jessie-docker-1001
Success  #234  2 hr 40 min  integration-agent-jessie-docker-1001
Success  #223  2 hr 50 min  integration-agent-jessie-docker-1001
Success  #210  2 hr 48 min  integration-agent-jessie-docker-1001
Success  #209  4 hr 3 min   integration-agent-jessie-docker-1001

On a Stretch instance it consistently takes more than 3 hours and 30 minutes:

Success  #255  3 hr 57 min  integration-agent-docker-1008
Success  #252  4 hr 4 min   integration-agent-docker-1005
Success  #250  4 hr 11 min  integration-agent-docker-1005
Success  #246  3 hr 55 min  integration-agent-docker-1005
Success  #239  3 hr 45 min  integration-agent-docker-1005
Success  #233  3 hr 39 min  integration-agent-docker-1008
Success  #225  4 hr 1 min   integration-agent-docker-1008
Success  #214  4 hr 14 min  integration-agent-docker-1005

Repro

To reproduce it, I run the CI releng/tox container against the integration/config repository. It is intensive enough to show the difference yet fast enough for quick benchmarks.

A) set it up:

install -d -m777 cache src
sudo docker run --rm -it \
    --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master -e ZUUL_REF=master \
    docker-registry.wikimedia.org/releng/tox:0.4.0 --notest

B) run:

time sudo docker run --rm -it \
    --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master -e ZUUL_REF=master \
    -e TOX_TESTENV_PASSENV='*' \
    --entrypoint=tox \
    docker-registry.wikimedia.org/releng/tox:0.4.0
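
C) optionally, compare against a run with seccomp filtering disabled (a sketch based on the findings below; it is the same command as B with --security-opt seccomp=unconfined added):

time sudo docker run --rm -it --security-opt seccomp=unconfined \
    --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master -e ZUUL_REF=master \
    -e TOX_TESTENV_PASSENV='*' \
    --entrypoint=tox \
    docker-registry.wikimedia.org/releng/tox:0.4.0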

Event Timeline

Another job that shows the difference is the job running tox for integration/config.git https://integration.wikimedia.org/ci/job/integration-config-tox-docker/

It takes ~90 seconds on Stretch instances but ~70 seconds on Jessie, i.e. ~30% slower.
The MediaWiki coverage job takes roughly 4 hours on Stretch but 2h50 on Jessie, i.e. ~40% slower.

We should be able to benchmark using the following sequence:

Prepare cache for pip using tox --notest

install -d -m777 cache src
docker run --rm -it \
    --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master -e ZUUL_REF=master \
    docker-registry.wikimedia.org/releng/tox:0.4.0 --notest

Time tox:

time docker run --rm -it \
    --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master -e ZUUL_REF=master \
    docker-registry.wikimedia.org/releng/tox:0.4.0

@hashar you could try setting these hiera keys in your project to switch your Stretch instances to using sssd for NSS LDAP operations, which we found to be faster in Toolforge:

sudo_flavor: sudo
profile::ldap::client::labs::client_stack: sssd

These are the default for Buster instances, but still opt-in for Stretch. The NSS integration might make a difference if anything in the tests is doing a lot of stat() calls or other lookups that would fetch /etc/{passwd,group} information from the LDAP backend.
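
A quick way to see which NSS backend an instance is actually using (a sketch; both are standard checks, nothing project-specific is assumed):

# Which backends resolve users and groups (look for sss vs ldap)
grep -E '^(passwd|group):' /etc/nsswitch.conf
# Which LDAP client daemon is running, if any
systemctl status sssd nslcd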

Ah, thank you @bd808. Although the operations are done in a Docker container, maybe they still end up going through nslcd. Will give that a try and compare.

Also I thought I had the same Docker version on all instances, but I was wrong:

Jessie   18.06.2-ce
Stretch  18.09.7

Which might make a difference.
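
To confirm which versions an instance actually runs (a sketch using standard Docker and dpkg commands):

# Version reported by the Docker daemon
sudo docker version --format '{{.Server.Version}}'
# Installed Docker-related packages
dpkg -l 'docker*' 'containerd*' | grep '^ii'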

Mentioned in SAL (#wikimedia-releng) [2019-10-29T16:31:09Z] <hashar> integration-agent-docker-1005 change sudo and ldap client (to sssd) # T236675

I have switched integration-agent-docker-1005 to sssd and will be able to check whether it performs differently than integration-agent-docker-1008 (Stretch, same Docker, same cloudvirt).

sssd does not make any noticeable difference: integration-agent-docker-1005 and integration-agent-docker-1008 perform the same.

I then wanted to rule out Docker entirely since Jessie and Stretch instances have different versions. I have used:

git clone https://gerrit.wikimedia.org/r/integration/config.git
cd config
tox -e jenkins-jobs -- test jjb -o foo
OS       Duration
Stretch  66 s
Jessie   55 s
hashar changed the task status from Open to Stalled. Oct 30 2019, 9:39 AM
hashar triaged this task as Medium priority.

I have found the issue. Let's follow up in the private task T236881.

Mentioned in SAL (#wikimedia-releng) [2019-10-30T20:44:18Z] <hashar> integration-agent-docker-1005 downgrade docker to docker-ce 18.06.2~ce~3-0~debian , the one we had on Jessie # T236675

hashar changed the task status from Stalled to Open. Oct 30 2019, 8:54 PM

On the Stretch instance, I have manually installed Docker 18.06.2 using the package we had on Jessie. It is not affected by the slowdown.

So the issue lies somewhere between that version and the Docker 18.09.7~3-0 package we are using on Stretch.

Conclusion: roll back the fleet of Stretch instances to 18.06.2 and investigate afterwards.

Change 547313 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: downgrade docker on Stretch to match Jessie

https://gerrit.wikimedia.org/r/547313

hashar renamed this task from Investigate Stretch VM being slower than Jessie VM to Investigate Docker slowness between 18.06.2 and 18.09.7. Oct 31 2019, 9:20 AM
hashar updated the task description.

I guess I can do some profiling using perf record for both cases and hope that perf diff shows something standing out. Otherwise I would attempt to bisect the affected Docker versions. 18.06 -> 18.09 is a major change though: 18.06 had the Docker daemon spawning the containers, while 18.09 delegates that to containerd.io, which is managed by systemd.
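
A rough sketch of that profiling approach (the recording window and the output file names are arbitrary choices):

# While the benchmark runs under Docker 18.06, record a system-wide profile
sudo perf record -a -g -o perf-18.06.data -- sleep 120
# Repeat with 18.09 installed, then compare the two recordings
sudo perf record -a -g -o perf-18.09.data -- sleep 120
sudo perf diff perf-18.06.data perf-18.09.data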

Versions:

Docker    containerd
19.03.3   1.2.10
19.03     1.2.6
18.09.4   1.2.5
18.09.1   1.2.2

containerd changelog https://github.com/containerd/containerd/releases

https://github.com/containerd/containerd/releases/tag/v1.2.10
https://github.com/containerd/containerd/releases/tag/v1.2.9
https://github.com/containerd/containerd/releases/tag/v1.2.8
https://github.com/containerd/containerd/releases/tag/v1.2.7
https://github.com/containerd/containerd/releases/tag/v1.2.6
https://github.com/containerd/containerd/releases/tag/v1.2.5
https://github.com/containerd/containerd/releases/tag/v1.2.4
https://github.com/containerd/containerd/releases/tag/v1.2.3
https://github.com/containerd/containerd/releases/tag/v1.2.2
https://github.com/containerd/containerd/releases/tag/v1.2.1
https://github.com/containerd/containerd/releases/tag/v1.2.0

Or that is somewhere in runc ... :-\ That is not going to be fun.

Packages for Stretch ( stretch-wikimedia/thirdparty/ci ) have been updated for Jenkins ( T239586#5711373 ) and that also updated docker-ce and containerd.io:

docker-ce      5:19.03.5
containerd.io  1.2.10-3

Maybe that fixes the performance issue we had with 5:18.09.7~3-0~debian-stretch.

Change 547313 merged by Jcrespo:
[operations/puppet@production] contint: downgrade docker on Stretch to match Jessie

https://gerrit.wikimedia.org/r/547313

On integration-agent-docker-1005, which has docker-ce 18.09.0~3-0~debian-stretch, I have:

  • moved runc to runc-back
  • copied the docker-runc binary from 18.06.2 as runc-new and symlinked runc to it (roughly as sketched below)
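
A sketch of that swap (the /usr/bin location and the source path of the 18.06.2 binary are assumptions):

cd /usr/bin
sudo mv runc runc-back
sudo cp /path/to/docker-18.06.2/docker-runc runc-new   # hypothetical source path
sudo ln -sf runc-new runc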

I ran a container and it complained about lack of seccomp support (https://en.wikipedia.org/wiki/Seccomp):

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "seccomp: config provided but seccomp not supported": unknown.

I have passed --privileged to docker run to skip setting up seccomp entirely and work around that error. And sure enough, my benchmark is suddenly fast again!

runc version 1.0.0-rc5+dev.docker-18.06
commit: a592beb5bc4c4092b1b1bac971afed27687340c5
spec: 1.0.0

I have changed the symlink back to the runc from Docker 18.09.0:

runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev

Ran with --privileged and my benchmark is fast again.

So whatever seccomp does happens to slow down the containers dramatically, at least on Stretch.
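
A way to double-check whether seccomp filtering is actually applied inside a container (a sketch using a stock Debian image; Seccomp: 2 means filter mode, 0 means disabled):

# With the default profile, the field should read 2
sudo docker run --rm debian:stretch grep Seccomp /proc/self/status
# With seccomp disabled, it should read 0
sudo docker run --rm --security-opt seccomp=unconfined debian:stretch grep Seccomp /proc/self/status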

On my Buster system at home the difference is even more dramatic:

docker 18.09.7~3
containerd.io 1.2.10-3

tests  total
23.115s  26.410s

With --privileged or --security-opt seccomp=unconfined:

tests  total
13.001s  15.557s

Docker's default seccomp profile has a whitelist of syscalls: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json

I tried a seccomp profile which allows all syscalls by default:

allow.json
{
	"defaultAction": "SCMP_ACT_ALLOW",
	"architectures": [
		"SCMP_ARCH_X86_64",
		"SCMP_ARCH_X86",
		"SCMP_ARCH_X32"
	],
	"syscalls": [
	]
}

It can then be used with: docker run --security-opt seccomp=allow.json. That is still performance-crippled locally (kernel 5.4.0, libseccomp2 2.3.3-4). Tried with both the latest upstream Docker and the one from Buster.

Change 587304 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Disable seccomp in Docker

https://gerrit.wikimedia.org/r/587304

I asked on the internal ops list about how seccomp can introduce such a performance penalty.

Mentioned in SAL (#wikimedia-releng) [2020-04-08T16:00:40Z] <James_F> Updating wikipeg-npm-php-docker with seccomp over-ride for final T236675 testing

Mentioned in SAL (#wikimedia-releng) [2020-04-08T16:01:24Z] <James_F> Updating composer-package-php72-docker with seccomp over-ride for final T236675 testing

Mentioned in SAL (#wikimedia-releng) [2020-04-08T16:04:27Z] <James_F> Updating all docker JJB jobs with seccomp over-ride for T236675

Change 587304 merged by jenkins-bot:
[integration/config@master] Disable seccomp in Docker

https://gerrit.wikimedia.org/r/587304

Change 593806 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[operations/puppet@production] contint: On stretch, use the docker we have

https://gerrit.wikimedia.org/r/593806

Change 593806 merged by RLazarus:
[operations/puppet@production] contint: On stretch, use the docker we have

https://gerrit.wikimedia.org/r/593806