
Investigate Docker slowness between 18.06.2 and 18.09.7
Open, Normal, Public

Description

The CI job mwcore-phpunit-coverage-master generates PHPUnit code coverage for mediawiki/core. It runs twice per day at 3am and 3pm UTC.

Since we migrated to WMCS Stretch instances, the job routinely times out after the 5 hour limit is reached. The build runs in a Docker container that has not otherwise changed, so something seems off on the Stretch VMs compared to the Jessie ones.

The root cause is the Docker upgrade that happened at the same time:

- 18.06.2~ce~3-0~debian
+ 5:18.09.7~3-0~debian-stretch

It has been rolled back on October 30th 2019.

T236881 determined that Docker 18.09.7 is slowed down when the kernel CPU mitigation spec_store_bypass_disable=on is present; turning it off restores performance. 18.06.2 is not affected.
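
To confirm whether the mitigation is active on a given instance, the kernel boot parameters and the sysfs status can be checked (a quick sketch; the exact output varies by kernel version):

grep -o 'spec_store_bypass_disable=[a-z]*' /proc/cmdline
cat /sys/devices/system/cpu/vulnerabilities/spec_store_bypass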

Parent task T232706 has various timings for that specific job. Below are example builds that ran on the same host (cloudvirt1028); the Jessie instance is consistently faster.

On a Jessie instance it usually takes less than 3 hours:

Status  | Build | Duration    | Agent
Success | #245  | 2 hr 58 min | integration-agent-jessie-docker-1001
Success | #242  | 2 hr 43 min | integration-agent-jessie-docker-1001
Success | #234  | 2 hr 40 min | integration-agent-jessie-docker-1001
Success | #223  | 2 hr 50 min | integration-agent-jessie-docker-1001
Success | #210  | 2 hr 48 min | integration-agent-jessie-docker-1001
Success | #209  | 4 hr 3 min  | integration-agent-jessie-docker-1001

On a Stretch instance it consistently takes more than 3 hours and 30 minutes:

Status  | Build | Duration    | Agent
Success | #255  | 3 hr 57 min | integration-agent-docker-1008
Success | #252  | 4 hr 4 min  | integration-agent-docker-1005
Success | #250  | 4 hr 11 min | integration-agent-docker-1005
Success | #246  | 3 hr 55 min | integration-agent-docker-1005
Success | #239  | 3 hr 45 min | integration-agent-docker-1005
Success | #233  | 3 hr 39 min | integration-agent-docker-1008
Success | #225  | 4 hr 1 min  | integration-agent-docker-1008
Success | #214  | 4 hr 14 min | integration-agent-docker-1005

Event Timeline

hashar added a comment. Edited Mon, Oct 28, 11:36 AM

Another job that shows the difference is the job running tox for integration/config.git https://integration.wikimedia.org/ci/job/integration-config-tox-docker/

It takes ~90 seconds on Stretch instances but ~70 seconds on Jessie, i.e. ~30% slower.
The MediaWiki coverage job takes roughly 4 hours on Stretch but 2h50 on Jessie, i.e. ~40% slower.

We should be able to benchmark using the following sequence:

Prepare cache for pip using tox --notest

install -d -m777 cache src
docker run --rm -it --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" -e ZUUL_URL=https://gerrit.wikimedia.org/r/ -e ZUUL_PROJECT=integration/config -e ZUUL_BRANCH=master -e ZUUL_REF=master docker-registry.wikimedia.org/releng/tox:0.4.0 --notest

Time tox:

time docker run --rm -it --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" -e ZUUL_URL=https://gerrit.wikimedia.org/r/ -e ZUUL_PROJECT=integration/config -e ZUUL_BRANCH=master -e ZUUL_REF=master docker-registry.wikimedia.org/releng/tox:0.4.0
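
As a side note, a minimal wrapper (not part of the original benchmark; the log file name is made up) could repeat the timed run a few times per agent, so Jessie and Stretch can be compared on averages rather than a single run:

for i in 1 2 3; do
  start=$(date +%s)
  docker run --rm --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" -e ZUUL_URL=https://gerrit.wikimedia.org/r/ -e ZUUL_PROJECT=integration/config -e ZUUL_BRANCH=master -e ZUUL_REF=master docker-registry.wikimedia.org/releng/tox:0.4.0 > /dev/null
  echo "run $i: $(( $(date +%s) - start )) s" | tee -a "tox-timings-$(hostname).log"
done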

@hashar you could try setting these Hiera keys in your project to switch your Stretch instances to using sssd for NSS LDAP operations, which we found to be faster in Toolforge:

sudo_flavor: sudo
profile::ldap::client::labs::client_stack: sssd

These are the defaults for Buster instances, but still opt-in for Stretch. The NSS integration might make a difference if anything in the tests does a lot of stat() calls or other lookups that would fetch /etc/{passwd,group} information from the LDAP backend.
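
(A quick way to verify which NSS backend an instance actually ended up with, as a side note, is to check nsswitch.conf and time a lookup that has to hit the LDAP backend:

grep -E '^(passwd|group):' /etc/nsswitch.conf
time getent passwd > /dev/null
)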

Ah, thank you @bd808. Though the operations are done in a Docker container, maybe that still ends up using nslcd. Will give that a try and compare.

Also, I thought I had the same Docker version on all instances, but I was wrong:

OS      | Docker
Jessie  | 18.06.2-ce
Stretch | 18.09.7

Which might make a difference.
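
As a side note, the server version on each agent can be confirmed with the standard Docker and apt tooling:

docker version --format 'Server: {{.Server.Version}}'
apt-cache policy docker-ce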

Mentioned in SAL (#wikimedia-releng) [2019-10-29T16:31:09Z] <hashar> integration-agent-docker-1005 change sudo and ldap client (to sssd) # T236675

hashar added a comment. Edited Tue, Oct 29, 4:34 PM

I have switched integration-agent-docker-1005 to sssd and will be able to check whether it performs differently than integration-agent-docker-1008 (Stretch, same Docker, same cloudvirt).

sssd does not make any noticeable difference: integration-agent-docker-1005 and integration-agent-docker-1008 perform the same.

I then wanted to rule out Docker entirely since Jessie and Stretch instances have different versions. I have used:

git clone https://gerrit.wikimedia.org/r/integration/config.git
cd config
tox -e jenkins-jobs -- test jjb -o foo
OS      | Duration
Stretch | 66 s
Jessie  | 55 s
hashar changed the task status from Open to Stalled. Wed, Oct 30, 9:39 AM
hashar triaged this task as Normal priority.

I have found the issue. Let's follow up in the private task T236881.

Mentioned in SAL (#wikimedia-releng) [2019-10-30T20:44:18Z] <hashar> integration-agent-docker-1005 downgrade docker to docker-ce 18.06.2~ce~3-0~debian , the one we had on Jessie # T236675

hashar changed the task status from Stalled to Open. Wed, Oct 30, 8:54 PM

On the Stretch instance, I have manually installed Docker 18.06.2 using the package we had on Jessie. It is not affected by the slowdown.

So the issue lies somewhere between that version and the Docker 18.09.7~3-0 package we are using on Stretch.

Conclusion: roll back the fleet of Stretch instances to 18.06.2 and investigate afterwards.
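
A sketch of the manual downgrade on a Stretch agent, assuming the 18.06.2 package used on Jessie is available locally (the .deb filename is an assumption):

# install the older docker-ce and keep apt from upgrading it again
sudo dpkg -i docker-ce_18.06.2~ce~3-0~debian_amd64.deb
sudo apt-mark hold docker-ce
sudo systemctl restart docker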

Change 547313 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: downgrade docker on Stretch to match Jessie

https://gerrit.wikimedia.org/r/547313

hashar renamed this task from "Investigate Stretch VM being slower than Jessie VM" to "Investigate Docker slowness between 18.06.2 and 18.09.7". Thu, Oct 31, 9:20 AM
hashar updated the task description.
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Currently being worked on a bit -> tagging to this month.

hashar added a comment. Wed, Nov 6, 2:02 PM

I guess I can do some profiling using perf record for both cases and hope that perf diff shows something standing out. Otherwise I would attempt to bisect the affected Docker versions. That said, 18.06 -> 18.09 is a major change: 18.06 had the Docker daemon spawning the containers itself, while 18.09 delegates that to containerd.io, which is managed by systemd.
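
A possible profiling sequence (a sketch; the output file names are placeholders and the workload is the tox benchmark from above, assuming the cache/ and src/ directories are prepared):

ZUUL_ENV="-e ZUUL_URL=https://gerrit.wikimedia.org/r/ -e ZUUL_PROJECT=integration/config -e ZUUL_BRANCH=master -e ZUUL_REF=master"

# on the Docker 18.06.2 host: system-wide profile while the container runs
sudo perf record -a -g -o perf-18.06.data -- docker run --rm --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" $ZUUL_ENV docker-registry.wikimedia.org/releng/tox:0.4.0 --notest

# on the Docker 18.09.7 host: same workload, separate output file
sudo perf record -a -g -o perf-18.09.data -- docker run --rm --volume "$(pwd)/cache:/cache" --volume "$(pwd)/src:/src" $ZUUL_ENV docker-registry.wikimedia.org/releng/tox:0.4.0 --notest

# compare the two profiles to see which symbols changed cost
sudo perf diff perf-18.06.data perf-18.09.data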

Versions:

Docker  | containerd
19.03.3 | 1.2.10
19.03   | 1.2.6
18.09.4 | 1.2.5
18.09.1 | 1.2.2

containerd changelog https://github.com/containerd/containerd/releases

https://github.com/containerd/containerd/releases/tag/v1.2.10
https://github.com/containerd/containerd/releases/tag/v1.2.9
https://github.com/containerd/containerd/releases/tag/v1.2.8
https://github.com/containerd/containerd/releases/tag/v1.2.7
https://github.com/containerd/containerd/releases/tag/v1.2.6
https://github.com/containerd/containerd/releases/tag/v1.2.5
https://github.com/containerd/containerd/releases/tag/v1.2.4
https://github.com/containerd/containerd/releases/tag/v1.2.3
https://github.com/containerd/containerd/releases/tag/v1.2.2
https://github.com/containerd/containerd/releases/tag/v1.2.1
https://github.com/containerd/containerd/releases/tag/v1.2.0

Or maybe it is somewhere in runc... :-\ That is not going to be fun.