Update April 7th 2020: the performance regression seems to come from runc using seccomp, see https://docs.docker.com/engine/security/seccomp/
The CI job mwcore-phpunit-coverage-master generates PHPUnit code coverage for mediawiki/core. It runs twice per day at 3am and 3pm UTC.
Since we migrated to WMCS Stretch instances, the job routinely times out when the 5 hour limit is reached. The build runs in a Docker container, which has not otherwise changed, so something seems off on the Stretch VMs compared to the Jessie ones.
The root cause is the Docker upgrade that happened at the same time:
- 18.06.2~ce~3-0~debian
+ 5:18.09.7~3-0~debian-stretch
It has been rolled back on October 30th 2019.
T236881 determined that Docker 18.09.7 is slowed down when the kernel CPU mitigation spec_store_bypass_disable=on is set. Turning it off restores performance. 18.06.2 is not affected.
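To see whether a given host has the mitigation active, something like the following could be used. It is only a sketch: it assumes a kernel that exposes the standard sysfs vulnerabilities directory, and the ssb_status helper name is made up for illustration.

```shell
# Print the kernel's reported Speculative Store Bypass status.
# The path argument is optional (hypothetical helper, overridable for testing);
# the default is the standard sysfs location on recent kernels.
ssb_status() {
    ssb_file="${1:-/sys/devices/system/cpu/vulnerabilities/spec_store_bypass}"
    if [ -r "$ssb_file" ]; then
        cat "$ssb_file"
    else
        echo "unknown (no $ssb_file on this kernel)"
    fi
}

ssb_status

# Show whether the option was passed explicitly on the kernel command line.
grep -o 'spec_store_bypass_disable=[a-z]*' /proc/cmdline 2>/dev/null \
    || echo "spec_store_bypass_disable not set on the kernel command line"
```

Comparing that output between a Jessie and a Stretch instance should tell whether both run with the same mitigation setting.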
Parent task T232706 has various timings for that specific job, including examples of builds that happened on the same host (cloudvirt1028).
The job is consistently faster on the Jessie instance (which is on cloudvirt1028).
On a Jessie instance it usually takes less than 3 hours:
Success | #245 | 2 hr 58 min | integration-agent-jessie-docker-1001 |
Success | #242 | 2 hr 43 min | integration-agent-jessie-docker-1001 |
Success | #234 | 2 hr 40 min | integration-agent-jessie-docker-1001 |
Success | #223 | 2 hr 50 min | integration-agent-jessie-docker-1001 |
Success | #210 | 2 hr 48 min | integration-agent-jessie-docker-1001 |
Success | #209 | 4 hr 3 min | integration-agent-jessie-docker-1001 |
On a Stretch instance it consistently takes more than 3 hours and 30 minutes:
Success | #255 | 3 hr 57 min | integration-agent-docker-1008 |
Success | #252 | 4 hr 4 min | integration-agent-docker-1005 |
Success | #250 | 4 hr 11 min | integration-agent-docker-1005 |
Success | #246 | 3 hr 55 min | integration-agent-docker-1005 |
Success | #239 | 3 hr 45 min | integration-agent-docker-1005 |
Success | #233 | 3 hr 39 min | integration-agent-docker-1008 |
Success | #225 | 4 hr 1 min | integration-agent-docker-1008 |
Success | #214 | 4 hr 14 min | integration-agent-docker-1005 |
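To put a number on the gap, the build durations listed above can be averaged with a bit of awk. This is a rough sketch: the mean_minutes helper is made up for illustration and assumes rows in exactly the "Status | #build | H hr M min | host |" shape shown in this task.

```shell
# Read table rows on stdin and print the mean duration in whole minutes.
mean_minutes() {
    awk -F'|' 'NF >= 3 {
        split($3, t, " ")            # " 2 hr 58 min " -> t[1]=2, t[3]=58
        sum += t[1] * 60 + t[3]
        n++
    }
    END { if (n) printf "%d\n", sum / n }'
}

echo "Jessie mean (minutes):"
mean_minutes <<'EOF'
Success | #245 | 2 hr 58 min | integration-agent-jessie-docker-1001 |
Success | #242 | 2 hr 43 min | integration-agent-jessie-docker-1001 |
Success | #234 | 2 hr 40 min | integration-agent-jessie-docker-1001 |
Success | #223 | 2 hr 50 min | integration-agent-jessie-docker-1001 |
Success | #210 | 2 hr 48 min | integration-agent-jessie-docker-1001 |
Success | #209 | 4 hr 3 min | integration-agent-jessie-docker-1001 |
EOF

echo "Stretch mean (minutes):"
mean_minutes <<'EOF'
Success | #255 | 3 hr 57 min | integration-agent-docker-1008 |
Success | #252 | 4 hr 4 min | integration-agent-docker-1005 |
Success | #250 | 4 hr 11 min | integration-agent-docker-1005 |
Success | #246 | 3 hr 55 min | integration-agent-docker-1005 |
Success | #239 | 3 hr 45 min | integration-agent-docker-1005 |
Success | #233 | 3 hr 39 min | integration-agent-docker-1008 |
Success | #225 | 4 hr 1 min | integration-agent-docker-1008 |
Success | #214 | 4 hr 14 min | integration-agent-docker-1005 |
EOF
```

On these rows the means come out at roughly 180 minutes on Jessie (including the 4 hr 3 min outlier #209) and 238 minutes on Stretch, so close to an hour slower per build.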
Repro
To reproduce it, I run the CI releng/tox container against the integration/config repository. It is intensive and fast enough for quick benchmarks.
A) set it up:
install -d -m777 cache src
sudo docker run --rm -it \
    --volume "$(pwd)/cache:/cache" \
    --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master \
    -e ZUUL_REF=master \
    docker-registry.wikimedia.org/releng/tox:0.4.0 --notest
B) run:
time sudo docker run --rm -it \
    --volume "$(pwd)/cache:/cache" \
    --volume "$(pwd)/src:/src" \
    -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
    -e ZUUL_PROJECT=integration/config \
    -e ZUUL_BRANCH=master \
    -e ZUUL_REF=master \
    -e TOX_TESTENV_PASSENV='*' \
    --entrypoint=tox \
    docker-registry.wikimedia.org/releng/tox:0.4.0
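Given the update at the top of this task pointing at seccomp, one way to compare could be to run step B with and without the default seccomp profile. The wrapper below is a sketch, not something that was run for this task: run_bench and the DOCKER override are made-up names, -it is dropped so it can run without a TTY, and the container's stdout is discarded so that only the elapsed seconds are printed. The --security-opt seccomp=unconfined option is the standard Docker way to disable the profile for one container.

```shell
# Command used to invoke docker; overridable, e.g. DOCKER=docker when sudo is not needed.
DOCKER="${DOCKER:-sudo docker}"

# Run step B once and print the elapsed wall clock time in seconds.
# Any arguments are passed through as extra `docker run` options.
run_bench() {
    start=$(date +%s)
    $DOCKER run --rm "$@" \
        --volume "$(pwd)/cache:/cache" \
        --volume "$(pwd)/src:/src" \
        -e ZUUL_URL=https://gerrit.wikimedia.org/r/ \
        -e ZUUL_PROJECT=integration/config \
        -e ZUUL_BRANCH=master \
        -e ZUUL_REF=master \
        -e TOX_TESTENV_PASSENV='*' \
        --entrypoint=tox \
        docker-registry.wikimedia.org/releng/tox:0.4.0 >/dev/null
    end=$(date +%s)
    echo "$((end - start))"
}

# Usage, after step A has populated cache/ and src/:
#   run_bench                                      # default seccomp profile
#   run_bench --security-opt seccomp=unconfined    # seccomp disabled
```

If seccomp is the culprit, the second run should be noticeably faster on an affected Docker version. Running each variant a few times helps smooth out noise from other activity on the host.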