
Terminating MySQL takes several minutes in (Wikibase?) CI jobs
Closed, Resolved · Public

Assigned To
Authored By
Lucas_Werkmeister_WMDE
Oct 15 2020, 2:20 PM
Referenced Files
F34743426: disk_write_dirty.png
Nov 12 2021, 3:54 PM
F32409500: server2.txt
Oct 20 2020, 8:50 PM
F32409499: server1.txt
Oct 20 2020, 8:50 PM
F32409489: B.txt
Oct 20 2020, 8:50 PM
F32409486: A.txt
Oct 20 2020, 8:50 PM

Description

Correction: terminating MySQL isn’t actually the slow part, it just happens to be the last message printed before the container exits, and it’s the container exit/cleanup that’s actually slow. T265615#6586508

Original task description:


I’ve noticed that the “Terminating MySQL” step at the end of a CI build can take several minutes. For example, in this recent job:

14:42:01 INFO:quibble.commands:Shutting down backends: <php DevWebServer http://127.0.0.1:9412 /workspace/src> <Xvfb :94> <ChromeWebDriver :94>
14:42:01 INFO:backend.ChromeWebDriver:Terminating ChromeWebDriver
14:42:01 INFO:backend.Xvfb:Terminating Xvfb
14:42:01 INFO:backend.DevWebServer:Terminating DevWebServer
14:42:01 INFO:quibble.commands:Shutting down backends: <mysql /workspace/db/quibble-mysql-ub2yb_5l/socket>
14:42:01 INFO:backend.MySQL:Terminating MySQL
14:46:16 [wmf-quibble-selenium-php72-docker@2] $ /bin/bash -xe /tmp/jenkins6976447998931934422.sh

I have no idea what Quibble or MySQL were doing there, but I doubt that it was really worth four minutes(!) of CI time.

So far, I’ve only noticed this in Wikibase jobs, but that might just be because I don’t often look at the CI output for other repositories.

Event Timeline

hashar added subscribers: kostajh, awight, hashar.

The relevant code:

quibble/backend.py
class BackendServer:
...
    def stop(self):
        if self.server is not None:
            self.log.info('Terminating %s', self.__class__.__name__)
            self.server.terminate()
            try:
                self.server.wait(2)
            except subprocess.TimeoutExpired:
                self.server.kill()  # SIGKILL
            finally:
                self.server = None

self.server is a Popen instance, so we basically send SIGTERM and busy-wait for up to two seconds, after which we should send SIGKILL. Clearly that does not work as expected :-\

The python documentation for Popen.wait() mentions it can deadlock when sending stdout/stderr to pipes:

Note

This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.

But we use /dev/null when we start the server:

self.server = subprocess.Popen([
    '/usr/sbin/mysqld',  # fixme drop path
    '--skip-networking',
    '--datadir=%s' % self.rootdir,
    '--log-error=%s' % self.errorlog,
    '--pid-file=%s' % self.pidfile,
    '--socket=%s' % self.socket,
    ],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

Some stuff we can try:

  • reproduce it obviously
  • check whether MariaDB version has changed in the CI image
  • add some logging in the stop command (because clearly subprocess.TimeoutExpired should be raised and cause a SIGKILL to be sent)
  • add some option to capture MySQL stdout/stderr (using stream_relay, which can consume a PIPE and relay it via logging; a rough sketch follows below).
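A rough sketch of that last idea, assuming a plain line-relaying helper rather than Quibble's actual stream_relay, with abridged mysqld arguments and an illustrative logger name: a daemon thread keeps draining the pipe so mysqld can never block on a full OS pipe buffer (the deadlock the Popen.wait() documentation warns about), and every line ends up in the normal logging output.

import logging
import subprocess
import threading

def relay(pipe, logger):
    # Forward every line written by the child process to our logging output.
    with pipe:
        for raw in iter(pipe.readline, b''):
            logger.info(raw.decode('utf-8', 'replace').rstrip())

log = logging.getLogger('backend.MySQL')
server = subprocess.Popen(
    ['/usr/sbin/mysqld', '--skip-networking'],  # abridged arguments
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
threading.Thread(target=relay, args=(server.stdout, log), daemon=True).start()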
hashar triaged this task as High priority. Oct 15 2020, 4:46 PM

Will look at it tomorrow and try to at least reproduce the issue.

Looking at some random builds, it usually takes 15–20 seconds to start the next command, which is a bug in itself:

00:17:39.940 INFO:backend.MySQL:Terminating MySQL
00:17:55.311 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins4972097457850427826.sh
00:19:01.909 INFO:backend.MySQL:Terminating MySQL
00:19:20.063 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins443040905686928605.sh
00:18:17.942 INFO:backend.MySQL:Terminating MySQL
00:18:34.254 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins7269376008034149723.sh

I have looked at the job duration trend on https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/buildTimeTrend and the job is consistently way too slow whenever it happens to run on integration-agent-docker-1020:

00:30:08.778 INFO:backend.MySQL:Terminating MySQL
00:31:21.208 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins8569331040755739778.sh
00:38:54.981 INFO:backend.MySQL:Terminating MySQL
00:42:18.062 [wmf-quibble-selenium-php72-docker@2] $ /bin/bash -xe /tmp/jenkins6862799909867122709.sh
00:39:41.829 INFO:backend.MySQL:Terminating MySQL
00:44:44.184 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins6896978991300300753.sh
00:31:37.581 INFO:backend.MySQL:Terminating MySQL
00:32:57.181 [wmf-quibble-selenium-php72-docker@2] $ /bin/bash -xe /tmp/jenkins6303269567405582114.sh
00:30:08.106 INFO:backend.MySQL:Terminating MySQL
00:31:22.239 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins6647671750866068876.sh

That agent runs on a WMCS host that uses Ceph as its file backend (T260916). That by itself does not explain the problem, since other agents are affected as well, though by seconds rather than minutes! I will depool it.

Mentioned in SAL (#wikimedia-releng) [2020-10-16T07:48:42Z] <hashar> Disabling integration-agent-docker-1020 (the sole agent using Ceph): it is too slow # T260916 T265615

Change 634465 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: wmf job with quibble 44

https://gerrit.wikimedia.org/r/634465

To rule out Quibble 0.0.45, I have created a dummy job using docker-registry.wikimedia.org/releng/quibble-stretch-php72:0.0.44-s5, and it is still affected:

00:21:32.612 INFO:backend.MySQL:Terminating MySQL
00:22:36.029 [wmf-quibble44-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins2652394246537911623.sh

So the logic is faulty somewhere, or Quibble is being held up by something else that does not show up in the logs.

CI for integration/quibble does have a job that runs it entirely against mediawiki/core. The last few builds more or less show the same issue:

00:11:36.278 INFO:backend.MySQL:Terminating MySQL
00:11:41.731 Archiving artifacts
00:11:09.274 INFO:backend.MySQL:Terminating MySQL
00:11:14.666 Archiving artifacts
00:13:52.197 INFO:backend.MySQL:Terminating MySQL
00:13:57.506 Archiving artifacts

That is, more than two seconds?

Change 634481 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/quibble@master] (DO NOT SUBMIT) test dummy full run

https://gerrit.wikimedia.org/r/634481

Gave it a try with a bit more logging. One run with a two-second delay, which is too short and triggers a SIGKILL:

00:22:05.191 INFO:quibble.commands:Shutting down backends: <mysql /tmp/quibble-mysql-avmk3d51/socket>
00:22:05.191 INFO:backend.MySQL:Stopping MySQL
00:22:05.192 INFO:backend.MySQL:Terminating MySQL
00:22:05.192 INFO:backend.MySQL:Waiting up to two seconds
00:22:07.184 INFO:backend.MySQL:Timeout sending SIGKILL
00:22:07.184 INFO:backend.MySQL:self.server = None
00:22:07.184 INFO:backend.MySQL:Completed stop()
00:22:07.184 INFO:quibble.cmd:Out of context
00:22:07.185 INFO:root:Finished Quibble run
00:22:11.905 Archiving artifacts

And after raising the SIGTERM wait timeout to five seconds:

00:10:34.044 INFO:quibble.commands:Shutting down backends: <mysql /tmp/quibble-mysql-hrv8i82m/socket>
00:10:34.044 INFO:backend.MySQL:Stopping MySQL
00:10:34.044 INFO:backend.MySQL:Terminating MySQL
00:10:34.044 INFO:backend.MySQL:Waiting up to five seconds
00:10:36.466 INFO:backend.MySQL:self.server = None
00:10:36.466 INFO:backend.MySQL:Completed stop()
00:10:36.466 INFO:quibble.cmd:Out of context
00:10:36.466 INFO:root:Finished Quibble run
00:10:39.529 Archiving artifacts

Nothing conclusive unfortunately :-\

I tried various combinations and I cannot reproduce it. Maybe it is a race condition between us sending SIGKILL (self.server.kill()) and Python reaping the process on garbage collection when we set self.server = None :-\

After many attempts, I finally got a lead https://integration.wikimedia.org/ci/job/integration-quibble-fullrun/153/console

The job executes a shell script, utils/ci-fullrun.sh, in integration/quibble.git; I have edited it to echo CI FULLRUN completed when the shell exits (using trap). Quibble has a bunch of extra logging added here and there, notably to list processes under /proc/.
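The /proc listing shown further down was produced by extra debugging code roughly like the following sketch (not the exact patch; note it joins the NUL-separated cmdline with spaces, whereas the real output below shows the arguments run together):

import glob
import logging

def log_left_processes():
    log = logging.getLogger()
    log.info('Left processes?')
    for path in sorted(glob.glob('/proc/[0-9]*/cmdline')):
        try:
            with open(path, 'rb') as f:
                # /proc/<pid>/cmdline separates arguments with NUL bytes
                cmdline = f.read().replace(b'\0', b' ').decode('utf-8', 'replace').strip()
        except OSError:
            continue  # the process exited between glob() and open()
        if cmdline:
            log.info('%s: %s', path, cmdline)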

I have manually killed the php process running PHPUnit to trigger a fault. Times below are UTC hh:mm:ss

12:59:25 INFO:root:Left processes?
12:59:25 /proc/1/cmdline:/sbin/docker-init--/src/utils/ci-fullrun.sh--color
12:59:25 /proc/25/cmdline:python3-s/tmp/quibble/bin/quibble--skipselenium--color
12:59:25 /proc/7/cmdline:/bin/bash/src/utils/ci-fullrun.sh--color

We are left with 3 processes:

  • docker-init
  • ci-fullrun.sh which is the shell entry point used
  • quibble
12:59:25 INFO:root:Finished Quibble run

Quibble has finished.

12:59:25 subprocess.CalledProcessError: Command '['php', 'tests/phpunit/phpunit.php', '--testsuite', 'extensions', '--group', 'Database', '--exclude-group', 'Broken,ParserFuzz,Stub,Standalone', '--log-junit', '/workspace/log/junit-db.xml']' returned non-zero exit status -15

I have killed php, hence the exit code -15.

12:59:25 + echo 'CI FULLRUN completed'
12:59:25 CI FULLRUN completed

At this point the shell script being run inside the Docker container has completed.

12:59:44 Build step 'Execute shell' marked build as failure

And it takes more than 18 seconds for the build step to complete :-\ I am now confident enough to state that Quibble is not the root cause. The problem seems to be either in Docker or in Jenkins itself.

Change 634465 abandoned by Hashar:
[integration/config@master] jjb: wmf job with quibble 44

Reason:

https://gerrit.wikimedia.org/r/634465

Next fun, digging into Jenkins!

I looked at https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/64820/ which got affected by the same issue (all times are UTC):

18:08:34 [wmf-quibble-selenium-php72-docker] $ /bin/bash /tmp/jenkins5993728317798275711.sh
// Quibble output
18:34:14 INFO:backend.MySQL:Terminating MySQL
18:35:12 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins5970126818379489927.sh

It ran on integration-agent-docker-1009, for which I took a couple of thread dumps while the job was waiting, using https://integration.wikimedia.org/ci/computer/integration-agent-docker-1009/systemInfo

A couple of thread dumps for the agent:

See https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMTAvMjAvLS1CLnR4dC0tMjAtMzQtMTM7Oy0tQS50eHQtLTIwLTM0LTEz

Not much standing out? There is a stdout copier thread for the script that runs Quibble (/tmp/jenkins5993728317798275711.sh). So I definitely took the thread dump while it was blocked, and there are some Groovy scripts running.

I also took two thread dumps of the Jenkins server:

See also: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMTAvMjAvLS1zZXJ2ZXIyLnR4dC0tMjAtMjctNTI7Oy0tc2VydmVyMS50eHQtLTIwLTI3LTUy&

The second (later) dump has a thread:

line=5
"Running CpsFlowExecution[Owner[maintenance-sample-workspace-sizes/123736:maintenance-sample-workspace-sizes #123736]] / waiting for integration-agent-docker-1009 id=6114779" daemon prio=5 TIMED_WAITING
    java.lang.Object.wait(Native Method)
    hudson.remoting.Request.call(Request.java:176)
    hudson.remoting.Channel.call(Channel.java:1000)
    hudson.util.RemotingDiagnostics.executeGroovy(RemotingDiagnostics.java:111)
    hudson.util.RemotingDiagnostics$executeGroovy.call(Unknown Source)
    org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
    org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
    com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
    com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
    com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
    com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
    sun.reflect.GeneratedMethodAccessor268.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:498)
    com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
    com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:89)
    com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
    com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixName(FunctionCallBlock.java:78)
    sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:498)
    com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
    com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
    com.cloudbees.groovy.cps.Next.step(Next.java:83)
    com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
    com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
    org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
    org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
    com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
    org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276)
    org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
    java.util.concurrent.FutureTask.run(FutureTask.java:266)
    hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:136)
    jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    java.util.concurrent.FutureTask.run(FutureTask.java:266)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:748)

That leads me to https://integration.wikimedia.org/ci/job/maintenance-sample-workspace-sizes/123736/console, which executes a Groovy script hitting all agents in order to look at the jobs' workspace disk usage and report back to statsd.

According to the console output it started at 18:34:00 and finished after 18:35:04. Looking at the build XML files, the success event was at 18:35:07. Could it be that the groovy script is holding some lock that prevents the other job from finishing the shell task?

I looked at the job in the task description, but it was held up for 4 minutes, so I am not sure how that matches.

I was watching integration-agent-docker-1004, which had 3 builds involving Quibble. I had opened pages from the Jenkins monitoring interface that lets one list processes and Java threads; as soon as one build showed the delay, I refreshed the two pages.

On the list of processes at https://integration.wikimedia.org/ci/monitoring/nodes/integration-agent-docker-1004?part=processes I see:

  • three docker run processes corresponding to the three builds.
  • a bunch of old docker run --rm --init -v /srv/jenkins/workspace/workspace:/workspaces docker-registry.wikimedia.org/releng/ci-stretch find /workspaces -mindepth 1 -maxdepth 1 -type d -exec du -s {} ;. They come from a maintenance job written in Groovy used to collect the disk space used by workspaces.
  • containerd only has TWO containerd-shim processes

So the containerized process has completed and is no longer present in the process table, but somehow containerd has not noticed it, or docker run did not receive the notification or ignored it. Which, well... sounds like a Docker bug?

We use Debian Stretch with:

apt-cache madison docker-ce containerd.io

docker-ce | 5:19.03.5~3-0~debian-stretch | http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/ci amd64 Packages
containerd.io | 1.2.10-3 | http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/ci amd64 Packages

Mentioned in SAL (#wikimedia-releng) [2020-10-26T10:56:59Z] <hashar> Upgrading containerd.io from 1.2.10 to 1.2.13-2 on integration-agent-docker-1001, integration-agent-docker-1002, integration-agent-docker-1003, integration-agent-docker-1004 # T265615

Mentioned in SAL (#wikimedia-releng) [2020-10-26T11:03:32Z] <hashar> Bring back integration-agent-docker-1020 . It is not the only one affected by T265615 which probably rules out Ceph as the slowness root cause (T260916)

Mentioned in SAL (#wikimedia-releng) [2020-10-26T11:12:10Z] <hashar> Upgrading containerd.io from 1.2.10 to 1.2.13-2 on integration-agent-docker-1005, integration-agent-docker-1006, integration-agent-docker-1007, integration-agent-docker-1008 # T265615

For later reference:

11:23:24 INFO:backend.MySQL:Terminating MySQL
11:24:41 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins937915457220888713.sh

And yet the containerd log shows the shim got reaped properly:

Oct 26 10:56:21 integration-agent-docker-1006 containerd[666]: time="2020-10-26T10:56:21.739785528Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/4c2dddd19e28b278dad4397630db404d287e6a176c9203474347b0cc2b0abded/shim.sock" debug=false pid=22410
Oct 26 11:23:27 integration-agent-docker-1006 containerd[666]: time="2020-10-26T11:23:27.106361593Z" level=info msg="shim reaped" id=4c2dddd19e28b278dad4397630db404d287e6a176c9203474347b0cc2b0abded

I have upgraded containerd from 1.2.10-3 to 1.2.13-2 on eight hosts but that unfortunately does not seem to help :-(

Mentioned in SAL (#wikimedia-releng) [2020-10-26T12:00:01Z] <hashar> Upgrading docker-ce to 19.03.13 and containerd to 1.3.7 on integration-agent-docker-1001, integration-agent-docker-1002, integration-agent-docker-1003 and integration-agent-docker-1004 # T265615

Same for Docker 19.03.13 / containerd 1.3.7: wmf-quibble-selenium-php72-docker builds still have a delay of a minute or so.

Change 636621 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: disable workspace size collection

https://gerrit.wikimedia.org/r/636621

Change 636621 merged by jenkins-bot:
[integration/config@master] jjb: disable workspace size collection

https://gerrit.wikimedia.org/r/636621

Mentioned in SAL (#wikimedia-releng) [2020-10-27T13:19:54Z] <hashar> Enabling puppet again on CI agent (that also causes Docker/containerd to be rolled back) # T265615

I have done various hacking here and there, nothing conclusive.

I have rolled back:


Dockerd running with debug logging enabled gives a lot more logs in the journal. At the end of the run, the container is reaped as far as dockerd is concerned. The issue seems to be between the docker run client and the docker daemon.

Some extra findings: we run the containers with --init, which instructs Docker to spawn the container with docker-init as PID 1; it is in charge of forwarding signals and reaping all child processes / zombies, etc. Passing TINI_VERBOSITY=3 to the container gives a bit more information. Example of me killing php:

13:37:57 INFO:backend.MySQL:Terminating MySQL

# expected 2 seconds wait

13:37:59 subprocess.CalledProcessError: Command '['php', 'tests/phpunit/phpunit.php', '--group', 'Database', '--exclude-group', 'Broken,ParserFuzz,Stub,Standalone', '--log-junit', '/workspace/log/junit-db.xml']' returned non-zero exit status -15
13:37:59 [INFO  tini (1)] Spawned child process '/src/utils/ci-fullrun.sh' with pid '7'
13:37:59 [DEBUG tini (1)] Received SIGCHLD
....

13:37:59 [DEBUG tini (1)] Reaped child with pid: '267'

# long wait which is the subject of this bug

13:38:05 + forensics
...

forensics is the result of a trap forensics EXIT in the shell script that invokes docker run.

The above definitely confirms to me that it is an issue with docker run not exiting immediately after the container has finished. I have no idea how to further debug that though. Maybe the docker command line client has some logging/debugging. I looked for the sources but couldn't find them under https://github.com/moby/moby

Change 634481 abandoned by Hashar:
[integration/quibble@master] (DO NOT SUBMIT) test dummy full run

Reason:

https://gerrit.wikimedia.org/r/634481

With TINI_VERBOSITY=3 and docker -D -l debug run:

08:58:27 [INFO  tini (1)] Main child exited normally (with status '1')
08:58:28 time="2020-10-28T08:58:28Z" level=debug msg="[hijack] End of stdout"
08:58:38 Build step 'Execute shell' marked build as failure
08:58:38 Archiving artifacts

I had strace attached to the docker run process:

lsof -p 25752
COMMAND   PID           USER   FD      TYPE             DEVICE  SIZE/OFF     NODE NAME
docker  25752 jenkins-deploy  cwd       DIR              253,1      4096   788000 /srv/jenkins/workspace/workspace/integration-quibble-fullrun
docker  25752 jenkins-deploy  rtd       DIR              254,3      4096        2 /
docker  25752 jenkins-deploy  txt       REG              254,3  88967576   396341 /usr/bin/docker
docker  25752 jenkins-deploy  mem       REG              254,3 102803208   262167 /var/cache/nscd/passwd
docker  25752 jenkins-deploy  mem       REG              254,3   1689360  1048605 /lib/x86_64-linux-gnu/libc-2.24.so
docker  25752 jenkins-deploy  mem       REG              254,3     14640  1048608 /lib/x86_64-linux-gnu/libdl-2.24.so
docker  25752 jenkins-deploy  mem       REG              254,3    135440  1048634 /lib/x86_64-linux-gnu/libpthread-2.24.so
docker  25752 jenkins-deploy  mem       REG              254,3    153288  1048601 /lib/x86_64-linux-gnu/ld-2.24.so
docker  25752 jenkins-deploy    0r     FIFO               0,10       0t0 37856380 pipe
docker  25752 jenkins-deploy    1w     FIFO               0,10       0t0 37856381 pipe
docker  25752 jenkins-deploy    2w     FIFO               0,10       0t0 37856381 pipe
docker  25752 jenkins-deploy    3u     unix 0xffff9c0f4c584800       0t0 37858369 type=STREAM
docker  25752 jenkins-deploy    4u  a_inode               0,11         0     7575 [eventpoll]
docker  25752 jenkins-deploy    5u     unix 0xffff9c0f40ca7400       0t0 37848939 type=STREAM
docker  25752 jenkins-deploy    6u     unix 0xffff9c0f29f3e000       0t0 37855997 type=STREAM
docker  25752 jenkins-deploy   63r     FIFO               0,10       0t0 37858367 pipe

File descriptor 5 comes from containerd-shim and is the process output; 1 is stdout and 2 is stderr, which are read by the Jenkins console.

strace -tt -yy
# Python traceback
08:58:27.167448 read(5<UNIX:[37848939->37856385]>, "\2\0\0\0\0\0\4\262Traceback (most recent c"..., 32777) = 1210
08:58:27.167629 write(2<pipe:[37856381]>, "Traceback (most recent call last"..., 1202) = 1202
08:58:27.168045 read(5<UNIX:[37848939->37856385]>, 0xc00060c000, 32777) = -1 EAGAIN (Resource temporarily unavailable)
08:58:27.168210 epoll_pwait(4<anon_inode:[eventpoll]>, [], 128, 0, NULL, 0) = 0
08:58:27.168356 epoll_pwait(4<anon_inode:[eventpoll]>, [{EPOLLIN|EPOLLOUT, {u32=2869452344, u64=140645868527160}}], 128, -1, NULL, 0) = 1
08:58:27.216842 futex(0x56457b0f20d0, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:27.216992 futex(0x56457b0f1fd0, FUTEX_WAKE_PRIVATE, 1) = 1
# tini output
08:58:27.217106 read(5<UNIX:[37848939->37856385]>, "\1\0\0\0\0\0\n\32[INFO  tini (1)] Spawned"..., 32777) = 2594
08:58:27.217293 write(1<pipe:[37856381]>, "[INFO  tini (1)] Spawned child p"..., 2586) = 2586
08:58:27.217471 read(5<UNIX:[37848939->37856385]>, 0xc00060c000, 32777) = -1 EAGAIN (Resource temporarily unavailable)

08:58:27.217603 epoll_pwait(4<anon_inode:[eventpoll]>, [], 128, 0, NULL, 0) = 0
08:58:27.217800 epoll_pwait(4<anon_inode:[eventpoll]>, [{EPOLLIN|EPOLLOUT|EPOLLHUP|EPOLLRDHUP, {u32=2869452344, u64=140645868527160}}], 128, -1, NULL, 0) = 1
08:58:28.230175 futex(0x56457b0f20d0, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:28.230378 futex(0x56457b0f1fd0, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:28.230586 read(5<UNIX:[37848939->37856385]>, "", 32777) = 0
08:58:28.231167 ioctl(2<pipe:[37856381]>, TCGETS, 0xc0002286b4) = -1 ENOTTY (Inappropriate ioctl for device)

# Docker run debug message: time="2020-10-28T08:58:28Z" level=debug msg="[hijack] End of stdout"
08:58:28.231596 openat(AT_FDCWD, "/etc//localtime", O_RDONLY) = 7</usr/share/zoneinfo/UCT>
08:58:28.231833 read(7</usr/share/zoneinfo/UCT>, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\0"..., 4096) = 127
08:58:28.232004 read(7</usr/share/zoneinfo/UCT>, "", 4096) = 0
08:58:28.232167 close(7</usr/share/zoneinfo/UCT>) = 0
08:58:28.232403 write(2<pipe:[37856381]>, "time=\"2020-10-28T08:58:28Z\" leve"..., 69) = 69

08:58:28.232624 futex(0xc0004f2148, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:28.232860 epoll_pwait(4<anon_inode:[eventpoll]>, [], 128, 0, NULL, 824640192576) = 0
08:58:28.233009 futex(0x56457b0f2d88, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
08:58:31.637180 futex(0x56457b0f2d88, FUTEX_WAIT_PRIVATE, 0, NULL) = 0

# five seconds delay??

08:58:36.637075 futex(0x56457b0f20c0, FUTEX_WAKE_PRIVATE, 1) = 0
08:58:36.637223 futex(0x56457b0f1fd0, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:36.641284 futex(0x56457b0f2d88, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
08:58:36.641482 futex(0x56457b0f20e8, FUTEX_WAKE_PRIVATE, 1) = 0
08:58:36.641624 futex(0x56457b0f1fd0, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:36.641824 futex(0x56457b0f2d88, FUTEX_WAIT_PRIVATE, 0, NULL) = 0

# Some more wait

08:58:38.146730 epoll_pwait(4<anon_inode:[eventpoll]>, [], 128, 0, NULL, 0) = 0
08:58:38.147022 futex(0xc0000b8148, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:38.147191 futex(0xc0000b8148, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:38.147363 futex(0xc00045b2c8, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:38.147560 futex(0x56457b0f2d88, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
08:58:38.147920 nanosleep({tv_sec=0, tv_nsec=3000}, NULL) = 0
08:58:38.148169 futex(0xc000502148, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:38.148300 futex(0xc000502148, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:38.148474 futex(0xc00045b2c8, FUTEX_WAKE_PRIVATE, 1) = 1
08:58:38.148661 futex(0x56457b0f2d88, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
08:58:38.154553 +++ exited with 1 +++

I was NOT tracing threads (strace -f) ...

Now with child tracing (-f) and time spent in system calls (-T): PID 23094 invokes epoll_pwait, which resumes after 3 seconds for whatever reason.

strace -s 128 -tt -yy -T -f
[pid 23092] 10:21:51.923319 write(2<pipe:[76629877]>, "time=\"2020-10-28T10:21:51Z\" level=debug msg=\"[hijack] End of stdout\"\n", 69) = 69 <0.000028>
[pid 23092] 10:21:51.923443 futex(0xc00032e848, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23094] 10:21:51.923498 <... futex resumed> ) = 0 <0.146771>
[pid 23092] 10:21:51.923531 <... futex resumed> ) = 1 <0.000064>
[pid 23094] 10:21:51.923631 epoll_pwait(4<anon_inode:[eventpoll]>, [], 128, 0, NULL, 0) = 0 <0.000017>
[pid 23092] 10:21:51.923711 epoll_pwait(4<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 23094] 10:21:51.923772 epoll_pwait(4<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 23092] 10:21:51.923823 <... epoll_pwait resumed> [], 128, 0, NULL, 824637311584) = 0 <0.000071>
[pid 23092] 10:21:51.923866 futex(0xc00032e4c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 23086] 10:21:51.923916 <... futex resumed> ) = 0 <0.147012>
[pid 23086] 10:21:51.923964 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000114>
[pid 23086] 10:21:51.924144 futex(0x55ab98437a90, FUTEX_WAIT_PRIVATE, 0, {tv_sec=60, tv_nsec=0}  <unfinished ...>
[pid 23094] 10:21:55.008992 <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT, {u32=2505785096, u64=140237482966792}}], 128, -1, NULL, 3) = 1 <3.085160>
[pid 23094] 10:21:55.009155 futex(0x55ab98437a90, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000065>
[pid 23086] 10:21:55.009303 <... futex resumed> ) = 0 <3.085131>

Change 636908 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] ci: run docker with debug logging

https://gerrit.wikimedia.org/r/636908

Looks like the container is stuck in removing status. For a build with:

13:09:39 INFO:backend.MySQL:Terminating MySQL
13:10:33 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins4716707043969520657.sh

Running docker container inspect in a loop shows:

{
  "Status": "removing",
  "Running": false,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": true,
  "Pid": 0,
  "ExitCode": 0,
  "Error": "",
  "StartedAt": "2020-10-28T12:42:43.482992712Z",
  "FinishedAt": "2020-10-28T13:09:41.834741168Z"
}

The two-second difference is Quibble waiting for MySQL to terminate before issuing a SIGKILL. The inspect loop stopped at 13:10:33, when the container was no longer known to Docker.
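The loop was nothing more elaborate than repeatedly calling docker container inspect until Docker forgot about the container, along the lines of this sketch (the container id and the one-second polling interval are placeholders):

import json
import subprocess
import time

CONTAINER = '4c2dddd19e28'  # placeholder container id

while True:
    proc = subprocess.run(
        ['docker', 'container', 'inspect', CONTAINER],
        capture_output=True, text=True)
    if proc.returncode != 0:
        print(time.strftime('%H:%M:%S'), 'container is gone')
        break
    # inspect returns a JSON array with one entry per container
    state = json.loads(proc.stdout)[0]['State']
    print(time.strftime('%H:%M:%S'), state['Status'], 'Dead:', state['Dead'])
    time.sleep(1)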

So my best assumption is that the source code clone / package installs etc. all happen inside the Docker container. Upon removal, Docker spends a lot of time deleting all those files, which might explain the issue. If that is true, it can be reproduced by spawning a container, writing a ton of files, and checking whether it takes a long time to remove it.
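A possible reproduction along those lines, as a sketch: the image, container name and file count are arbitrary, and it assumes the image ships coreutils (seq, xargs, touch).

import subprocess
import time

IMAGE = 'docker-registry.wikimedia.org/releng/ci-stretch'  # any image with coreutils
NAME = 'unlink-repro'  # arbitrary container name

# Fill the container's writable layer with a ton of small files,
# vaguely comparable to an npm install of a large dependency tree.
subprocess.run(
    ['docker', 'run', '--name', NAME, IMAGE, 'sh', '-c',
     'mkdir /files && cd /files && seq 1 200000 | xargs touch'],
    check=True)

# Removing the container is when dockerd has to unlink every one of
# those files; on slow storage this is where the minutes would go.
start = time.monotonic()
subprocess.run(['docker', 'rm', NAME], check=True)
print('docker rm took %.1f seconds' % (time.monotonic() - start))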

The wmf-quibble-selenium-php72-docker job has the following volumes:

--volume /srv/git:/srv/git:ro 
--volume /srv/jenkins/workspace/workspace/wmf-quibble-selenium-php72-docker/cache:/cache 
--volume /srv/jenkins/workspace/workspace/wmf-quibble-selenium-php72-docker/log:/workspace/log

Quibble runs in /workspace and clones / installs everything under /workspace/src. That directory is not mounted, so everything happens inside the container. I guess the fix might be to mount /workspace from the host and have a publisher delete it once the job has completed.

There is something off in the JJB macros.

I have looked at it live by running strace profiling against dockerd while a container was being reaped:

strace -C -e trace=file -f -p `pidof dockerd`
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.50    0.221674           2    145394     11771 unlinkat
  3.41    0.008000        2667         3         1 umount2
  1.71    0.004000        1000         4           fchmodat
  0.29    0.000674           0     11782           openat
  0.09    0.000220           0     11774         3 newfstatat
  0.00    0.000000           0         4           renameat
------ ----------- ----------- --------- --------- ----------------
100.00    0.234568                168961     11775 total

And that is not even a full deletion. But essentially, yes, the slowness is due to the file deletion taking ages. Some unlinkat calls even take 100–200 ms. We have too many files to delete and there is a lot of I/O contention.

I have found the root cause, which is in the underlying infrastructure. I will file a task about it tomorrow.

Change 636908 abandoned by Hashar:
[operations/puppet@production] ci: run docker with debug logging

Reason:
I have removed the patch from the integration puppet master and reloaded docker

https://gerrit.wikimedia.org/r/636908

I guess Wikibase might be hit harder than other extensions by the deletion slowness due to a larger volume of files in the node_modules of subprojects like Wikidata Bridge?

Edit:

$ git grep --recurse-submodules 'version' '**/package-lock.json' | wc -l
10306

Some 10k npm packages in the three “subprojects” that commit their lockfiles (four other libraries related to the Wikibase data model don’t commit their lockfiles, though I would guess that those have fewer dependencies).

@Lucas_Werkmeister_WMDE yes, Wikibase kind of highlights the issue since it has so many files. I have also noticed the disk slowness on various other jobs though; some have started to time out as a result of the number of IO operations they perform.

I thought about having the files written on the host via a bind mount; however, we would still need to delete those files during the job, and I guess that would still take a while to complete. It is just that we would do the rm -fR ourselves instead of dockerd handling it for us. So I don't think that would improve anything.

I am very confident that is related to the WMCS migration of storage to Ceph and filed the subtask T266777: integration instances suffer from high IO latency due to Ceph

That has been more or less fixed by raising the newish IO quota for the integration instances. It is still not ideal and enhancements will be made later on via the infrastructure task T266777.

Sorry @Lucas_Werkmeister_WMDE, I had missed your comment. @kostajh raised it again today, so reopening.

21:26:14 <kostajh> hashar: does it seem normal for INFO:backend.MySQL:Terminating MySQL to take ~2 minutes? looking at https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/114519/console

Quibble terminates MySQL via quibble.backend.Backend.stop() which does:

def stop(self):
    if self.server is not None:
        self.log.info('Terminating %s', self.__class__.__name__)
        self.server.terminate()
        try:
            self.server.wait(2)
        except subprocess.TimeoutExpired:
            self.server.kill()  # SIGKILL
        finally:
            self.server = None

terminate() sends the SIGTERM signal. It might be ignored, and the server can definitely take time to shut down. We allow a two-second grace period (wait(2)); then it is sent SIGKILL, which really should terminate it abruptly, no questions asked.

So I don't know why it would be hanging :( Maybe something is still running or holding it up :-\

Pfff, I am not focusing today. The INFO:backend.MySQL:Terminating MySQL is the last message sent by Quibble, and it has indeed completed (at most roughly 2 seconds after the message got written). What is taking ages, and what got thoroughly investigated previously, is the container deletion. From the above message:

T265615#6586508
essentially yeah the slowness is due to the file deletion taking ages. Some unlinkat calls even take 100 - 200 ms. We have too many files to delete and there is a lot of I/O contention.

Which led to T266777 to raise the disk IO quota. I guess we should tune it up further, unless Docker can be made to handle the file deletions asynchronously.

I have checked in Docker and it does an os.RemoveAll which is waited on for completion. If a container has a huge amount of files to be deleted, that fills the memory cache for writes, which is known in Linux as dirty memory. From proc(5):

Dirty %lu
Memory which is waiting to get written back to the disk.

That is tunable via system settings:

$ sudo sysctl -ar 'dirty' | egrep -v "= 0$"
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200

And in Graphite, looking at a random agent using the metrics:

secondYAxis(integration.integration-agent-docker-1002.iostat.vda.writes_per_second)
integration.integration-agent-docker-1002.memory.Dirty

The Graphite graph (green: dirty memory waiting to be written, blue: writes per second):

disk_write_dirty.png (386×592 px, 36 KB)

I would assume the unlinkat calls pile up in memory, fill it up, and at some point the cache is full and the unlinks become synchronous? Maybe if we raised the dirty cache limit (vm.dirty_ratio) we would get a larger buffer. I don't know much about the arcana of the Linux disk cache though :-\
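For what it is worth, the Dirty figure can also be watched directly on an agent while a container is being removed, without going through Graphite, with a small sampler such as this sketch:

import time

def dirty_kib():
    # 'Dirty: <n> kB' in /proc/meminfo: memory waiting to be written back to disk
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('Dirty:'):
                return int(line.split()[1])
    return 0

# Print a sample every second for a minute.
for _ in range(60):
    print(time.strftime('%H:%M:%S'), dirty_kib(), 'kB dirty')
    time.sleep(1)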

This is just a walk-by comment so feel free to ignore, but didn't we have a nearly identical issue last year, and the solution was to put all these little files in a tmpfs which could simply be unmounted when the container is destroyed? If this sounds like something to pursue, I could track down the details...

@awight potentially, though tmpfs lives in memory and is expensive. A run with core + Wikibase and all the npm dependencies turns out to be around 9G IIRC, which is a lot to put in RAM.

I was looking at a stalled build today, and iotop definitely showed the Docker daemon being IO busy. Ideally the container destruction would happen asynchronously rather than blocking and preventing docker run from completing, but that is how things are :/

On the "hardware" side, files are written to the instance "disk", which is backed by Ceph. There is some IO throttling there. If we could find a way to make the underlying storage less synchronous, or accept data loss, we would certainly save a bunch of time. I have read a presentation about making CephFS unlink asynchronous (no idea whether we use that), which indicates all unlinks are currently synchronous; deleting the myriad of files installed by npm would definitely explain the slowness :-\ https://archive.fosdem.org/2020/schedule/event/sds_ceph_async_directory_ops/attachments/slides/3962/export/events/attachments/sds_ceph_async_directory_ops/slides/3962/async_dirops_cephfs.pdf

I have talked to @aborrero and @dcaro about the limits. They are inclined to further tune the limits, as was done last year by Andrew (T266777#6589231), and tentatively raise them.

I've added a burst configuration of 3x the IOPS for 10s for that flavor; it might require stopping and starting (not restarting) the VMs:

root@cloudvirt1024:~# openstack --os-project-id integration flavor show g2.cores8.ram24.disk80.4xiops
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                      | Value                                                                                                                                                                                                                                    |
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | True                                                                                                                                                                                                                                     |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                                                                                                                                                                                        |
| access_project_ids         | integration                                                                                                                                                                                                                              |
| disk                       | 80                                                                                                                                                                                                                                       |
| id                         | c86647d3-d157-43fa-9849-2bd16ef27a5f                                                                                                                                                                                                     |
| name                       | g2.cores8.ram24.disk80.4xiops                                                                                                                                                                                                            |
| os-flavor-access:is_public | False                                                                                                                                                                                                                                    |
| properties                 | aggregate_instance_extra_specs:ceph='true', quota:disk_read_iops_sec='20000', quota:disk_total_bytes_sec='800000000', quota:disk_write_iops_sec='2000', quota:disk_write_iops_sec_max='60000', quota:disk_write_iops_sec_max_length='10' |
| ram                        | 24576                                                                                                                                                                                                                                    |
| rxtx_factor                | 1.0                                                                                                                                                                                                                                      |
| swap                       |                                                                                                                                                                                                                                          |
| vcpus                      | 8                                                                                                                                                                                                                                        |
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Can you try that and report back if there was any change?
We can try increasing the base limit instead if this does not work

Mentioned in SAL (#wikimedia-releng) [2022-01-14T13:49:34Z] <hashar> Restarting all CI Docker agents via Horizon to apply new flavor settings T265615 T299211

Calling this resolved. I took a look at a handful of selenium jobs for a few repos (core, Wikibase); in each instance, INFO:backend.MySQL:Terminating MySQL seems to take less than a second now. So either our IOPS quotas are tuned well, or one of the various version upgrades is doing its job.