
Terminating MySQL takes several minutes in (Wikibase?) CI jobs
Open, High, Public

Description

I’ve noticed that the “Terminating MySQL” step at the end of a CI build can take several minutes. For example, in this recent job:

14:42:01 INFO:quibble.commands:Shutting down backends: <php DevWebServer http://127.0.0.1:9412 /workspace/src> <Xvfb :94> <ChromeWebDriver :94>
14:42:01 INFO:backend.ChromeWebDriver:Terminating ChromeWebDriver
14:42:01 INFO:backend.Xvfb:Terminating Xvfb
14:42:01 INFO:backend.DevWebServer:Terminating DevWebServer
14:42:01 INFO:quibble.commands:Shutting down backends: <mysql /workspace/db/quibble-mysql-ub2yb_5l/socket>
14:42:01 INFO:backend.MySQL:Terminating MySQL
14:46:16 [wmf-quibble-selenium-php72-docker@2] $ /bin/bash -xe /tmp/jenkins6976447998931934422.sh

I have no idea what Quibble or MySQL were doing there, but I doubt that it was really worth four minutes(!) of CI time.

So far, I’ve only noticed this in Wikibase jobs, but that might just be because I don’t often look at the CI output for other repositories.

Event Timeline

hashar added subscribers: kostajh, awight, hashar.

The relevant code:

quibble/backend.py
class BackendServer:
...
    def stop(self):
        if self.server is not None:
            self.log.info('Terminating %s', self.__class__.__name__)
            self.server.terminate()
            try:
                self.server.wait(2)
            except subprocess.TimeoutExpired:
                self.server.kill()  # SIGKILL
            finally:
                self.server = None

Where self.server is a Popen instance. So we basically send SIGTERM, busy-wait for up to two seconds, and then send SIGKILL if the process still has not exited. Clearly that does not work as expected :-\

The Python documentation for Popen.wait() mentions it can deadlock when sending stdout/stderr to pipes:

Note

This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.
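
(For reference, a minimal sketch of the pattern the documentation recommends when pipes are in use; it is hypothetical here, since Quibble does not use pipes for MySQL, as shown below:)

import subprocess

def stop_with_pipes(proc: subprocess.Popen, timeout: float = 2) -> None:
    # communicate() keeps draining stdout/stderr while waiting, so the child
    # can never block on a full pipe buffer the way wait() would let it.
    proc.terminate()
    try:
        proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()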

But we use /dev/null when we start the server:

self.server = subprocess.Popen([
    '/usr/sbin/mysqld',  # fixme drop path
    '--skip-networking',
    '--datadir=%s' % self.rootdir,
    '--log-error=%s' % self.errorlog,
    '--pid-file=%s' % self.pidfile,
    '--socket=%s' % self.socket,
    ],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

Some stuff we can try:

  • obviously, reproduce it
  • check whether the MariaDB version has changed in the CI image
  • add some logging in the stop command (since clearly subprocess.TimeoutExpired should be raised and cause a SIGKILL to be sent)
  • add some option to capture MySQL stdout/stderr (using stream_relay, which consumes a PIPE and outputs it via logging; see the sketch below).
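
A rough idea of what such a relay could look like (a sketch only; Quibble's actual stream_relay helper may differ):

import logging
import subprocess
import threading

def relay_stderr(proc: subprocess.Popen, log: logging.Logger) -> threading.Thread:
    # Forward each line the child writes to stderr to the logger. Assumes the
    # process was started with stderr=subprocess.PIPE; the thread drains the
    # pipe continuously, so its buffer can never fill up and block the child.
    def pump():
        for line in proc.stderr:
            log.info(line.decode('utf-8', errors='replace').rstrip())
    thread = threading.Thread(target=pump, daemon=True)
    thread.start()
    return thread
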
hashar triaged this task as High priority. Thu, Oct 15, 4:46 PM

Will look at it tomorrow and try to at least reproduce the issue.

Looking at some random builds, it usually takes 15 to 20 seconds to start the next command, which is a bug in itself:

00:17:39.940 INFO:backend.MySQL:Terminating MySQL
00:17:55.311 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins4972097457850427826.sh
00:19:01.909 INFO:backend.MySQL:Terminating MySQL
00:19:20.063 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins443040905686928605.sh
00:18:17.942 INFO:backend.MySQL:Terminating MySQL
00:18:34.254 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins7269376008034149723.sh

I have looked at the job duration trend on https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/buildTimeTrend and the job is consistently way too slow whenever it happens to run on integration-agent-docker-1020:

00:30:08.778 INFO:backend.MySQL:Terminating MySQL
00:31:21.208 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins8569331040755739778.sh
00:38:54.981 INFO:backend.MySQL:Terminating MySQL
00:42:18.062 [wmf-quibble-selenium-php72-docker@2] $ /bin/bash -xe /tmp/jenkins6862799909867122709.sh
00:39:41.829 INFO:backend.MySQL:Terminating MySQL
00:44:44.184 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins6896978991300300753.sh
00:31:37.581 INFO:backend.MySQL:Terminating MySQL
00:32:57.181 [wmf-quibble-selenium-php72-docker@2] $ /bin/bash -xe /tmp/jenkins6303269567405582114.sh
00:30:08.106 INFO:backend.MySQL:Terminating MySQL
00:31:22.239 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins6647671750866068876.sh

That agent runs on a WMCS host that uses Ceph as its storage backend (T260916). That by itself does not explain the problem, since other agents are affected as well, though by seconds rather than minutes! I will pull it out of the pool.

Mentioned in SAL (#wikimedia-releng) [2020-10-16T07:48:42Z] <hashar> Disabling integration-agent-docker-1020 (the sole agent using Ceph): it is too slow # T260916 T265615

Change 634465 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: wmf job with quibble 44

https://gerrit.wikimedia.org/r/634465

To rule out Quibble 0.0.45, I have created a dummy job using docker-registry.wikimedia.org/releng/quibble-stretch-php72:0.0.44-s5. And it is still affected:

00:21:32.612 INFO:backend.MySQL:Terminating MySQL
00:22:36.029 [wmf-quibble44-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins2652394246537911623.sh

So either the logic is faulty somewhere, or Quibble is being held up by something else that does not show up in the logs.

CI for integration/quibble does have a job that runs it entirely against mediawiki/core. The last few builds more or less show the same issue:

00:11:36.278 INFO:backend.MySQL:Terminating MySQL
00:11:41.731 Archiving artifacts
00:11:09.274 INFO:backend.MySQL:Terminating MySQL
00:11:14.666 Archiving artifacts
00:13:52.197 INFO:backend.MySQL:Terminating MySQL
00:13:57.506 Archiving artifacts

That is, more than the two seconds the code should allow?

Change 634481 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/quibble@master] (DO NOT SUBMIT) test dummy full run

https://gerrit.wikimedia.org/r/634481

Gave it a try with a bit more logging. One run with the 2-second timeout, which is too short and triggers a SIGKILL:

00:22:05.191 INFO:quibble.commands:Shutting down backends: <mysql /tmp/quibble-mysql-avmk3d51/socket>
00:22:05.191 INFO:backend.MySQL:Stopping MySQL
00:22:05.192 INFO:backend.MySQL:Terminating MySQL
00:22:05.192 INFO:backend.MySQL:Waiting up to two seconds
00:22:07.184 INFO:backend.MySQL:Timeout sending SIGKILL
00:22:07.184 INFO:backend.MySQL:self.server = None
00:22:07.184 INFO:backend.MySQL:Completed stop()
00:22:07.184 INFO:quibble.cmd:Out of context
00:22:07.185 INFO:root:Finished Quibble run
00:22:11.905 Archiving artifacts

And after raising the timeout so that the SIGTERM has time to take effect:

00:10:34.044 INFO:quibble.commands:Shutting down backends: <mysql /tmp/quibble-mysql-hrv8i82m/socket>
00:10:34.044 INFO:backend.MySQL:Stopping MySQL
00:10:34.044 INFO:backend.MySQL:Terminating MySQL
00:10:34.044 INFO:backend.MySQL:Waiting up to five seconds
00:10:36.466 INFO:backend.MySQL:self.server = None
00:10:36.466 INFO:backend.MySQL:Completed stop()
00:10:36.466 INFO:quibble.cmd:Out of context
00:10:36.466 INFO:root:Finished Quibble run
00:10:39.529 Archiving artifacts

Nothing conclusive unfortunately :-\

I tried various combinations and I cannot reproduce it. Maybe it is a race condition between us sending SIGKILL (self.server.kill()) and Python reaping the process on garbage collection when we set self.server = None :-\
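
One way to rule that out would be to reap the child explicitly after the SIGKILL instead of leaving it to garbage collection (a sketch with hypothetical names, not something Quibble does today):

import subprocess

def stop_and_reap(server: subprocess.Popen, log, timeout: float = 2) -> None:
    # Same shutdown sequence as BackendServer.stop(), but explicitly reap the
    # child after SIGKILL instead of relying on garbage collection.
    log.info('Terminating %s', server.args[0])
    server.terminate()
    try:
        server.wait(timeout)
    except subprocess.TimeoutExpired:
        server.kill()   # SIGKILL
        server.wait()   # reap; rules out the GC race described above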

After many attempts, I finally got a lead https://integration.wikimedia.org/ci/job/integration-quibble-fullrun/153/console

The job executes a shell script, utils/ci-fullrun.sh, in integration/quibble.git; I have edited it to echo CI FULLRUN completed when the shell exits (using trap). Quibble has a bunch of extra logging added here and there, notably to list the processes under /proc/.
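
The /proc listing amounts to something like this (a sketch of the idea; the actual logging lives in the test change above):

import glob
import logging

def log_left_processes(log: logging.Logger) -> None:
    # Log the command line of every process still visible inside the container.
    # /proc/<pid>/cmdline is NUL separated, which is why the arguments appear
    # glued together in the raw console output below; here we substitute spaces.
    log.info('Left processes?')
    for path in glob.glob('/proc/[0-9]*/cmdline'):
        try:
            with open(path, 'rb') as f:
                cmdline = f.read().replace(b'\x00', b' ').decode().strip()
        except OSError:
            continue  # the process exited while we were iterating
        if cmdline:
            log.info('%s: %s', path, cmdline)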

I have manually killed the php process running PHPUnit to trigger a fault. Times below are UTC hh:mm:ss.

12:59:25 INFO:root:Left processes?
12:59:25 /proc/1/cmdline:/sbin/docker-init--/src/utils/ci-fullrun.sh--color
12:59:25 /proc/25/cmdline:python3-s/tmp/quibble/bin/quibble--skipselenium--color
12:59:25 /proc/7/cmdline:/bin/bash/src/utils/ci-fullrun.sh--color

We are left with 3 processes:

  • docker-init
  • ci-fullrun.sh which is the shell entry point used
  • quibble

12:59:25 INFO:root:Finished Quibble run

Quibble has finished.

12:59:25 subprocess.CalledProcessError: Command '['php', 'tests/phpunit/phpunit.php', '--testsuite', 'extensions', '--group', 'Database', '--exclude-group', 'Broken,ParserFuzz,Stub,Standalone', '--log-junit', '/workspace/log/junit-db.xml']' returned non-zero exit status -15

I have killed php, hence the exit status -15 (SIGTERM).

12:59:25 + echo 'CI FULLRUN completed'
12:59:25 CI FULLRUN completed

At this point the shell script being run inside the Docker container has completed.

12:59:44 Build step 'Execute shell' marked build as failure

And it takes more than 18 seconds for the build step to be marked as failed :-\ I am now fairly confident that Quibble is not the root cause. That seems to be either in Docker or in Jenkins itself.

Change 634465 abandoned by Hashar:
[integration/config@master] jjb: wmf job with quibble 44

Reason:

https://gerrit.wikimedia.org/r/634465

Next fun, digging into Jenkins!

I looked at https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/64820/ which got affected by the same issue (all times are UTC):

18:08:34 [wmf-quibble-selenium-php72-docker] $ /bin/bash /tmp/jenkins5993728317798275711.sh
// Quibble output
18:34:14 INFO:backend.MySQL:Terminating MySQL
18:35:12 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins5970126818379489927.sh

It ran on integration-agent-docker-1009, for which I took a couple of thread dumps while the job was waiting, using https://integration.wikimedia.org/ci/computer/integration-agent-docker-1009/systemInfo

A couple of thread dumps for the agent:

See https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMTAvMjAvLS1CLnR4dC0tMjAtMzQtMTM7Oy0tQS50eHQtLTIwLTM0LTEz

Not much standing out? There is a stdout copier thread for the script that runs Quibble (/tmp/jenkins5993728317798275711.sh), so I definitely took the thread dump while it was blocked, and there are some Groovy scripts running.

I also took two thread dumps of the Jenkins server:

See also: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMTAvMjAvLS1zZXJ2ZXIyLnR4dC0tMjAtMjctNTI7Oy0tc2VydmVyMS50eHQtLTIwLTI3LTUy&

The later (second) dump has a thread:

"Running CpsFlowExecution[Owner[maintenance-sample-workspace-sizes/123736:maintenance-sample-workspace-sizes #123736]] / waiting for integration-agent-docker-1009 id=6114779" daemon prio=5 TIMED_WAITING
    java.lang.Object.wait(Native Method)
    hudson.remoting.Request.call(Request.java:176)
    hudson.remoting.Channel.call(Channel.java:1000)
    hudson.util.RemotingDiagnostics.executeGroovy(RemotingDiagnostics.java:111)
    hudson.util.RemotingDiagnostics$executeGroovy.call(Unknown Source)
    org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
    org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
    com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
    com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
    com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
    com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
    sun.reflect.GeneratedMethodAccessor268.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:498)
    com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
    com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:89)
    com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
    com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixName(FunctionCallBlock.java:78)
    sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:498)
    com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
    com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
    com.cloudbees.groovy.cps.Next.step(Next.java:83)
    com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
    com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
    org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
    org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
    com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
    org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276)
    org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
    java.util.concurrent.FutureTask.run(FutureTask.java:266)
    hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:136)
    jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    java.util.concurrent.FutureTask.run(FutureTask.java:266)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:748)

That leads me to https://integration.wikimedia.org/ci/job/maintenance-sample-workspace-sizes/123736/console which executes a Groovy script hitting all agents in order to look at the jobs' workspace disk usage and report back to statsd.

According to the console output it started at 18:34:00 and finished after 18:35:04. Looking at the build XML files, the success event was at 18:35:07. Could it be that the Groovy script is holding some lock that prevents the other job from finishing its shell task?

I looked at the job in the task description, but it was held up for 4 minutes, so I am not sure how that matches.

hashar claimed this task. Wed, Oct 21, 5:24 PM

I was watching integration-agent-docker-1004, which had 3 builds involving Quibble. I have opened pages using the Jenkins monitoring interface that lets one list processes and Java threads. As soon as one build showed the delay, I refreshed the two pages.

On the list of processes at https://integration.wikimedia.org/ci/monitoring/nodes/integration-agent-docker-1004?part=processes I see:

  • three docker run clients corresponding to the three builds.
  • a bunch of old docker run --rm --init -v /srv/jenkins/workspace/workspace:/workspaces docker-registry.wikimedia.org/releng/ci-stretch find /workspaces -mindepth 1 -maxdepth 1 -type d -exec du -s {} ;. They come from a maintenance job written in Groovy used to collect the disk space used by workspaces.
  • containerd only has TWO containerd-shim processes

So the containerized process has completed and is no longer present in the process table, but somehow either containerd hasn't noticed it, or docker run did not receive the notification or ignored it. Which, well ... sounds like a Docker bug?
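
A rough way to spot the mismatch on an agent (a debugging sketch only; it assumes the docker CLI and pgrep are available on the host):

import subprocess

def docker_run_vs_containers() -> None:
    # Compare the number of running containers with the number of lingering
    # `docker run` client processes; a client with no matching container is
    # the symptom described above.
    containers = subprocess.run(
        ['docker', 'ps', '-q'], capture_output=True, text=True, check=True
    ).stdout.split()
    clients = subprocess.run(
        ['pgrep', '-af', 'docker run'], capture_output=True, text=True
    ).stdout.splitlines()
    print('running containers:', len(containers))
    print('docker run clients:', len(clients))
    for line in clients:
        print(' ', line)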

We use Debian Stretch with:

apt-cache madison docker-ce containerd.io

docker-ce | 5:19.03.5~3-0~debian-stretch | http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/ci amd64 Packages
containerd.io | 1.2.10-3 | http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/ci amd64 Packages

Mentioned in SAL (#wikimedia-releng) [2020-10-26T10:56:59Z] <hashar> Upgrading containerd.io from 1.2.10 to 1.2.13-2 on integration-agent-docker-1001, integration-agent-docker-1002, integration-agent-docker-1003, integration-agent-docker-1004 # T265615

Mentioned in SAL (#wikimedia-releng) [2020-10-26T11:03:32Z] <hashar> Bring back integration-agent-docker-1020 . It is not the only one affected by T265615 which probably rules out Ceph as the slowness root cause (T260916)

Mentioned in SAL (#wikimedia-releng) [2020-10-26T11:12:10Z] <hashar> Upgrading containerd.io from 1.2.10 to 1.2.13-2 on integration-agent-docker-1005, integration-agent-docker-1006, integration-agent-docker-1007, integration-agent-docker-1008 # T265615

For later reference:

11:23:24 INFO:backend.MySQL:Terminating MySQL
11:24:41 [wmf-quibble-selenium-php72-docker] $ /bin/bash -xe /tmp/jenkins937915457220888713.sh

And the containerd log shows the shim got reaped properly?:

Oct 26 10:56:21 integration-agent-docker-1006 containerd[666]: time="2020-10-26T10:56:21.739785528Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/4c2dddd19e28b278dad4397630db404d287e6a176c9203474347b0cc2b0abded/shim.sock" debug=false pid=22410
Oct 26 11:23:27 integration-agent-docker-1006 containerd[666]: time="2020-10-26T11:23:27.106361593Z" level=info msg="shim reaped" id=4c2dddd19e28b278dad4397630db404d287e6a176c9203474347b0cc2b0abded

I have upgraded containerd from 1.2.10-3 to 1.2.13-2 on eight hosts but that unfortunately does not seem to help :-(

Mentioned in SAL (#wikimedia-releng) [2020-10-26T12:00:01Z] <hashar> Upgrading docker-ce to 19.03.13 and containerd to 1.3.7 on integration-agent-docker-1001, integration-agent-docker-1002, integration-agent-docker-1003 and integration-agent-docker-1004 # T265615

Same for Docker 19.03.13 / containerd 1.3.7: wmf-quibble-selenium-php72-docker builds still have a delay of a minute or so.

Change 636621 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: disable workspace size collection

https://gerrit.wikimedia.org/r/636621

Change 636621 merged by jenkins-bot:
[integration/config@master] jjb: disable workspace size collection

https://gerrit.wikimedia.org/r/636621

Mentioned in SAL (#wikimedia-releng) [2020-10-27T13:19:54Z] <hashar> Enabling puppet again on CI agent (that also causes Docker/containerd to be rolled back) # T265615

Mentioned in SAL (#wikimedia-releng) [2020-10-27T13:20:30Z] <hashar> Enable Docker daemon debug mode on integration-agent-docker-1001 via Horizon https://horizon.wikimedia.org/project/instances/837fff63-ed52-4c2f-823d-936801056fd2/ # T265615

I have done various hacking here and there; nothing conclusive.

I have rolled back:


Dockerd running in debug mode gives a lot more logs in the journal. At the end of the run, the container is reaped as far as dockerd knows. The issue seems to be between the docker run client and the docker daemon.

Some extra findings: we run the containers with --init, which instructs Docker to spawn the container with docker-init (tini) as PID 1; it is in charge of forwarding signals and reaping all child processes / zombies etc. Passing TINI_VERBOSITY=3 to the container gives a bit more information. Example of me killing php:

13:37:57 INFO:backend.MySQL:Terminating MySQL

# expected 2 seconds wait

13:37:59 subprocess.CalledProcessError: Command '['php', 'tests/phpunit/phpunit.php', '--group', 'Database', '--exclude-group', 'Broken,ParserFuzz,Stub,Standalone', '--log-junit', '/workspace/log/junit-db.xml']' returned non-zero exit status -15
13:37:59 [INFO  tini (1)] Spawned child process '/src/utils/ci-fullrun.sh' with pid '7'
13:37:59 [DEBUG tini (1)] Received SIGCHLD
....

13:37:59 [DEBUG tini (1)] Reaped child with pid: '267'

# long wait which is the subject of this bug

13:38:05 + forensics
...

The forensics output comes from a trap forensics EXIT in the shell script that invokes docker run.

The above definitely confirms to me that it is an issue with docker run not exiting immediately after the container has finished. I have no idea how to debug that further though. Maybe the docker command-line client has some logging/debugging options. I looked for the sources but couldn't find them under https://github.com/moby/moby