
Find a way to significantly speed up the zuul-cloner step
Open, Needs Triage, Public

Description

Split out from T351357#9337934:

One thing that I think about is the setup costs of each job. Would lowering setup cost (e.g., cloning, fetching dependencies, setting up resources) make splitting jobs easier?

Yes, speeding up the clone step would be brilliant. For instance, on the selenium test above:

  • Clone the relevant code: 1m 36s
  • Install composer dependencies: 7s
  • Install npm dependencies: 6s

I think your work on making dependency installation fast has been brilliant, but the actual fetching from Gerrit is now our significant speed bump.

Event Timeline

The cloner can clone from a local cache copy; the repositories I originally added to that cache were the largest ones and/or the ones with the most traffic. The arbitrarily defined list is in Puppet: https://gerrit.wikimedia.org/g/operations/puppet/+/refs/heads/production/modules/profile/manifests/ci/gitcache.pp. After cloning from the cache, the cloner fetches from Gerrit to bring the repository up to date. If a local cache copy does not exist, it does a full clone from Gerrit. Then there is some logic to attempt to retrieve the patch from the Zuul merger.
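For a repository that is in the cache, the sequence is roughly equivalent to the following git operations (a minimal sketch of the behaviour described above, not zuul-cloner's actual code; the workspace path is a placeholder):

  # Clone from the local cache, then point origin back at Gerrit and update
  git clone /srv/git/mediawiki/core.git /workspace/src/mediawiki/core
  cd /workspace/src/mediawiki/core
  git remote set-url origin https://gerrit.wikimedia.org/r/mediawiki/core
  git fetch origin
  # finally the patch under test is fetched from the Zuul merger and checked out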

When the local cache is used, the log shows "from cache" and mediawiki/core is cloned, updated, and the patch checked out in roughly 8 seconds:

00:00:11.825 INFO:zuul.Cloner:Creating repo mediawiki/core from cache /srv/git/mediawiki/core.git
00:00:14.786 INFO:zuul.Cloner:Updating origin remote in repo mediawiki/core to https://gerrit.wikimedia.org/r/mediawiki/core
00:00:18.728 INFO:zuul.Cloner:upstream repo has branch master
00:00:19.595 INFO:zuul.Cloner:Prepared mediawiki/core repo with commit f7195a819fafec274566d7bbbca564b53f2e27c2

And I think that still copies all the objects, because the cache mirror directory and the workspace where repositories are written are on different partitions (if they were on the same partition, git would use hardlinks for the object files under .git, which is nearly instant; in practice a local full clone is fast enough anyway).
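The hardlinking behaviour is easy to check by hand (a hedged illustration; the cache path matches the log above, the destination paths are made up):

  # Local clone onto the same filesystem: object files are hardlinked, nearly instant
  git clone /srv/git/mediawiki/core.git /srv/git/tmp-core
  # The packfile in the new clone shares its inode with the cache copy
  ls -i /srv/git/mediawiki/core.git/objects/pack/*.pack /srv/git/tmp-core/.git/objects/pack/*.pack
  # A clone onto another partition (e.g. the workspace) cannot hardlink and copies everything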

Without a cache, a small extension takes about 10 seconds:

00:00:59.801 INFO:zuul.Cloner.mediawiki/extensions/SandboxLink:Creating repo mediawiki/extensions/SandboxLink from upstream https://gerrit.wikimedia.org/r/mediawiki/extensions/SandboxLink
00:01:09.078 INFO:zuul.Cloner.mediawiki/extensions/SandboxLink:Prepared mediawiki/extensions/SandboxLink repo with branch master at commit 65990d646508d0e0ea8350e14fe739e8ddff2c54

Another "optimization" was to run the clones in parallel and Quibble processes up to 8 clones (via quibble --git-parallel=8). I picked that one arbitrarily, then I the idea was to also avoid hammering the git-daemon running on the zuul merger instances.

Then we benefited from git protocol v2 (Blog Post: Faster source code fetches thanks to git protocol version 2), which notably speeds up fetches. For the WMCS instances it is probably not a big factor since they have plenty of network bandwidth to Gerrit, though it still helps.
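For reference, protocol v2 can be enabled and verified with plain git (standard configuration keys and trace variables, nothing CI-specific assumed):

  # Protocol v2 is the default since git 2.26; it can be pinned explicitly
  git config --global protocol.version 2
  # Confirm which protocol is negotiated by inspecting the on-the-wire packets
  GIT_TRACE_PACKET=1 git ls-remote https://gerrit.wikimedia.org/r/mediawiki/core 2>&1 | head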

We clone 62 repositories ...

There are a few ideas:

  • add more git repositories to the local cache (see profile::ci::gitcache in Puppet); disk space usage on the WMCS instances would need to be taken into account
  • identify the bottleneck(s) in the git clone/fetch (see the rough triage sketch after this list); is it:
    • network bound: the local cache would help
    • CPU bound: due to delta computation; I have no idea how exactly that works or whether it can be optimized
    • disk bound: the repositories are ultimately written to Ceph, which can add some overhead; notably I/O is throttled by default, and we have custom VM flavors with elevated I/O limits
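A rough way to tell the three cases apart is to time a clone while watching CPU and disk from another shell (a hedged sketch; it assumes the sysstat tools are available on the agent and uses a throwaway destination path):

  # In one terminal: a representative clone, timed
  time git clone https://gerrit.wikimedia.org/r/mediawiki/core /tmp/clone-timing-test
  # In another terminal while it runs:
  vmstat 1     # sustained high us/sy CPU with idle disk -> CPU bound (delta resolution)
  iostat -x 1  # high %util / await on the Ceph-backed device -> disk bound
  # low CPU and low disk activity but a long wall-clock time -> likely network bound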

So I guess we could use some profiling, which can be done with GIT_TRACE_PERFORMANCE=1 or the newer GIT_TRACE2_PERF=1 (doc at https://git-scm.com/docs/api-trace2), though with parallel cloning the output is going to be unreadable :D
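For a single repository the trace is easy to capture and skim (standard git environment variables; the paths are just examples):

  # Write a trace2 performance log to a file instead of stderr
  GIT_TRACE2_PERF=/tmp/git-clone.perf git clone https://gerrit.wikimedia.org/r/mediawiki/core /tmp/core
  # Each timed region (negotiation, index-pack, checkout, ...) shows up as region_enter/region_leave pairs
  grep region_leave /tmp/git-clone.perf | head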

For the CPU-bound case, I guess there can be contention on the Jenkins agents since we run up to 4 builds in parallel (each using 1+ CPU) and the CPUs are virtual ones shared with other VMs running on the same OpenStack compute node. Potentially we could double the number of agents and reduce the number of builds to 2 per host, or head back to one build per instance, but there is overhead in having that many agents.

For the disk I/O, it was suggested to move the entirety of the load to tmpfs, i.e. run the Quibble container with --tmpfs /workspace/src:size=16G as we already do for the MySQL database (--tmpfs /workspace/db:size=320M), potentially with extra options since we don't care about losing the data. That would mean instances with a vastly larger amount of memory.
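Put together, the container invocation would look roughly like this (a sketch only: the image name is a placeholder and the other arguments are elided; only the two --tmpfs options come from the suggestion above):

  # Hypothetical Quibble run with both the source tree and the database on tmpfs
  docker run --rm \
    --tmpfs /workspace/src:size=16G \
    --tmpfs /workspace/db:size=320M \
    <quibble-image> ...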

Another idea was to have a first job which prepares all the repositories (clone / fetch from Gerrit / fetch and check out the Zuul patches), expose the result somehow (maybe via rsync), and then have all the other jobs simply fetch the whole prepared tree. That would save a lot of overhead, since currently each job ends up doing the same git operations when they could be done once. But that would require a lot of refactoring in the Zuul layout, and I am not even sure we can achieve that.
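On the consuming jobs, the clone step would then collapse to something like this (purely illustrative; the rsync host and module name are made up, nothing like this exists today):

  # Hypothetical: replace the 62 per-job git clones with one bulk copy of the prepared tree
  rsync -a --delete rsync://prep-host.example/prepared-workspace/ /workspace/src/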

So I guess the low-hanging fruit is:

  • add repositories to the local cache (after checking the current disk space usage and the remaining space available on the disk)
  • run Quibble with git performance tracing (docker run -e GIT_TRACE2_PERF=/log/git.perf ..., with the trace ending up attached as an artifact of the Jenkins build)

Then, based on the performance traces, identify potential solutions.
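For that tracing run, the wiring could look roughly like this (a sketch; the /log bind mount and the image placeholder are assumptions, only the GIT_TRACE2_PERF path comes from the item above):

  # Hypothetical: write the git trace2 log into the directory that Jenkins archives as artifacts
  docker run --rm \
    -e GIT_TRACE2_PERF=/log/git.perf \
    -v "$(pwd)/log:/log" \
    <quibble-image> ...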