Evaluate different strategies for Docker CI instances
Closed, Resolved · Public

Description

We're currently using 23 m1.medium instances as Jenkins agents to run Docker based jobs, each with a single executor. There are a few potential drawbacks to this configuration and I'd like to explore alternatives.

  1. Docker image caches are ineffective when spread out over so many machines. While we shouldn't rely on Docker caching for basic functionality, I think it will be an increasingly important optimization in our CI stack as we run more container-based jobs and see further pipeline adoption. Note that this will be a larger issue for jobs that build images with a high degree of possible variance, i.e. pipeline jobs that use Blubber.
  2. We control the way these containers are executed (jobs using Blubber will run non-root within the container), so running each job in its own OpenStack VM seems like unnecessary overhead without adding much security benefit.
  3. Limiting even an m1.medium to a single executor seems like it wouldn't fully utilize the instance's vCPUs. Looking at grafana-labs, most integration-slave-docker-* instances don't often peak above 50% user CPU utilization, and the cluster's average utilization is also quite low.

One simple alternative might be to migrate to larger instance types (m1.xlarge) and increase the number of executors on each node's agent to something more aggressive (e.g. num_cpus * 1.25, though we'd likely want to experiment).
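As a rough sketch of that heuristic (the 1.25 multiplier is only the starting point suggested above, and the function name is illustrative, not an existing script):

```python
import math

def suggested_executors(num_cpus: int, multiplier: float = 1.25) -> int:
    """Rough starting point for Jenkins executors per node: slightly
    oversubscribe vCPUs, since container builds are often I/O-bound.
    The multiplier would need tuning through experiments like the
    ones below."""
    return max(1, math.floor(num_cpus * multiplier))

# e.g. an m1.xlarge with 8 vCPUs
print(suggested_executors(8))  # 10
```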

I'd like to hear thoughts from others on this. I vaguely remember a reason for the single executor setup, but does this reason still apply when doing container based builds? Can we solve the problem in other ways?

dduvall created this task. Aug 17 2018, 7:05 PM
dduvall claimed this task.Sep 13 2018, 6:48 PM
dduvall triaged this task as Normal priority.

As of yesterday, we have the following configurations for m4executor nodes:

| Instance type | executors (n) | max vcpu (c) | max mem (G) (m) | min vcpu (c/n) | min mem (G) (m/n) | instances (i) | total executors (n*i) |
|---|---|---|---|---|---|---|---|
| m1.medium | 1 | 2 | 4 | 2 | 4 | 14 | 14 |
| m1.xlarge | 4 | 8 | 16 | 2 | 4 | 2 | 8 |
| bigram | 4 | 8 | 36 | 2 | 9 | 2 | 8 |
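The derived columns above are just the instance resources divided across executors; a quick sketch of that arithmetic (the function name is illustrative, and the specs are transcribed from the table):

```python
def per_executor(n: int, vcpus: int, mem_gb: int, instances: int) -> dict:
    """Derive the guaranteed per-executor minimums (when all
    executors on a node are busy) and the pool's total executor
    count, as shown in the table above."""
    return {
        "min_vcpu": round(vcpus / n, 2),
        "min_mem_gb": round(mem_gb / n, 2),
        "total_executors": n * instances,
    }

# bigram: 4 executors, 8 vCPUs, 36 G memory, 2 instances
print(per_executor(4, 8, 36, 2))
```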

I'm hoping that a full day's worth of builds is enough to compare mean durations later today, but we'll see.

> I'm hoping that a full day's worth of builds is enough to compare mean durations later today, but we'll see.

Results are in for all of 2018-09-13 UTC.

| Instance type | executors (n) | max vcpu (c) | max mem (G) (m) | min vcpu (c/n) | min mem (G) (m/n) | instances (i) | total executors (n*i) | build duration mean (min) | build duration SD (min) |
|---|---|---|---|---|---|---|---|---|---|
| m1.medium | 1 | 2 | 4 | 2 | 4 | 14 | 14 | 16.29 | 8.28 |
| m1.xlarge | 4 | 8 | 16 | 2 | 4 | 2 | 8 | 10.41 | 4.76 |
| bigram | 4 | 8 | 36 | 2 | 9 | 2 | 8 | 9.60 | 3.03 |

See the spreadsheet for collection method, data sample, hourly mean, and pretty charts.
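The per-node-type summary could be reproduced with something like the following (the input shape and function name are assumptions for illustration; the actual collection method lives with the spreadsheet):

```python
import statistics
from collections import defaultdict

def summarize(builds):
    """builds: iterable of (node_type, duration_min) tuples.
    Returns {node_type: (mean, sample stdev)} of build durations,
    rounded to two places as in the tables above."""
    by_type = defaultdict(list)
    for node_type, duration in builds:
        by_type[node_type].append(duration)
    return {
        t: (round(statistics.mean(d), 2), round(statistics.stdev(d), 2))
        for t, d in by_type.items()
        if len(d) > 1  # stdev needs at least two samples
    }

# hypothetical sample of build records
sample = [("bigram", 8.0), ("bigram", 11.2),
          ("m1.medium", 14.0), ("m1.medium", 18.6)]
print(summarize(sample))
```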

I've reconfigured one of the bigram instances (integration-slave-docker-1034) to use 6 executors. Assuming it remains stable, let's leave it running with 6 at least until 9/19 0000 UTC to collect stats and compare performance against the one with 4.

dduvall added a comment. Edited Sep 20 2018, 10:41 PM

> I've reconfigured one of the bigram instances (integration-slave-docker-1034) to use 6 executors. Assuming it remains stable, let's leave it running with 6 at least until 9/19 0000 UTC to collect stats and compare performance against the one with 4.

Results for builds from period 2018-09-18T1200Z - 2018-09-20T2100Z.

See the spreadsheet for details. This spreadsheet contains a new sheet for standard deviations across all sampled jobs and node types, helpful in identifying and considering cross-sections where durations most vary. It should be noted, however, that the overall means by node type show a similar pattern to means when filtering for the jobs with least variance, so taking the mean overall still appears to be the best measurement available.

| Instance type | executors (n) | max vcpu (c) | max mem (G) (m) | min vcpu (c/n) | min mem (G) (m/n) | instances (i) | total executors (n*i) | build duration mean (min) | build duration SD (min) | cpu (system,user) %99 |
|---|---|---|---|---|---|---|---|---|---|---|
| m1.medium | 1 | 2 | 4 | 2 | 4 | 14 | 14 | 14.15 | 7.56 | unknown |
| m1.xlarge | 4 | 8 | 16 | 2 | 4 | 2 | 8 | 10.34 | 5.29 | 53-55% |
| bigram-4 | 4 | 8 | 36 | 2 | 9 | 1 | 4 | 8.98 | 3.44 | 50% |
| bigram-6 | 6 | 8 | 36 | 1.33 | 6 | 1 | 6 | 8.38 | 3.33 | 52% |

This report shows no relevant difference in performance between the bigram configured with 4 executors and the one configured with 6. Taking a peek at the 99th percentile for cpu % (system,user) for integration-slave-docker-1034 (the bigram-6) during this same period shows that we might be able to tune the executors higher still.
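The 99th-percentile check is just a percentile over the sampled CPU readings; a sketch using the nearest-rank method (the readings here are made up for illustration; the real data comes from Grafana):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples, e.g. p=99 for
    the 99th percentile of CPU (system+user) % readings."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# hypothetical CPU % samples for one node over the sample period
cpu_readings = [35, 40, 42, 48, 50, 52, 52, 53, 54, 71]
print(percentile(cpu_readings, 99))  # 71
```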

Now that the integration project quota has been increased, I'll start up a few more bigrams, configure a couple with 7 executors, collect more data in the coming days, and continue to report here.

Mentioned in SAL (#wikimedia-releng) [2018-09-20T23:20:13Z] <marxarelli> taking integration-slave-docker-1004/1005 offline for replacement (T202160)

dduvall added a comment. Edited Sep 24 2018, 11:59 PM

> Now that the integration project quota has been increased, I'll start up a few more bigrams, configure a couple with 7 executors, collect more data in the coming days, and continue to report here.

Results for builds between 2018-09-24T00:00:00Z and 2018-09-24T23:20:00Z that include two new bigram instances w/ 7 executors. As with the other tables, see the spreadsheet for a detailed report.

| Instance type | executors (n) | max vcpu (c) | max mem (G) (m) | min vcpu (c/n) | min mem (G) (m/n) | instances (i) | total executors (n*i) | build duration mean (min) | build duration SD (min) | cpu (system,user) %99 |
|---|---|---|---|---|---|---|---|---|---|---|
| m1.medium | 1 | 2 | 4 | 2 | 4 | 14 | 14 | 14.87 | 7.96 | unknown |
| m1.xlarge | 4 | 8 | 16 | 2 | 4 | 2 | 8 | 11.81 | 6.21 | 60-67% |
| bigram-4 | 4 | 8 | 36 | 2 | 9 | 1 | 4 | 8.01 | 2.90 | 50% |
| bigram-6 | 6 | 8 | 36 | 1.33 | 6 | 1 | 6 | 8.40 | 3.34 | 48% |
| bigram-7 | 7 | 8 | 36 | 1.14 | 5.14 | 2 | 14 | 10.37 | 4.54 | 71% |

This new report includes stats for the two new bigram-7 instances put into the pool, each a bigram instance configured with 7 Jenkins executors. It also includes a new breakdown by percentile for each node type, showing the types performing very similarly for builds below the 20th percentile in duration, then diverging at the 40th percentile of mean duration and above. At the 80th percentile, the performance gains of all bigram instance configurations are largest. At the 90th percentile, both bigram-6 and bigram-7 start to perform somewhat worse on average.
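That percentile breakdown amounts to comparing duration distributions at fixed percentile points; a sketch, again using the nearest-rank method (the durations are made-up illustrative values; the real breakdown is in the spreadsheet):

```python
import math

def duration_percentiles(durations, points=(20, 40, 80, 90)):
    """Nearest-rank percentiles of one node type's build durations
    (minutes), for comparing distributions across configurations."""
    ordered = sorted(durations)
    out = {}
    for p in points:
        rank = math.ceil(p / 100 * len(ordered))
        out[p] = ordered[max(rank - 1, 0)]
    return out

# hypothetical build durations (minutes) for one node type
bigram_6 = [5.1, 6.0, 6.8, 7.2, 8.4, 9.1, 10.5, 12.0, 14.3, 21.0]
print(duration_percentiles(bigram_6))
```

Comparing these dictionaries across node types makes the tail behavior visible in a way a single mean hides, which is why the 90th-percentile divergence above stands out.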

In talking this report over with the rest of Release Engineering this morning, I think we'll conclude this experiment for the moment and migrate all m4executor nodes to bigram instances configured with 6 executors. The data shows them to be comparable to the bigram-4 configuration in all but the very highest percentile of mean build durations, and they provide 50% more executor capacity.

We should think about repeating this kind of analysis each month as CI conditions change. It only takes 5-10 minutes to compile a report using the script saved with each report, and the stats have been an invaluable source with which to inform this configuration change. With so many moving parts in CI these days, I think it's prudent we go by the numbers.