Evaluate different strategies for Docker CI instances
Closed, Resolved · Public

Description

We're currently using 23 m1.medium instances as Jenkins agents to run Docker based jobs, each with a single executor. There are a few potential drawbacks to this configuration and I'd like to explore alternatives.

  1. Docker image caches are ineffective when spread out over so many machines. While we shouldn't rely on Docker caching for basic functionality, I think it will be an increasingly important optimization in our CI stack as we run more container-based jobs and see further pipeline adoption. Note that this will be a larger issue for jobs that build images with a high degree of possible variance, i.e. pipeline jobs that use Blubber.
  2. We control the way these containers are executed (jobs using Blubber will run non-root within the container), so running each job in its own OpenStack VM seems like unnecessary overhead without adding much security benefit.
  3. Limiting even an m1.medium to a single executor seems like it wouldn't fully utilize the instance's vCPUs. Looking at grafana-labs, most integration-slave-docker-* instances don't often peak above 50% user CPU utilization, and the cluster's average utilization is also quite low.

One simple alternative might be to migrate to larger instance types (m1.xlarge) and increase the number of executors on each node's agent to something more aggressive (e.g. num_cpus * 1.25, though we'd likely want to experiment).
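As a rough sketch of that heuristic (the 1.25 multiplier is only the starting point suggested above, and the function name is illustrative, not an existing script):

```python
import math

def suggested_executors(num_cpus: int, multiplier: float = 1.25) -> int:
    """Rough starting point for Jenkins executors per node: slightly
    oversubscribe vCPUs, since container builds are often I/O-bound.
    The multiplier would need tuning through experiments like the
    ones below."""
    return max(1, math.floor(num_cpus * multiplier))

# e.g. an m1.xlarge with 8 vCPUs
print(suggested_executors(8))  # 10
```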

I'd like to hear thoughts from others on this. I vaguely remember a reason for the single executor setup, but does this reason still apply when doing container based builds? Can we solve the problem in other ways?

dduvall created this task. Aug 17 2018, 7:05 PM
dduvall claimed this task.Sep 13 2018, 6:48 PM
dduvall triaged this task as Normal priority.

As of yesterday, we have the following configurations for m4executor nodes:

| Instance type | executors (n) | max vcpu (c) | max mem (G) (m) | min vcpu (c/n) | min mem (G) (m/n) | instances (i) | total executors (n*i) |
|---|---|---|---|---|---|---|---|
| m1.medium | 1 | 2 | 4 | 2 | 4 | 14 | 14 |
| m1.xlarge | 4 | 8 | 16 | 2 | 4 | 2 | 8 |
| bigram | 4 | 8 | 36 | 2 | 9 | 2 | 8 |
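The derived columns above are just the instance resources divided across executors; a quick sketch of that arithmetic (the function name is illustrative, and the specs are transcribed from the table):

```python
def per_executor(n: int, vcpus: int, mem_gb: int, instances: int) -> dict:
    """Derive the guaranteed per-executor minimums (when all
    executors on a node are busy) and the pool's total executor
    count, as shown in the table above."""
    return {
        "min_vcpu": round(vcpus / n, 2),
        "min_mem_gb": round(mem_gb / n, 2),
        "total_executors": n * instances,
    }

# bigram: 4 executors, 8 vCPUs, 36 G memory, 2 instances
print(per_executor(4, 8, 36, 2))
```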

I'm hoping that a full day's worth of builds is enough to compare mean durations later today, but we'll see.

> I'm hoping that a full day's worth of builds is enough to compare mean durations later today, but we'll see.

Results are in for all of 2018-09-13 UTC.

| Instance type | executors (n) | max vcpu (c) | max mem (G) (m) | min vcpu (c/n) | min mem (G) (m/n) | instances (i) | total executors (n*i) | build duration mean (min) | build duration SD (min) |
|---|---|---|---|---|---|---|---|---|---|
| m1.medium | 1 | 2 | 4 | 2 | 4 | 14 | 14 | 16.29 | 8.28 |
| m1.xlarge | 4 | 8 | 16 | 2 | 4 | 2 | 8 | 10.41 | 4.76 |
| bigram | 4 | 8 | 36 | 2 | 9 | 2 | 8 | 9.60 | 3.03 |

See the spreadsheet for collection method, data sample, hourly mean, and pretty charts.
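The per-node-type summary could be reproduced with something like the following (the input shape and function name are assumptions for illustration; the actual collection method lives with the spreadsheet):

```python
import statistics
from collections import defaultdict

def summarize(builds):
    """builds: iterable of (node_type, duration_min) tuples.
    Returns {node_type: (mean, sample stdev)} of build durations,
    rounded to two places as in the tables above."""
    by_type = defaultdict(list)
    for node_type, duration in builds:
        by_type[node_type].append(duration)
    return {
        t: (round(statistics.mean(d), 2), round(statistics.stdev(d), 2))
        for t, d in by_type.items()
        if len(d) > 1  # stdev needs at least two samples
    }

# hypothetical sample of build records
sample = [("bigram", 8.0), ("bigram", 11.2),
          ("m1.medium", 14.0), ("m1.medium", 18.6)]
print(summarize(sample))
```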

I've reconfigured one of the bigram instances (integration-slave-docker-1034) to use 6 executors. Assuming it remains stable, let's leave it running with 6 at least until 9/19 0000 UTC to collect stats and compare performance against the one with 4.

dduvall added a comment. Edited Sep 20 2018, 10:41 PM

> I've reconfigured one of the bigram instances (integration-slave-docker-1034) to use 6 executors. Assuming it remains stable, let's leave it running with 6 at least until 9/19 0000 UTC to collect stats and compare performance against the one with 4.

Results for builds from period 2018-09-18T1200Z - 2018-09-20T2100Z.

See the spreadsheet for details. This spreadsheet contains a new sheet for standard deviations across all sampled jobs and node types, helpful in identifying and considering cross-sections where durations most vary. It should be noted, however, that the overall means by node type show a similar pattern to means when filtering for the jobs with least variance, so taking the mean overall still appears to be the best measurement available.

| Instance type | executors (n) | max vcpu (c) | max mem (G) (m) | min vcpu (c/n) | min mem (G) (m/n) | instances (i) | total executors (n*i) | build duration mean (min) | build duration SD (min) | cpu (system,user) %99 |
|---|---|---|---|---|---|---|---|---|---|---|
| m1.medium | 1 | 2 | 4 | 2 | 4 | 14 | 14 | 14.15 | 7.56 | unknown |
| m1.xlarge | 4 | 8 | 16 | 2 | 4 | 2 | 8 | 10.34 | 5.29 | 53-55% |
| bigram-4 | 4 | 8 | 36 | 2 | 9 | 1 | 4 | 8.98 | 3.44 | 50% |
| bigram-6 | 6 | 8 | 36 | 1.33 | 6 | 1 | 6 | 8.38 | 3.33 | 52% |

This report shows no relevant difference in performance between the bigram configured with 4 executors and the one configured with 6. Taking a peek at the 99th percentile for cpu % (system,user) for integration-slave-docker-1034 (the bigram-6) during this same period shows that we might be able to tune the executors higher still.
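The 99th-percentile check is just a percentile over the sampled CPU readings; a sketch using the nearest-rank method (the readings here are made up for illustration; the real data comes from Grafana):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples, e.g. p=99 for
    the 99th percentile of CPU (system+user) % readings."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# hypothetical CPU % samples for one node over the sample period
cpu_readings = [35, 40, 42, 48, 50, 52, 52, 53, 54, 71]
print(percentile(cpu_readings, 99))  # 71
```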

Now that the integration project quota has been increased, I'll start up a few more bigrams, configure a couple with 7 executors, collect more data in the coming days, and continue to report here.

Mentioned in SAL (#wikimedia-releng) [2018-09-20T23:20:13Z] <marxarelli> taking integration-slave-docker-1004/1005 offline for replacement (T202160)

dduvall added a comment. Edited Sep 24 2018, 11:59 PM

> Now that the integration project quota has been increased, I'll start up a few more bigrams, configure a couple with 7 executors, collect more data in the coming days, and continue to report here.

Results for builds between 2018-09-24T00:00:00Z and 2018-09-24T23:20:00Z that include two new bigram instances w/ 7 executors. As with the other tables, see the spreadsheet for a detailed report.

| Instance type | executors (n) | max vcpu (c) | max mem (G) (m) | min vcpu (c/n) | min mem (G) (m/n) | instances (i) | total executors (n*i) | build duration mean (min) | build duration SD (min) | cpu (system,user) %99 |
|---|---|---|---|---|---|---|---|---|---|---|
| m1.medium | 1 | 2 | 4 | 2 | 4 | 14 | 14 | 14.87 | 7.96 | unknown |
| m1.xlarge | 4 | 8 | 16 | 2 | 4 | 2 | 8 | 11.81 | 6.21 | 60-67% |
| bigram-4 | 4 | 8 | 36 | 2 | 9 | 1 | 4 | 8.01 | 2.90 | 50% |
| bigram-6 | 6 | 8 | 36 | 1.33 | 6 | 1 | 6 | 8.40 | 3.34 | 48% |
| bigram-7 | 7 | 8 | 36 | 1.14 | 5.14 | 2 | 14 | 10.37 | 4.54 | 71% |

This new report includes stats for the two new bigram-7 instances put into the pool, each a bigram instance configured with 7 Jenkins executors. It also includes a new breakdown by percentile for each node type, showing the types performing very similarly for builds below the 20th percentile in duration, then diverging at the 40th percentile of mean duration and above. At the 80th percentile, the performance gains of all bigram instance configurations are largest. At the 90th percentile, both bigram-6 and bigram-7 start to perform somewhat worse on average.
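That percentile breakdown amounts to comparing duration distributions at fixed percentile points; a sketch, again using the nearest-rank method (the durations are made-up illustrative values; the real breakdown is in the spreadsheet):

```python
import math

def duration_percentiles(durations, points=(20, 40, 80, 90)):
    """Nearest-rank percentiles of one node type's build durations
    (minutes), for comparing distributions across configurations."""
    ordered = sorted(durations)
    out = {}
    for p in points:
        rank = math.ceil(p / 100 * len(ordered))
        out[p] = ordered[max(rank - 1, 0)]
    return out

# hypothetical build durations (minutes) for one node type
bigram_6 = [5.1, 6.0, 6.8, 7.2, 8.4, 9.1, 10.5, 12.0, 14.3, 21.0]
print(duration_percentiles(bigram_6))
```

Comparing these dictionaries across node types makes the tail behavior visible in a way a single mean hides, which is why the 90th-percentile divergence above stands out.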

In talking this report over with the rest of Release Engineering this morning, I think we'll conclude this experiment for the moment and migrate all m4executor nodes to bigram instances configured with 6 executors. The data shows them to be comparable to the bigram-4 configuration in all but the very highest percentile of mean build durations, and they provide 50% more executor capacity.

We should think about repeating this kind of analysis each month as CI conditions change. It only takes 5-10 minutes to compile a report using the script saved with each report, and the stats have been an invaluable source with which to inform this configuration change. With so many moving parts in CI these days, I think it's prudent we go by the numbers.