
Add some more m4executor docker slaves for Jenkins
Closed, Resolved · Public

Description

We're now in a position where two mediawiki/core patches in gate-and-submit will use up all of the m4executor docker slaves. I think adding 3-5 more would be a good idea.

Also, I don't think there's much use for the smaller instances that have two Docker slots; those would be better used as m4executor instances.

If Cloud Services capacity is an issue, maybe we can give up some of the nodepool capacity in favor of more docker instances?

Event Timeline

Legoktm created this task.

> maybe we can give up some of the nodepool capacity in favor of more docker instances

Or even "most of"? Time to start punishing code that's not switched over?

Or even "most of"? Time to start punishing code that's not switched over?

Or rather, time to stop punishing code that has switched over?

In numbers:

Indeed, adding more Docker executors would be nice :) The delays are getting worse now that we've switched most things to Docker.

I added 3 new m4executor instances today, following the instructions at https://www.mediawiki.org/wiki/Continuous_integration/Docker#Jenkins_Agent_Creation

That currently puts the integration project at capacity. It's only at capacity due to the number of instances (not CPU or memory).

We currently have 4 m1executors. I think we can probably swap 2 of those for 2 m4executors.

> We currently have 4 m1executors. I think we can probably swap 2 of those for 2 m4executors.

Sounds good. We could probably switch everything to just use m4executors and kill off the m1executors entirely? That would simplify everything and not really waste resources.

dduvall subscribed.

I filed a related task this past Friday, T202160: Evaluate different strategy for Docker CI instances, which has to do with consolidating executors onto larger instance types. After discussing it in the RelEng weekly meeting, it seems so closely related to this one that I might as well lick both cookies.

The current plan is:

  1. Replace one or two ci1.medium instances with m1.xlarge instances. The latter has 4x the vCPU and memory of an m1.medium, so:
  2. Allocate 4-5 Jenkins executors to the m1.xlarge instances. Using 5 executors would result in a slightly lower mem:executor ratio and more executors per vCPU, which should be acceptable given the low vCPU utilization noted in the related task (see the quick calculation after this list).
  3. Pool the nodes as m4executors and let jobs be scheduled/run normally for a week or so.
  4. Compare job execution time and resource utilization of the m1.xlarge instances with that of the m1.mediums. (Is there an easy way to see mean execution time of jobs by the labels of the nodes they ran on?)
  5. Adjust executor numbers accordingly.
  6. If it makes sense to do so, request a different flavor from Cloud Services that will give us ratios of vcpu/memory more congruent with load.
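
As a quick sanity check on the ratios in step 2, here is a back-of-the-envelope sketch. The flavor sizes are assumptions derived from the "4x an m1.medium" figure above (i.e. 2 vCPU / 4 GB for an m1.medium and 8 vCPU / 16 GB for an m1.xlarge):

  # Per-executor vCPU and memory for the candidate configurations.
  # Flavor sizes below are assumptions, not authoritative OpenStack values.
  flavors = {
      'm1.medium': {'vcpu': 2, 'mem_gb': 4},
      'm1.xlarge': {'vcpu': 8, 'mem_gb': 16},
  }

  def per_executor(flavor, executors):
      f = flavors[flavor]
      return f['vcpu'] / executors, f['mem_gb'] / executors

  print(per_executor('m1.medium', 1))  # (2.0, 4.0)
  print(per_executor('m1.xlarge', 4))  # (2.0, 4.0) -> same per-executor resources as an m1.medium
  print(per_executor('m1.xlarge', 5))  # (1.6, 3.2) -> slightly leaner per executor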

An m1.xlarge instance is now up and the node is set up with 5 executors.

A bigmem instance is up, and the node is configured with 8 executors.

Hopefully a week or two will give us enough data to compare these different configurations.

| instance type | executors (n) | vcpu/n | mem (G)/n |
| m1.medium     | 1             | 2      | 4         |
| m1.xlarge     | 5             | 1.6    | 3.2       |
| bigmem        | 8             | 1      | 4.5       |

I'm looking into how best to collect useful metrics. Currently I'm thinking we'll want to look at resource utilization for the instances, but also at job execution time grouped by (job name, zuul project, instance type).
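
A minimal sketch of the aggregation I have in mind (the record layout here is hypothetical; the per-build data would still have to come from Jenkins/Zuul):

  # Hypothetical record layout: one dict per build with the three grouping
  # keys plus a duration. This only illustrates the grouping/averaging step.
  from collections import defaultdict
  from statistics import mean

  builds = [
      # {'job': 'mediawiki-quibble-vendor-mysql-hhvm-docker',
      #  'project': 'mediawiki/core', 'instance_type': 'bigmem', 'duration_s': 724},
  ]

  durations = defaultdict(list)
  for b in builds:
      durations[(b['job'], b['project'], b['instance_type'])].append(b['duration_s'])

  for key, values in sorted(durations.items()):
      print(key, round(mean(values), 1))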

Change 454222 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] nodepool: reduce number of instances

https://gerrit.wikimedia.org/r/454222

Change 454222 merged by Andrew Bogott:
[operations/puppet@production] nodepool: reduce number of instances

https://gerrit.wikimedia.org/r/454222

Change 455269 had a related patch set uploaded (by Dduvall; owner: Dduvall):
[integration/config@master] Statsd publisher that sends job/node metrics to statsd.eqiad.wmnet

https://gerrit.wikimedia.org/r/455269

Change 455269 merged by jenkins-bot:
[integration/config@master] Publish job duration for labeled nodes to labmon1001

https://gerrit.wikimedia.org/r/455269
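
(For context, the mechanism behind such a publisher is just a statsd timing metric sent over UDP. A minimal sketch of that idea follows; the metric name is illustrative and not necessarily what the integration/config change actually emits:)

  # Illustrative sketch only: a statsd timing metric is a UDP datagram of the
  # form "<name>:<value>|ms". The metric name below is a placeholder.
  import socket

  def publish_job_duration(job, node_label, duration_ms,
                           host='statsd.eqiad.wmnet', port=8125):
      metric = 'jenkins.{}.{}.duration:{}|ms'.format(node_label, job, int(duration_ms))
      with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
          sock.sendto(metric.encode('utf-8'), (host, port))

  # e.g. publish_job_duration('mediawiki-quibble-vendor-mysql-hhvm-docker', 'm4executor', 724000)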

Mentioned in SAL (#wikimedia-releng) [2018-09-07T09:30:11Z] <hashar> integration-slave-docker-1025 lower number of executors from 5 to 4. 8 CPUS can not sustain 5 concurrent Quibble builds | T201972

integration-slave-docker-1025 had 5 Quibble jobs running in parallel and that slowed down the builds; it seems MySQL eats more CPU than expected. I have lowered the number of executors to 4.

On Friday I removed the integration-slave-docker-1026 node since it was constantly running out of disk space and then self-recovering (concurrently running containers eating into /).

I replaced it with integration-slave-docker-1027, which is an m1.xlarge with 4 executors (the same as integration-slave-docker-1025); per executor, its resources are comparable to those of an m1.medium with 1 executor. Since we are currently at our provisioning capacity in terms of instance count, swapping in an m1.xlarge (1027) gains more capacity without using another instance.

| instance type | executors (n) | vcpu/n | mem (G)/n |
| m1.medium     | 1             | 2      | 4         |
| m1.xlarge     | 5             | 1.6    | 3.2       |
| bigmem        | 8             | 1      | 4.5       |

The statsd publisher was kind of a bust: it isn't collecting enough data, it lacks useful metadata by which to filter results, and refactoring the code each time we want to change the collection/segmentation method seems too cumbersome. I'll be rolling it back today.

However, I was able to collect some useful information for the mediawiki-quibble-vendor-mysql-hhvm-docker job by querying the Jenkins JSON API and populating a spreadsheet (a win for '90s tech).

https://docs.google.com/spreadsheets/d/1EIIqH5thi7Q7CNgwS-_xB1c-RVTCBfIq4TqOaxWxy8E/edit?usp=sharing
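
Roughly, the data pull looks like the following. This is a sketch rather than the exact script; the Jenkins base URL and the field selection are assumptions:

  # Sketch: pull recent builds of one job from the Jenkins JSON API and
  # write number/result/duration/node to CSV for the spreadsheet.
  import csv
  import requests

  JENKINS = 'https://integration.wikimedia.org/ci'  # assumed base URL
  JOB = 'mediawiki-quibble-vendor-mysql-hhvm-docker'

  url = '{}/job/{}/api/json?tree=builds[number,result,duration,builtOn]'.format(JENKINS, JOB)
  builds = requests.get(url).json()['builds']

  with open('build-durations.csv', 'w', newline='') as f:
      writer = csv.writer(f)
      writer.writerow(['build', 'result', 'duration_ms', 'node'])
      for b in builds:
          writer.writerow([b['number'], b['result'], b['duration'], b.get('builtOn', '')])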

These preliminary results show a much lower average build duration for the bigmem configuration in the above matrix than for the other two configurations. However, the recent disk-full failures (see T202457: mediawiki-quibble docker jobs fails due to disk full), which mostly (though not exclusively) affected the bigmem instance, may have skewed these results: the spreadsheet does consider only successful builds, but when a node is taken offline due to a full disk, the jobs still running on it finish with drastically more resources than they would have had while contending with newly scheduled builds, giving them quite an edge.

Despite the possibility of skewed results, I think repeating this duration comparison between different node configurations after we solve the disk-full issue would be worthwhile.

T202457: mediawiki-quibble docker jobs fails due to disk full should be less of an issue now: the Quibble-based jobs clear out src, which helps reduce disk pressure. If adding more pressure on the WMCS infrastructure is not a concern, we can add more big slaves.

The full-disk issues have been resolved by giving Docker its own big chunk of the LVM volume group for images and running containers (see T203841: Provide dedicated storage space to Docker for images/containers) and by @hashar's improvements to Quibble's workspace cleanup. As a result, we were able to spin up a few more large instance nodes yesterday: 2 xlarge instances (1 replacing a failed node) and 2 bigram instances, each configured with 4 executors.

Based on current trends in Grafana, it seems there's enough unallocated memory on average to increase the number of executors on the bigram instances. In any case, the current effective number of m4executors is 30, which should provide a big boost in capacity for now.

(BTW, an easy way to see this now that we have different configurations is to use the Groovy console:)

// Sum the configured executors across all nodes labeled 'm4executor'
Jenkins.instance.nodes.findAll { it.assignedLabels.collect { it.toString() }.contains('m4executor') }.collect { it.numExecutors }.sum()

I'm going to continue to evaluate performance statistics on the new instance types, but I'll do that in T202160: Evaluate different strategy for Docker CI instances so as not to keep spamming this task.