
Puppet catalog compiler - increasing max concurrent jobs
Closed, Resolved · Public · 0 Estimated Story Points

Description

We've seen a few times that when 2 compiler jobs are running at the same time, a 3rd request will sit queued, sometimes for a long time, until one of the running jobs completes. It seems Jenkins currently dispatches only 1 puppet catalog compiler job at a time per worker node.

Could we increase this to 2 concurrent jobs per worker? Or is it necessary to add worker nodes to accomplish this?
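
For reference, a rough way to check how many executors each worker currently offers is the Jenkins JSON API (sketch only; assumes anonymous read access and that jq is available):

curl -s 'https://integration.wikimedia.org/ci/computer/api/json?tree=computer[displayName,numExecutors]' \
  | jq -r '.computer[] | "\(.displayName): \(.numExecutors) executor(s)"'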

Event Timeline

Indeed there are just two m1.large instances, which have:

4 vCPUs
8 GB RAM
80 GB disk

I would assume the puppet compiler to be CPU bound, and iirc it uses a minimum of two parallel processes per run (one for the production branch and another one for the current change). When compiling for multiple hosts the job takes longer and keeps the slot busy for a little while.

Adding more executors would let concurrent builds start faster, but they would still be racing for CPU. So potentially either add more vCPUs or, most probably, just add a couple more instances :-] It is probably just a matter of spawning two new ones in Horizon, applying the puppet class and then attaching them to the Jenkins master.
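
As a rough spot check (assuming shell access to a compiler instance), one could confirm the CPU contention while two builds overlap:

nproc                                  # vCPUs available (4 on m1.large)
uptime                                 # load average versus the vCPU count
ps -eo pcpu,args --sort=-pcpu | head   # which processes are consuming the CPU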

hashar triaged this task as Medium priority. May 2 2019, 8:13 AM

@hashar while on the topic, is it possible for Jenkins to more evenly dispatch PCC jobs across the workers? Currently compiler1002 receives the bulk of the work and currently is at 95% disk full, while compiler1001 is at only 50% disk full.

Sorry, I have missed that comment. Jenkins attempts to schedule a build on a slave that already ran it previously; the original intent was to save time when updating the source repository from an SCM. If a build already occurred on a node, then the job workspace probably already has the source code. That does not hold true on our setup though; we always start the build from scratch.

Eventually, via T218458, I have installed a plugin that disables the above behavior so that builds should be more evenly spread across the available instances. I have deployed it on May 6th, and thus puppet compile jobs should hopefully be equally split between compiler1001 and compiler1002. I don't know how to proof-check that though :-(
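
One rough way to proof-check the distribution, assuming the job exposes builtOn via the Jenkins JSON API and keeps enough build history, would be to count recent builds per node:

curl -s 'https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/api/json?tree=builds[number,builtOn]' \
  | jq -r '.builds[].builtOn' | sort | uniq -c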

Ah, good news! Indeed, looking at the filesystem, jobs look to be more evenly balanced in the last week than in the past. Thanks!

I thought about this task a little bit. The current instances have 4 vCPUs. The operations-puppet-catalog-compiler-test job runs the compiler with NUM_THREADS=2.

I would suggest:

  • Use x1.large instances (8 vCPUs / 16G RAM / 160G disk). The RAM / disk is a bit overkill since the compiler is mostly CPU bound iirc.
  • Set the jobs to use NUM_THREADS=6 (or 7? so we at least have one CPU left for the rest); see the sketch after this list.
  • Add a third instance to the pool
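
As a sketch of the NUM_THREADS bullet (illustrative only, not the actual operations-puppet-catalog-compiler-test configuration), the thread count could be derived from the vCPU count while keeping one CPU free:

NUM_THREADS=$(( $(nproc) - 1 ))   # 7 on an 8 vCPU flavor, 3 on the current m1.large
export NUM_THREADS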

@herron any thoughts about my proposals? ;)

Hmm, I'm not seeing x1.large as an available instance type, but I do see these types:

Name       vCPUs  RAM    Total Disk
m1.medium  2      4 GB   40 GB
m1.large   4      8 GB   80 GB
m1.xlarge  8      16 GB  160 GB

The puppet-diffs project now has a total quota of 32GB RAM and 16 vCPUs. To increase instance count we could either add a couple of m1.large instances, or reach out to wmcs for a quota increase again. I'd be fine with the former, but happy to re-open the quota task if you'd like.

I like the idea of aligning NUM_THREADS with the CPU count on the compiler hosts. We could even set NUM_THREADS=4 right away on the current hosts for some improvement on compilations >2 hosts. In my testing this speeds up compilation when the number of hosts is >= NUM_THREADS, but does not cause the prod/change compilations for the same host to run in parallel.

@herron sorry for the misleading flavor name. I indeed thought about using the 8 vCPU ones: m1.xlarge. But as you stated, more CPU is only useful when testing a lot of hosts, so ..

I guess yes, let's get a couple more 4 vCPU m1.large instances and bump NUM_THREADS to 4.

Change 523158 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] puppet compiler: bump threads 2 -> 4

https://gerrit.wikimedia.org/r/523158

Mentioned in SAL (#wikimedia-releng) [2019-07-15T12:55:35Z] <hashar> Creating compiler1003.puppet-diffs.eqiad.wmflabs [172.16.2.46] m1.large / 4vCPUs # T221969

Mentioned in SAL (#wikimedia-releng) [2019-07-15T13:30:35Z] <hashar> Deleting compiler1003.puppet-diffs.eqiad.wmflabs .. insanely slow for some reason (was on cloudvirt1002) # T221969

hashar changed the task status from Open to Stalled. Jul 15 2019, 1:45 PM

I tried provisioning a new instance but puppet takes ages / fails retrieving files. I have filed T228056 for that.

The faulty compiler1003 instance: https://horizon.wikimedia.org/project/instances/7ee33824-55f0-4c80-ba7f-c317bea301a3/

On compiler1003 I have applied role::puppet_compiler and the hiera config. Postgres fails though:

Notice: /Stage[main]/Postgresql::Slave/Exec[pg_basebackup-compiler1002.puppet-diffs.eqiad.wmflabs]/returns: pg_basebackup: could not connect to server: FATAL:  no pg_hba.conf entry for replication connection from host "172.16.2.57", user "replication", SSL on
Notice: /Stage[main]/Postgresql::Slave/Exec[pg_basebackup-compiler1002.puppet-diffs.eqiad.wmflabs]/returns: FATAL:  no pg_hba.conf entry for replication connection from host "172.16.2.57", user "replication", SSL off
Error: /usr/bin/pg_basebackup -X stream -D /srv/postgres/9.6/main -h compiler1002.puppet-diffs.eqiad.wmflabs -U replication -w returned 1 instead of one of [0]
Error: /Stage[main]/Postgresql::Slave/Exec[pg_basebackup-compiler1002.puppet-diffs.eqiad.wmflabs]/returns: change from notrun to 0 failed: /usr/bin/pg_basebackup -X stream -D /srv/postgres/9.6/main -h compiler1002.puppet-diffs.eqiad.wmflabs -U replication -w returned 1 instead of one of [0]
Notice: /Stage[main]/Postgresql::Slave/File[/srv/postgres/9.6/main/recovery.conf]: Dependency Exec[pg_basebackup-compiler1002.puppet-diffs.eqiad.wmflabs] has failures: true
Warning: /Stage[main]/Postgresql::Slave/File[/srv/postgres/9.6/main/recovery.conf]: Skipping because of failed dependencies
Notice: /Stage[main]/Puppetdb::App/File[/etc/puppetdb]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Puppetdb::App/File[/etc/puppetdb]: Not removing directory; use 'force' to override
Error: Could not remove existing file
Error: /Stage[main]/Puppetdb::App/File[/etc/puppetdb]/ensure: change from directory to link failed: Could not remove existing file
...

I have no idea why it refers to compiler1002 when the instance is 1003!

In hiera config:

- profile::puppetdb::master: compiler1002.puppet-diffs.eqiad.wmflabs
+ profile::puppetdb::master: compiler1003.puppet-diffs.eqiad.wmflabs

Puppet fails due to /etc/puppetdb being a directory when it tries to make it a symlink. Eventually I found the instance has the old default puppetdb installed:

# apt-cache policy puppetdb
puppetdb:
  Installed: 2.3.8-1~wmf1+stretch
  Candidate: 4.4.0-1~wmf2
  Version table:
     4.4.0-1~wmf2 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/component/puppetdb4 amd64 Packages
 *** 2.3.8-1~wmf1+stretch 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status

5d07ea3214e9cf14cf404ce7a4d15189f800503a did some refactoring, so the hiera key becomes:

- puppetdb_major_version: 4
+ profile::base::puppet::puppet_major_version: 4

I have removed the package manually as well as /etc/puppetdb. Ran puppet and this time it moved past that error.
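
The manual cleanup was roughly along these lines (exact commands were not recorded in the task; sketch only):

apt-get remove --purge puppetdb   # drop the old 2.3.8 package
rm -rf /etc/puppetdb              # puppet wants this to be a symlink, not a directory
puppet agent -tv                  # re-run puppet so it installs the puppetdb 4 candidate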

Then I guess it has some trouble starting the puppetdb service. It stays blocked on:

7959 pts/0    Sl+    0:12  |                       \_ /usr/bin/ruby /usr/bin/puppet agent -tv
8502 ?        Ss     0:00  |                           \_ /bin/systemctl start puppetdb

I have rebooted the instance; it ran puppet on boot and apparently still gets stuck on systemctl start puppetdb :-\
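
A generic sketch for poking at the hung unit, nothing puppetdb-specific:

systemctl list-jobs                        # is a puppetdb.service start job stuck in the queue?
systemctl status puppetdb --no-pager       # current unit state
journalctl -u puppetdb -n 50 --no-pager    # last log lines from the service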

Change 523158 merged by jenkins-bot:
[integration/config@master] puppet compiler: bump threads 2 -> 4

https://gerrit.wikimedia.org/r/523158

hashar changed the task status from Stalled to Open. Sep 17 2019, 1:47 PM

To work around the insanely slow puppet run from T228056, I commented out base::resolving::labs_additional_domains in hiera and the provisioning has been super fast.

Then I have hit the issue of puppetdb not being the proper version, /etc/puppetdb existing, and eventually the service refusing to start due to the lack of /etc/puppetdb/jvm_prometheus_puppetdb_jmx_exporter.yaml (I have touched it).
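
Roughly what that last workaround amounts to (sketch; the path comes from the service error):

touch /etc/puppetdb/jvm_prometheus_puppetdb_jmx_exporter.yaml   # empty placeholder for the missing config
systemctl start puppetdb                                        # still hangs, see below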

Still stuck on systemctl start puppetdb.

I have left compiler1003.puppet-diffs.eqiad.wmflabs around in case @herron can get it fixed. Or maybe the instance is ready and working, but I have no idea how to verify that the puppet compiler behaves properly.

hashar claimed this task.

I have deleted the faulty compiler1003.

To add more instances or have instances with more CPU, the puppet provisioning would first have to be fixed. But the first run takes ages (declined: T228056) and there are then a bunch of issues with setting up puppetdb, which does not even start.

The only change made has been to bump the number of threads from 2 to 4 via https://gerrit.wikimedia.org/r/523158 ( https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/ now has NUM_THREADS=4 ).

Meanwhile, it seems the bump of the number of threads has been good enough. If there is interest in having more instances, one would have to first fix the provisioning.