
Puppet catalog compiler - increasing max concurrent jobs
Closed, Resolved · Public · 0 Estimated Story Points

Description

We've seen a few times that when 2 compiler jobs are running at the same time, a 3rd request will sit queued, sometimes for a long time, until one of the running jobs completes. It seems Jenkins currently dispatches only 1 puppet catalog compiler job at a time per worker node.

Could we increase this to 2 concurrent jobs per worker? Or is it necessary to add worker nodes to accomplish this?
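
For reference, a rough way to check how many executors each worker currently offers is the Jenkins JSON API (sketch only; assumes anonymous read access and that jq is available):

curl -s 'https://integration.wikimedia.org/ci/computer/api/json?tree=computer[displayName,numExecutors]' \
  | jq -r '.computer[] | "\(.displayName): \(.numExecutors) executor(s)"'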

Event Timeline

Indeed there are just two m1.large instances, which have:

4 vCPUs
8 GB RAM
80 GB disk

I would assume the puppet compiler to be CPU bound, and iirc it uses a minimum of two parallel processes per run (one for the production branch and another one for the current change). When compiling for multiple hosts the job takes longer and keeps the slot busy for a little while.

Adding more executors would let concurrent builds start faster, but they would still be racing for CPU. So potentially either add more vCPUs or, most probably, just add a couple more instances :-] It is probably just a matter of spawning two new ones in Horizon, applying the puppet class and then attaching them to the Jenkins master.
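
As a rough spot check (assuming shell access to a compiler instance), one could confirm the CPU contention while two builds overlap:

nproc                                  # vCPUs available (4 on m1.large)
uptime                                 # load average versus the vCPU count
ps -eo pcpu,args --sort=-pcpu | head   # which processes are consuming the CPU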

hashar triaged this task as Medium priority. May 2 2019, 8:13 AM

@hashar while on the topic, is it possible for Jenkins to more evenly dispatch PCC jobs across the workers? Currently compiler1002 receives the bulk of the work and currently is at 95% disk full, while compiler1001 is at only 50% disk full.

Sorry, I have missed that comment. Jenkins attempts to schedule a build on a slave that already ran it previously; the original intent was to save time when updating the source repository from an SCM. If a build already occurred on a node, then the job workspace probably already has the source code. That does not hold true on our setup though; we always start the build from scratch.

Eventually, via T218458, I have installed a plugin that disables the above behavior so that builds should be more evenly spread across the available instances. I have deployed it on May 6th, and thus puppet compile jobs should hopefully be equally split between compiler1001 and compiler1002. I don't know how to proof-check that though :-(
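
One rough way to proof-check the distribution, assuming the job exposes builtOn via the Jenkins JSON API and keeps enough build history, would be to count recent builds per node:

curl -s 'https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/api/json?tree=builds[number,builtOn]' \
  | jq -r '.builds[].builtOn' | sort | uniq -c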

Ah, good news! Indeed, looking at the filesystem, jobs look to be more evenly balanced in the last week than in the past. Thanks!

I thought about this task a little bit. The current instances have 4 vCPUs. The operations-puppet-catalog-compiler-test job runs the compiler with NUM_THREADS=2.

I would suggest:

  • Use x1.large instances (8 vCPUs / 16G RAM / 160G disk). The RAM / disk is a bit overkill since the compiler is mostly CPU bound iirc.
  • Set the jobs to use NUM_THREADS=6 (or 7? so we at least have one CPU left for the rest); see the sketch after this list.
  • Add a third instance to the pool
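
As a sketch of the NUM_THREADS bullet (illustrative only, not the actual operations-puppet-catalog-compiler-test configuration), the thread count could be derived from the vCPU count while keeping one CPU free:

NUM_THREADS=$(( $(nproc) - 1 ))   # 7 on an 8 vCPU flavor, 3 on the current m1.large
export NUM_THREADS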

@herron any thoughts about my proposals? ;)

Hmm, I'm not seeing x1.large as an available instance type, but I do see these types:

Name       vCPUs  RAM    Total Disk
m1.medium  2      4 GB   40 GB
m1.large   4      8 GB   80 GB
m1.xlarge  8      16 GB  160 GB

The puppet-diffs project now has a total quota of 32GB RAM and 16 vCPUs. To increase instance count we could either add a couple of m1.large instances, or reach out to wmcs for a quota increase again. I'd be fine with the former, but happy to re-open the quota task if you'd like.

I like the idea of aligning NUM_THREADS with the CPU count on the compiler hosts. We could even set NUM_THREADS=4 right away on the current hosts for some improvement on compilations >2 hosts. In my testing this speeds up compilation when the number of hosts is >= NUM_THREADS, but does not cause the prod/change compilations for the same host to run in parallel.

@herron sorry for the misleading flavor name. I indeed thought about using the 8 vCPU ones: m1.xlarge. But as you stated, more CPU is only useful when testing a lot of hosts, so ..

I guess yes, let's get a couple more 4 vCPU m1.large instances and bump NUM_THREADS to 4.

Change 523158 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] puppet compiler: bump threads 2 -> 4

https://gerrit.wikimedia.org/r/523158

Mentioned in SAL (#wikimedia-releng) [2019-07-15T12:55:35Z] <hashar> Creating compiler1003.puppet-diffs.eqiad.wmflabs [172.16.2.46] m1.large / 4vCPUs # T221969

Mentioned in SAL (#wikimedia-releng) [2019-07-15T13:30:35Z] <hashar> Deleting compiler1003.puppet-diffs.eqiad.wmflabs .. insanely slow for some reason (was on cloudvirt1002) # T221969

hashar changed the task status from Open to Stalled. Jul 15 2019, 1:45 PM

I tried provisioning a new instance but puppet takes ages / fails retrieving files. I have filed T228056 for that.

The faulty compiler1003 instance: https://horizon.wikimedia.org/project/instances/7ee33824-55f0-4c80-ba7f-c317bea301a3/

On compiler1003 I have applied role::puppet_compiler and the hiera config. Postgres fails though:

Notice: /Stage[main]/Postgresql::Slave/Exec[pg_basebackup-compiler1002.puppet-diffs.eqiad.wmflabs]/returns: pg_basebackup: could not connect to server: FATAL:  no pg_hba.conf entry for replication connection from host "172.16.2.57", user "replication", SSL on
Notice: /Stage[main]/Postgresql::Slave/Exec[pg_basebackup-compiler1002.puppet-diffs.eqiad.wmflabs]/returns: FATAL:  no pg_hba.conf entry for replication connection from host "172.16.2.57", user "replication", SSL off
Error: /usr/bin/pg_basebackup -X stream -D /srv/postgres/9.6/main -h compiler1002.puppet-diffs.eqiad.wmflabs -U replication -w returned 1 instead of one of [0]
Error: /Stage[main]/Postgresql::Slave/Exec[pg_basebackup-compiler1002.puppet-diffs.eqiad.wmflabs]/returns: change from notrun to 0 failed: /usr/bin/pg_basebackup -X stream -D /srv/postgres/9.6/main -h compiler1002.puppet-diffs.eqiad.wmflabs -U replication -w returned 1 instead of one of [0]
Notice: /Stage[main]/Postgresql::Slave/File[/srv/postgres/9.6/main/recovery.conf]: Dependency Exec[pg_basebackup-compiler1002.puppet-diffs.eqiad.wmflabs] has failures: true
Warning: /Stage[main]/Postgresql::Slave/File[/srv/postgres/9.6/main/recovery.conf]: Skipping because of failed dependencies
Notice: /Stage[main]/Puppetdb::App/File[/etc/puppetdb]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Puppetdb::App/File[/etc/puppetdb]: Not removing directory; use 'force' to override
Error: Could not remove existing file
Error: /Stage[main]/Puppetdb::App/File[/etc/puppetdb]/ensure: change from directory to link failed: Could not remove existing file
...

I have no idea why it refers to compiler1002 when the instance is 1003!

In hiera config:

- profile::puppetdb::master: compiler1002.puppet-diffs.eqiad.wmflabs
+ profile::puppetdb::master: compiler1003.puppet-diffs.eqiad.wmflabs

Puppet fails due to /etc/puppetdb being a directory when it tries to make it a symlink. Eventually I found the instance has the old default puppetdb installed:

# apt-cache policy puppetdb
puppetdb:
  Installed: 2.3.8-1~wmf1+stretch
  Candidate: 4.4.0-1~wmf2
  Version table:
     4.4.0-1~wmf2 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/component/puppetdb4 amd64 Packages
 *** 2.3.8-1~wmf1+stretch 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status

5d07ea3214e9cf14cf404ce7a4d15189f800503a did some refactoring, so the hiera key becomes:

- puppetdb_major_version: 4
+ profile::base::puppet::puppet_major_version: 4

I have removed the package manually as well as /etc/puppetdb. Ran puppet and this time it moved past that error.
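
The manual cleanup was roughly along these lines (exact commands were not recorded in the task; sketch only):

apt-get remove --purge puppetdb   # drop the old 2.3.8 package
rm -rf /etc/puppetdb              # puppet wants this to be a symlink, not a directory
puppet agent -tv                  # re-run puppet so it installs the puppetdb 4 candidate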

Then I guess it has some trouble starting the puppetdb service. It stays blocked on:

7959 pts/0    Sl+    0:12  |                       \_ /usr/bin/ruby /usr/bin/puppet agent -tv
8502 ?        Ss     0:00  |                           \_ /bin/systemctl start puppetdb

I have rebooted the instance; it ran puppet on boot and apparently still gets stuck on systemctl start puppetdb :-\
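
A generic sketch for poking at the hung unit, nothing puppetdb-specific:

systemctl list-jobs                        # is a puppetdb.service start job stuck in the queue?
systemctl status puppetdb --no-pager       # current unit state
journalctl -u puppetdb -n 50 --no-pager    # last log lines from the service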

Change 523158 merged by jenkins-bot:
[integration/config@master] puppet compiler: bump threads 2 -> 4

https://gerrit.wikimedia.org/r/523158

hashar changed the task status from Stalled to Open. Sep 17 2019, 1:47 PM

To work around the insanely slow puppet run from T228056, I commented out base::resolving::labs_additional_domains in hiera and the provisioning has been super fast.

Then I have hit the issue of puppetdb not being the proper version, /etc/puppetdb existing, and eventually the service refusing to start due to the lack of /etc/puppetdb/jvm_prometheus_puppetdb_jmx_exporter.yaml (I have touched it).
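
Roughly what that last workaround amounts to (sketch; the path comes from the service error):

touch /etc/puppetdb/jvm_prometheus_puppetdb_jmx_exporter.yaml   # empty placeholder for the missing config
systemctl start puppetdb                                        # still hangs, see below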

Still stuck on systemctl start puppetdb.

I have left compiler1003.puppet-diffs.eqiad.wmflabs around in case @herron can get it fixed. Or maybe the instance is ready and working, but I have no idea how to verify that the puppet compiler behaves properly.

hashar claimed this task.

I have deleted the faulty compiler1003.

To add more instances or have instances with more CPU, the puppet provisioning would first have to be fixed. But the first run takes ages (declined: T228056) and there are then a bunch of issues with setting up puppetdb, which does not even start.

The only change made has been to bump the number of threads from 2 to 4 via https://gerrit.wikimedia.org/r/523158 ( https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/ now has NUM_THREADS=4 ).

Meanwhile, it seems the bump of the number of threads has been good enough. If there is interest in having more instances, one would have to first fix the provisioning.