Page MenuHomePhabricator

CI for operations/puppet is taking too long
Closed, ResolvedPublic

Description

For each operations/puppet changeset, a short Jenkins job runs pep8/pyflakes/puppet-lint etc. For most changesets you would expect such a job to finish in a few seconds, and in under half a minute in the vast majority of cases.

However, it's often the case that such a job takes minutes, sometimes many minutes (e.g. 5 minutes for a couple of changesets of mine just now), slowing us down. Since operations/puppet is configured as Fast Forward Only, someone often needs to just rebase and then wait in order to merge a changeset (or worse: more than one!). Moreover, since changes have to be followed by a manual puppet-merge on the puppetmaster (and sometimes babysat), we can't have Jenkins automatically merge changesets with C+2/V+2.

All of the above means that a lot of us go through a daily process that is essentially: hit Rebase, sit idle at our screens while Jenkins takes minutes to finish, hit Submit, run puppet-merge, then do it all over again for the next changeset, and the next, and so on.

This is a waste of time, energy and focus, and incredibly frustrating. It was unacceptable four years ago when I first raised it (I searched my IRC logs: Feb 19th and Mar 12th, 2013 were when I first pointed this out) and it definitely is now, with a far larger puppet tree and contributor base.

operations/puppet is one of the most active repositories Wikimedia has, and one of the most critical ones as far as errors creeping into production go, so manually forcing V+2 is dangerous enough. It doesn't look like the current CI plans are going to address this particular CI pipeline, at least not anytime soon. We expect better than this.

I realize this is the result of some poor architectural choices made in the past (some of them against our strong recommendations at the time) and can't be fixed anytime soon without a major rearchitecting effort. I'd still be interested in a shorter-term fix though, for example bypassing Nodepool entirely and having 1-2 VMs dedicated to puppet linting (or any other short-term fix you folks can think of).

Event Timeline

We should swap operations/puppet to "Rebase If Necessary" like we've done with operations/mediawiki-config and a few other repos. Basically it does a ff if it can, then tries to rebase before retrying so you always end up with a ff-only history (no merge commits). The only time it breaks is when you want it to: there are merge conflicts.

I meant to mention this some time ago. You folks really do waste a lot of time hitting Rebase every day.

What setting were we on before we moved to FF-only? Whatever the prior setting was, it tended to create spurious merge commits all over our actual git history which makes a mess of it, like: https://github.com/wikimedia/puppet/commit/1df0dab661e57fc73942defa4ba4a57436b1b240 . Is "Rebase if necessary" new-ish since then or did we just fail to know it was the optimal option?

What setting were we on before we moved to FF-only? Whatever the prior setting was, it tended to create spurious merge commits all over our actual git history which makes a mess of it, like: https://github.com/wikimedia/puppet/commit/1df0dab661e57fc73942defa4ba4a57436b1b240 . Is "Rebase if necessary" new-ish since then or did we just fail to know it was the optimal option?

It was "Merge if Necessary" which gave you the merge commits you don't want. Rebase If Necessary is...newer than when you changed it last, but pretty old (we've had it for at least a year or so).

For a high-traffic repo that wants a ff-style history, it's the best merge policy.

We should swap operations/puppet to "Rebase If Necessary" like we've done with operations/mediawiki-config and a few other repos. Basically it does a ff if it can, then tries to rebase before retrying so you always end up with a ff-only history (no merge commits). The only time it breaks is when you want it to: there are merge conflicts.

Does this actually work now? In my (recent-ish, as in a month or two ago) experience with wmf-config it doesn't really, and Gerrit often says that patches "Cannot Merge" if they touch the same file, even if it's thousands of lines away and the rebase button works.

I meant to mention this some time ago. You folks really do waste a lot of time hitting Rebase every day.

So do SWATters

We should swap operations/puppet to "Rebase If Necessary" like we've done with operations/mediawiki-config and a few other repos. Basically it does a ff if it can, then tries to rebase before retrying so you always end up with a ff-only history (no merge commits). The only time it breaks is when you want it to: there are merge conflicts.

Does this actually work now? In my (recent-ish, as in a month or two ago) experience with wmf-config it doesn't really, and Gerrit often says that patches "Cannot Merge" if they touch the same file, even if it's thousands of lines away and the rebase button works.

It conflicts on same files a little more often than a normal rebase does, but it handles a lot more scenarios than setting to FF-only.

I meant to mention this some time ago. You folks really do waste a lot of time hitting Rebase every day.

So do SWATters

Indeed. And it's saved us quite a bit of time. It's certainly better than FF-only :)

We should swap operations/puppet to "Rebase If Necessary" like we've done with operations/mediawiki-config and a few other repos. Basically it does a ff if it can, then tries to rebase before retrying so you always end up with a ff-only history (no merge commits). The only time it breaks is when you want it to: there are merge conflicts.

I meant to mention this some time ago. You folks really do waste a lot of time hitting Rebase every day.

Thanks! This is an orthogonal issue to what this task is about and we can certainly explore it.

However, I'd go as far as to say that it's a hacky workaround and one that we should be really careful to apply: a diff on top of commit A may pass CI tests, but the same diff on top of commit B may not, even if it applies cleanly (because, for example, it uses an API that was deprecated somewhere else in the tree). Not running CI tests exactly on top of what you're about to land is, in principle, the wrong thing to do. In practice, the CI tests of our puppet code are so limited and naive that they won't catch those types of errors anyway, so perhaps it's just theoretical at this point.

However, I'd go as far as to say that it's a hacky workaround and one that we should be really careful to apply: a diff on top of commit A may pass CI tests, but the same diff on top of commit B may not, even if it applies cleanly (because, for example, it uses an API that was deprecated somewhere else in the tree). Not running CI tests exactly on top of what you're about to land is, in principle, the wrong thing to do. In practice, the CI tests of our puppet code are so limited and naive that they won't catch those types of errors anyway, so perhaps it's just theoretical at this point.

As I understand it, our CI system is smart enough to run the gate-and-submit tests on the commit as it would be when landed (i.e. it locally merges/rebases the commit into/onto master, then runs tests on top of that merged/rebased state). AFAIK it's worked that way since we moved MW to Gerrit in 2012.

As I understand it, our CI system is smart enough to run the gate-and-submit tests on the commit as it would be when landed (i.e. it locally merges/rebases the commit into/onto master, then runs tests on top of that merged/rebased state). AFAIK it's worked that way since we moved MW to Gerrit in 2012.

Puppet doesn't use gate-and-submit, we manually submit (on purpose).

I'm not sure how "Rebase If Necessary" works exactly, but either Jenkins would not run again against the rebased commit (causing the issue that I mentioned above: submitting an untested combination of parent commit + commit diff), or submission would be blocked on CI re-running the test suite, which puts us back at square one as far as the original task description (CI taking too long) goes.

The situation you outline is certainly possible. Granted, CI did test it against the latest HEAD, but there's a chance another commit landed in the short window between your commit being tested and it actually landing. In practice, this window is pretty small. The majority of use cases (just pressing Rebase because there are intermediate commits) won't be affected at all, since your +2 would already have done the rebase at testing time. I can see a little bit of hesitation, but as I said we've been using it in mw-config for some time now with good results. It's not perfect, but it does avoid a lot of the pointless clicking and testing just because you waited a day to merge a totally standalone commit.

As far as the 5 minutes to test goes, I think that's a bit of bad timing today and bears some explanation. The ideal scenario for Puppet testing is actually only about a minute; this is what I and others routinely get. What happened yesterday morning and spawned this bug was that puppet had some tests queued right as the new (daily) nodepool images went out. The first test to pick one of these up is always going to be lagged by a few minutes. Puppet happened to be the unlucky one; it could just as easily have been MW core, OOUI, or anything else. In fact, puppet is already prioritized well above all other commits, so it flies through the backlog compared to other stuff. Considering it only takes about a minute, I don't think we can do much more.

Dedicated slaves would only help to a very small degree. They'd still use the same zuul scheduler and queues. The only difference would be the situation in which all executors are currently backed up on slow tests. But that's always a problem for everyone and deserves investigation when it does happen. Assuming all tests are moving, puppet moves fastest.

As I understand it, our CI system is smart enough to run the gate-and-submit tests on the commit as it would be when landed (i.e. it locally merges/rebases the commit into/onto master, then runs tests on top of that merged/rebased state). AFAIK it's worked that way since we moved MW to Gerrit in 2012.

Puppet doesn't use gate-and-submit, we manually submit (on purpose).

I'm not sure how "Rebase If Necessary" works exactly, but either Jenkins would not run again against the rebased commit (causing the issue that I mentioned above: submitting an untested combination of parent commit + commit diff), or submission would be blocked on CI re-running the test suite, which puts us back at square one as far as the original task description (CI taking too long) goes.

That's correct, Jenkins would not run against the rebased commit if you manually submit it. You would instead have to +2 it and then wait for Jenkins to automatically merge your commit if the tests pass. If it makes you feel any better, you don't have to wait for the first Jenkins run (the one that is kicked off right when you upload the change) to finish before +2ing, and the CI system doesn't wait for that either, so if you want you can +2 commits immediately after uploading them, or +2 things when you decide they should be merged without rebasing them first.

This seems to me to be more or less equivalent to the current process (rebase, wait for CI, then submit), except in two cases: if you know the commit doesn't need to be rebased and Jenkins has already tested it, the current process is faster (submit instantaneously vs +2 and wait for CI to rerun); and conversely, if you are submitting many commits, the other process is faster because you don't have to keep rebasing them, you can just +2 them all and Zuul will sort things out for you(*).

Of course none of this changes the fact that CI for puppet and mw-config should take time measured in seconds not minutes (and CI for MW commits should not take up to 45 minutes as it sometimes does); but if you find yourselves rebasing the same commit multiple times, then Rebase If Necessary + gate-and-submit may be faster on balance than Fast Forward Only + manual submit.

(*) Specifically, it'll stack the commits in order of submission, and run CI on each commit based on the speculative assumption that the ones that precede it will pass and be merged. If one of the commits fails, the commits that were assuming it would pass will adapt by cutting out the failing commit and restarting the stacked CI process without it. Effectively, Zuul rebases all the pending +2ed commits on top of each other, and in case of failure dynamically rebases commits that were rebased on the failing commit to instead be based on something that hasn't yet failed. This adaptation is done as soon as one of the jobs for a commit fails (as opposed to when all jobs for that commit have finished), which is nice for repos like MW core where we have fast lint check jobs as well as slow unit test jobs.
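
As a rough illustration of that stacking behaviour, here is a toy sketch (purely illustrative, not Zuul's actual implementation; the passes lambda stands in for CI):

# Toy model of speculative gating: each queued change is tested on top of the
# changes ahead of it; when one fails, it is dropped and the changes behind it
# are re-tested on the new speculative base.
def speculative_gate(queue, passes)
  merged = []
  pending = queue.dup
  until pending.empty?
    # Find the first pending change that fails when tested on top of
    # everything merged plus the pending changes ahead of it.
    failing = pending.each_index.find { |i| !passes.call(merged + pending.take(i), pending[i]) }
    if failing.nil?
      merged.concat(pending)     # everything passed speculatively: merge in order
      pending = []
    else
      pending.delete_at(failing) # drop the failing change, re-test the rest
    end
  end
  merged
end

# Example: change "B" always fails its tests, "A" and "C" always pass.
passes = ->(_base, change) { change != 'B' }
p speculative_gate(%w[A B C], passes)  # => ["A", "C"]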

I wasn't talking about a race of a few seconds or minutes. I think both of you are describing a gate-and-submit workflow, or at least one where Jenkins submits (or I'm missing something!). We don't do that now, and I don't think we would get much by doing it: we still need to SSH to a machine and run "puppet-merge" after submission (and often watch the fleet do what you told it to do), so the whole submission process is a synchronous mental process. You can't let Jenkins run its tests and then go do something else, and assume that Jenkins will do the merge/submission when it's time.

And no, this isn't about just this particular occurrence of the Nodepool image regeneration (although that's also a problem). The CI times for at least operations/puppet have been terrible and unpredictable for as long as I can remember. I've been regularly complaining about that since… 2013. Do we keep per-job runtimes somewhere so that we could check the stats? In my experience 2 minutes is common, 1 minute happens maybe half of the time, and then maybe 10-20% of the time we get something crazy. Even one minute is still too long IMHO.

Looking at the data we have, it seems that the tests themselves take about 50 seconds to run, and the entire "event in gerrit to results returned" time is less than 2.5 minutes (avg). That (2.5 minutes per run in a clean environment) doesn't sound unreasonable to me.

So, with those numbers, I wonder if we can apply some of the time-saving measures (like Rebase If Necessary) to make it a better experience (not running tests when we don't need to). Other than "redo the way CI is maintained" (which is already in progress with the Release Pipeline project) we should identify smaller wins to help (like the above).

Looking at the data we have, it seems that the tests themselves take about 50 seconds to run, and the entire "event in gerrit to results returned" time is less than 2.5 minutes (avg). That (2.5 minutes per run in a clean environment) doesn't sound unreasonable to me.

So, with those numbers, I wonder if we can apply some of the time-saving measures (like Rebase If Necessary) to make it a better experience (not running tests when we don't need to). Other than "redo the way CI is maintained" (which is already in progress with the Release Pipeline project) we should identify smaller wins to help (like the above).

Thanks, that graph was exactly what I was looking for :) 2.5 minutes on average is pretty unreasonable in my view. As Roan eloquently put it, anything that is measured in minutes rather than seconds is pretty unreasonable, considering the limited set of checks that lint job performs.

Look, we get it, CI is slower than people would like. When we proposed the nodepool backend we were optimizing for a clean environment and maintainability. We got half of that (the first half). We're working on replacing that system, but it won't come immediately. So... we know, we're working on it, and our first round of testing of the new system (not ready for prime time) suggests it will greatly reduce turnaround time. Anything else needed here?

We're still open to helping get ops/puppet in a better place than it is now with small wins until we can migrate to the new docker based system, if you're game.

Well, first of all, right before I filed this task, Antoine said on IRC:

containers for CI would be for later. The priority has been set toward staging cluster / moving services to prod. So not sure when we will switch CI to containers

So it doesn't really sound like the primary use case is jobs like the operations/puppet linting, at least not initially. Do you disagree and/or have an ETA for this new/faster system? As I mentioned, this has been a problem for many years already; waiting another couple of years is probably not a great outcome.

Second, yes, I'm all for small wins/workarounds until the longer-term work materializes. This is exactly what this task is about in fact; see the last paragraph of the original task description.

So it doesn't really sound like the primary use case is jobs like the operations/puppet linting, at least not initially. Do you disagree and/or have an ETA for this new/faster system?

Not initially, no. The agreed upon target/first use case for the pipeline working group was mathoid. The ops/puppet linting use case has no firm ETA yet (we have a few cross pipeline working group dependencies we should clarify soon, more on that separately ;) ).

Second, yes, I'm all for small wins/workarounds until the longer-term work materializes.

Cool. We'll do a spike on a rudimentary (probably docker-based?) solution for just ops/puppet and try to determine (roughly) what kind of maintenance overhead it will incur (that's the main drawback of the pre-nodepool infrastructure, the maintenance burden of the permanent slaves) and whether it complicates other in-flight pipeline work. More on that later.

Great, thanks :) I'm looking at the output of a Jenkins job and it looks like it takes about a minute to execute, so I guess we have two semi-related issues: the issue of slow/unpredictable scheduling of jobs, as well as the jobs themselves being a little inefficient/slow to execute.

Looking at one of the logs, it looks like the job does the following:

  1. It clones the repository from scratch, plus all of its submodules (13s)
  2. Syncs Castor(?) (2s)
  3. Sets up a Python virtualenv and runs pep8/flake8 even if no Python files were changed in the commit (20s)
  4. Creates a Ruby bundle with rake, puppet, puppet-lint, rubocop etc. (31 packages in total) from scratch, which is later partially uninstalled as well (7s)
  5. Runs bundle exec rake test, which (I think!) attempts to parse Puppet code even if no Puppet files were changed. I also see RuboCop running for all 213 files even if 0 Ruby files were changed in the commit (17s)
  6. Checks Hiera YAML and Puppet templates, even if they weren't touched by the commit (2s)
  7. Runs xUnit and castor-save, both of which emit failures in the logs ("Result was FAILURE") (3s).

These are all conjectures of mine, just by looking at the log, so correct me when I'm wrong :)

It looks like in my very uninformed view:

  • A separate pool would solve a lot of the priority inversion happening due to puppet sharing a pool of resources with other long-running jobs in the CI pipeline, make the scheduling of those jobs smoother, and place an upper bound on their time in the queue,
  • A separate image (w/ whatever tech) that would have e.g. Rubocop and puppet-lint pre-installed, and potentially a preseeded clone of the repository updated e.g. daily, could solve a lot of those inefficiencies where everything is set up from scratch on every run. Was that part of your plan?
  • There is still a lot of low-hanging fruit in the way these jobs are set up and executed. I think a few heuristics to not needlessly do Python/Ruby/Puppet/erb/Hiera syntax checks (sometimes over the whole repo) on every change could shave off quite a bit from the total runtime of the job.

Orthogonally to the above, I'd still like to explore Chad's proposal for "Rebase If Necessary". What would that look like without a gate-and-submit pipeline? Should we open a separate task for it?

Change 357741 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Docker for operations-puppet-tests

https://gerrit.wikimedia.org/r/357741

These are all conjectures of mine, just by looking at the log, so correct me when I'm wrong :)

Looks all correct. Castor is a thing antoine made to cache dependencies between nodepool instances, i.e. it uses rsync to save things like ruby gems for a particular job to a particular location.

  • A separate pool would solve a lot of the priority inversion happening due to puppet sharing a pool of resources with other long-running jobs in the CI pipeline, make the scheduling of those jobs smoother, and place an upper bound on their time in the queue,

I think this will address the main cause of the random spikes of wait time. Then the problem becomes the actual job runtime.

  • A separate image (w/ whatever tech) that would have e.g. Rubocop and puppet-lint pre-installed, and potentially a preseeded clone of the repository updated e.g. daily, could solve a lot of those inefficiencies where everything is set up from scratch on every run. Was that part of your plan?

This is currently addressed in nodepool by castor. This is why things like bundle install --clean report Using [whatever gem] rather than Installing [whatever gem].

I made a patch today that runs this job on a permanent VM that is part of the integration project. Most of the work of the actual job runs inside a container that is generated from a Dockerfile that is part of that patch. This ensures that maintenance of that particular VM is minimal since the clone of the repository happens inside the container (it uses a reference repository on disk to speed up clone time). Running on a separate VM will remove operations/puppet from the queue of jobs waiting for nodepool executors which should make runtime more stable.

The other optimization in the above patch is that it runs rake and tox in parallel. The job takes roughly 30 seconds to execute in this way with no other optimizations.
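
For illustration, running the two suites concurrently boils down to something like the following small wrapper (a sketch only; the actual job wires this up in its Jenkins/Docker entry point, and the exact commands are assumptions):

#!/usr/bin/env ruby
# Sketch: run the Ruby and Python lint suites concurrently and fail if either
# fails. Commands are assumptions, not the job's real invocation.
pids = {
  'rake' => Process.spawn('bundle', 'exec', 'rake', 'test'),
  'tox'  => Process.spawn('tox', '-e', 'pep8'),
}

failed = pids.reject do |_name, pid|
  Process.wait(pid)
  $?.success?
end

abort("Failed: #{failed.keys.join(', ')}") unless failed.empty?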

Once this patch has gotten some review the twisty path through CI will be:

  1. gerrit patch is pushed and zuul adds a job to zuul merger to create the patchset on contint1001 (this happens now, takes < 1 second)
  2. zuul merger completes, which tells zuul to add a job to the test-prio pipeline (which was set up by Antoine and gives operations/puppet the highest priority of all queues), again < 1 second
  3. The job in the test-prio pipeline kicks off a run of https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-tests-docker on integration-docker-slave-1000 which in its current state takes ~30 seconds.
  4. results are recorded by jenkins, subsequently by zuul and the appropriate results are appended to the patch < 1 second

Thank you @Catrope @demon for the description of how patches are prepared to be tested and the gate system. I endorse your description of the mechanism :-]

Orthogonally to the above, I'd still like to explore Chad's proposal for "Rebase If Necessary". What would that look like without a gate-and-submit pipeline? Should we open a separate task for it?

That is probably trivial enough that we can just do it, although we might want to discuss the exact strategy to use. Last week I switched integration/config from Fast Forward Only to Rebase If Necessary + allow content merge. That saves a lot of rebases (and instance consumption), as well as people racing to get their changes merged. We can probably use a new task to discuss it at length and ponder the pros/cons of such a change.

While at it, I would like to explore switching operations/puppet to CR+2 and letting CI merge the change for you. I believe that would be a net win, with the drawback that ops would have to be extra careful when rebasing on the puppetmaster. Maybe that is a risk we cannot take. Again, that would probably need a bunch of discussion; I think it is tightly related to the change in the Gerrit submit strategy, so maybe the discussion can be held on the same new task.

Looking at the data we have, it seems that the tests themselves take about 50 seconds to run, and the entire "event in gerrit to results returned" time is less than 2.5 minutes (avg). That (2.5 minutes per run in a clean environment) doesn't sound unreasonable to me.

I shared the links above during the Release Engineering team meeting with an off-the-record explanation, so here is the lengthy one.

Jenkins build time trend

That is the time Jenkins spent running the job from when it starts on a slave until the build is complete (no more commands to run, or exit code != 0). The build time varies due to at least two factors:

  • activity on the OpenStack compute node (labvirt) on which the instance is running. The instances are booted on the machines having the least CPU usage (among other metrics). That was tweaked via T161006. I don't think disk I/O is taken into account.
    • I don't think it is much of an issue. I also don't think there is any good way to debug that short of scheduling the CI instances on dedicated labvirt nodes with SSDs and # of instances * vCPUs/instance == # of real CPUs.
  • some of the lint commands are optimized to only act on files that have been changed in the patchset. I think that is the main cause of variance. The Rakefile has a method git_changed_in_head that returns the files changed in the patch, filtered by file extension. That list is then passed to the underlying command, which acts on fewer files and hence is faster (a rough sketch of such a helper follows this list).
    • puppet parser validate takes ~ 18 seconds (repro: time bundle exec rake syntax)
      • CI uses the rake task syntax:head, which takes a couple of seconds for a patch not changing any .pp file (all Hiera and template files are still validated since that is fast enough).
    • puppet-lint which takes ages (repro: time bundle exec rake puppetlint)
      • CI uses the rake task puppetlint_head, which is nearly instant if no .pp files are changed.
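
For illustration, a git_changed_in_head-style helper boils down to something like this (a rough sketch; the real Rakefile method differs in detail):

# Sketch: list the files touched by the tip commit, optionally filtered by
# extension, so lint tasks only act on what actually changed.
def git_changed_in_head(extensions = [])
  changed = `git diff --name-only --diff-filter=ACM HEAD~1..HEAD`.split("\n")
  return changed if extensions.empty?
  changed.select { |path| extensions.include?(File.extname(path)) }
end

# e.g. only run puppet-lint when .pp files were actually touched:
pp_files = git_changed_in_head(%w[.pp])
puts 'puppet-lint: nothing to do' if pp_files.empty?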

CI processing time

Zuul keeps track of the time from when a Gerrit patchset-created event is received until the change leaves Zuul. That is emitted as the statsd timer metric zuul.pipeline.<name of pipeline>.<repository>.resident_time (e.g. zuul.pipeline.test-prio.operations.puppet.resident_time).

That metric takes into account the whole CI processing:

  • Zuul taking the Gerrit event into account
  • Sending the work request to Gearman
  • Gearman having a worker available
  • Nodepool refilling the pool of instances
  • Jenkins scheduling and running the build
  • Reporting / acting on Gerrit

The graphs definitions are:

  • Max time: consolidateBy(zuul.pipeline.test-prio.operations.puppet.resident_time.upper, 'max')
  • Average time: consolidateBy(zuul.pipeline.test-prio.operations.puppet.resident_time.median, 'average')

Max time is thus the maximum aggregated by maximum, i.e. the worst wait time per datapoint; a change that takes a while to be processed will stand out as a spike. The Average time offers a different figure though.

I have set the thresholds (yellow/red area) arbitrarily to 60 seconds and 120 seconds.

The main cause of spikes is the pool of instances being exhausted and refilled. Specifically, what prompted this task to be opened was a change that took 5+ minutes to report back to Gerrit. The cause was that the base image got refreshed, and it takes some time to transfer it to the compute nodes. I have filed T166889 for that specific issue.

And my final reply, following up on T166888#3322963 that does a breakdown of the job build steps.

  1. It clones the repository from scratch, plus all of its submodules (13s)

We need a full clone so we can find files that got changed in a proposed patchset. I have tried with a shallow clone but there are edge cases that cause the build to fail. The way it is optimized is that the base image has a bare clone of operations/puppet.git and the clone operation uses it as a reference.

Looks like the slow operation is the git fetch from the Zuul git repositories. It might be faster to clone from Gerrit then fetch the prepared patch from the CI server. I guess I can experiment with that.

  2. Syncs Castor(?) (2s)

That is the system that keeps the package managers (pip/gem) dependencies in a central cache. That makes both bundle install and pip install very fast since the modules are already on disk.

  3. Sets up a Python virtualenv and runs pep8/flake8 even if no Python files were changed in the commit (20s)

That part can be optimized the same way I did for puppet-lint: by passing to flake8 solely the files that got changed in the patch.

  4. Creates a Ruby bundle with rake, puppet, puppet-lint, rubocop etc. (31 packages in total) from scratch, which is later partially uninstalled as well (7s)
  5. Runs bundle exec rake test, which (I think!) attempts to parse Puppet code even if no Puppet files were changed. I also see RuboCop running for all 213 files even if 0 Ruby files were changed in the commit (17s)
  6. Checks Hiera YAML and Puppet templates, even if they weren't touched by the commit (2s)

It is actually using the gems that have been preloaded in the home directory by castor. See my previous comment about which files are acted on. But in short: rubocop should be lightning fast given the files are already in the disk cache. Hiera/templates are all validated since that only takes a couple of seconds; that could be optimized to run solely on files changed in HEAD, but for 2 seconds is it worth it?

  7. Runs xUnit and castor-save, both of which emit failures in the logs ("Result was FAILURE") (3s).

xUnit is there to interpret JUnit test reports.

castor-save is definitely unneeded overhead. While working on an unrelated feature last week, I have found a solution to skip that job properly though (twist: conditional-step plugin).

These are all conjectures of mine, just by looking at the log, so correct me when I'm wrong :)

It looks like in my very uninformed view:

  • A separate pool would solve a lot of the priority inversion happening due to puppet sharing a pool of resources with other long-running jobs in the CI pipeline, make the scheduling of those jobs smoother, and place an upper bound on their time in the queue,

CI is surely starving for resources. I take the blame for not having strongly voiced my concern a year or so ago when magically our capacity got cut in half due to an unrelated project. I take the blame for not having merged jobs together earlier to save instance consumption. Then I guess that is the outcome of trying to do too many things at once :-/

I can check with Cloud Services to see whether the pool can be bumped; I am not sure the underlying infrastructure has the capacity for it, but that is worth checking/trying.

We can surely set up a dedicated pool of instances solely for operations/puppet, but I am afraid it would set a precedent for other use cases such as operations/dns, operations/mediawiki-config, wmf branches etc. They are already prioritized over everything else.

The potential promise is to switch to a new infrastructure (k8s), though that has yet to happen and would still require a much larger pool than what we have now.

  • A separate image (w/ whatever tech) that would have e.g. Rubocop and puppet-lint pre-installed, and potentially a preseeded clone of the repository updated e.g. daily, could solve a lot of those inefficiencies where everything is set up from scratch on every run. Was that part of your plan?

That part is covered. The dependencies (flake8, puppet-lint, rubocop, etc.) are populated via castor. The git clone already uses a local mirror. Both were hard prerequisites for the switch to Nodepool.

  • There is still a lot of low-hanging fruit in the way these jobs are set up and executed. I think a few heuristics to not needlessly do Python/Ruby/Puppet/erb/Hiera syntax checks (sometimes over the whole repo) on every change could shave off quite a bit from the total runtime of the job.

I have only optimized the biggest players (puppet parser validate and puppet-lint), but that could most probably be generalized. I am not sure it's worth the effort for a check that takes just a couple of seconds, but I guess we are at a point where every second matters.

__ end of spam of replies __

Change 357804 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] Rake: optimize typos task for CI

https://gerrit.wikimedia.org/r/357804

Thanks for all the detailed responses from all of you, it's really appreciated. It's also great to see Docker patches proposed already -- I have to admit that I didn't expect that so soon! Kudos :)

I was trying to work on that myself from a different angle, taking a closer look at our Rakefiles etc. yesterday and today (going deep into the rabbit hole and submitting a bunch of patches in the process to update to newer RuboCop, puppet, rake, etc. :)

Issues with Nodepool/resourcing aside, I think puppet's Rakefile needs to be rewritten from scratch. It's not very clean code as it is, and even though it could perhaps be gradually improved, it really needs a different inside-out approach. My proposal would be to refactor as follows:

  • Abstract git_changed_in_head into a class that only runs git commands during initialization. It would then categorize the changed files into buckets (puppet, ruby, erb, python, hiera, etc.), running heuristics for the non-obvious ones (e.g. looking at shebangs). The same class could also return all the affected modules, in order to be able to run rspec tests against them. (A rough sketch follows this list.)
  • Create a new namespace "lazy", which would include a task each for the Puppet syntax check, Puppet linting, RuboCop etc., only for the changed files. Even the imports can be put under their respective tasks for faster runtime (and better isolation). CI would run a "lazylint" supertask that would DTRT and could be quick, even instantaneous, in some cases.
  • I haven't investigated this much, but I'm also wondering if we could make the Rakefile the main entry point for all linting, i.e. chainload the tox/Python linters from Rake, benefiting from the above categorization of files and making those similarly fast. We'd also get instant wins for changesets like this one, which increase our coverage.
  • The rest of the tasks (RuboCop for all files, puppet-lint across the tree) can and probably should remain and we can run these locally against the whole repository and perhaps even from CI periodically.
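
For concreteness, a very rough sketch of the first two bullets, with illustrative bucket rules and task names (not a working implementation against the real repository layout):

# Sketch of the proposed refactor: run git once at initialization, bucket the
# changed files, and expose "lazy" lint tasks that act only on the relevant
# bucket. Bucket rules and commands are assumptions for illustration.
class ChangedFiles
  BUCKETS = {
    puppet: ->(f) { f.end_with?('.pp') },
    ruby:   ->(f) { f.end_with?('.rb', 'Rakefile', 'Gemfile') },
    erb:    ->(f) { f.end_with?('.erb') },
    python: ->(f) { f.end_with?('.py') || shebang?(f, 'python') },
    hiera:  ->(f) { f.start_with?('hieradata/') && f.end_with?('.yaml') },
  }.freeze

  def initialize
    # Single git invocation; everything else works off this list.
    @files = `git diff --name-only --diff-filter=ACM HEAD~1..HEAD`.split("\n")
  end

  def in_bucket(name)
    @files.select { |f| BUCKETS.fetch(name).call(f) }
  end

  def self.shebang?(file, interpreter)
    return false unless File.file?(file)
    first = File.open(file) { |io| io.readline }
    first.start_with?('#!') && first.include?(interpreter)
  rescue EOFError
    false
  end
end

namespace :lazy do
  changed = ChangedFiles.new

  task :puppetlint do
    files = changed.in_bucket(:puppet)
    sh "bundle exec puppet-lint #{files.join(' ')}" unless files.empty?
  end

  task :flake8 do
    files = changed.in_bucket(:python)
    sh "tox -e pep8 -- #{files.join(' ')}" unless files.empty?
  end

  desc 'Run all lint tasks against changed files only'
  task lint: %w[lazy:puppetlint lazy:flake8]
end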

My Ruby isn't great and I'm very slow with it, so if someone more knowledgeable in the CI team wants to pick that up, I'd be grateful; for someone experienced it's probably half a day's worth of effort. Otherwise I might propose something, hopefully after the annual review season is over :)

Change 359439 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Drop castor-save from operations/puppet test job

https://gerrit.wikimedia.org/r/359439

Change 359439 merged by jenkins-bot:
[integration/config@master] Drop castor-save from operations/puppet test job

https://gerrit.wikimedia.org/r/359439

Faidon wrote:

Runs xUnit and castor-save, both of which emit failures in the logs ("Result was FAILURE") (3s).

castor-save is definitely unneeded overhead. While working on an unrelated feature last week, I have found a solution to skip that job properly though (twist: conditional-step plugin).

Faidon wrote:

Creates a Ruby bundle with rake, puppet, puppet-lint, rubocop etc. (31 packages in total) from scratch, which is later partially uninstalled as well (7s)

Castor saves package dependencies when a job is a success in gate-and-submit or in postmerge. operations/puppet.git does not have jobs in those pipelines. Hence I was manually running the save process from time to time.

That caused the packages being partially uninstalled: the cache had an outdated copy, which bundle install --clean garbage collects. But the next build would have them again since the cache had not been updated.

Since manual steps are a nightmare and definitely not a good use of my time, I finally refactored that mess to be automatic:

A) on postmerge, a change that touches tox.ini or Gemfile will trigger a job that installs the packages then saves the caches, i.e. it is automatic now.

B) on test-prio, the job still grabs the cache from the central instance, but it no longer attempts to save it, i.e. the castor-save job is no longer triggered, which saves a few milliseconds of useless overhead such as:

00:00:53.540 [PostBuildScript] - Execution post build scripts.
00:00:53.543 Waiting for the completion of castor-save
00:00:53.818 castor-save #458720 completed. Result was FAILURE

If the cache save fails, a mail is sent to the qa-alerts mailing list so we will notice.

Change 359951 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] Rake: memoize git_changed_in_head()

https://gerrit.wikimedia.org/r/359951

Thanks for all the detailed responses from all of you, it's really appreciated. It's also great to see Docker patches proposed already -- I have to admit that I didn't expect that so soon! Kudos :)

I was trying to work on that myself from a different angle, taking a closer look at our Rakefiles etc. yesterday and today (going deep into the rabbit hole and submitting a bunch of patches in the process to update to newer RuboCop, puppet, rake, etc. :)

Issues with Nodepool/resourcing aside, I think puppet's Rakefile needs to be rewritten from scratch. It's not very clean code as it is, and even though it could perhaps be gradually improved, it really needs a different inside-out approach. My proposal would be to refactor as follows:

  • Abstract git_changed_in_head into a class that only runs git commands during initialization. It would then categorize the changed files into buckets (puppet, ruby, erb, python, hiera, etc.), running heuristics for the non-obvious ones (e.g. looking at shebangs). The same class could also return all the affected modules, in order to be able to run rspec tests against them.

I went with an instance variable to cache the output of git diff; this way we save a shell-out + git invocation whenever the method is invoked. https://gerrit.wikimedia.org/r/359951
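
In outline, the memoization amounts to something like this (a sketch; the actual patch differs):

# Sketch: shell out to git only the first time, cache the result in an
# instance variable, and reuse it on every later call.
def changed_files_in_head
  @changed_files_in_head ||= `git diff --name-only HEAD~1..HEAD`.split("\n")
end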

For the rspec modules, it is a bit more complicated. They are tightly coupled and the dependencies are listed in .fixtures.yml files at the root of each module. For example, the nrpe module has:

modules/nrpe/.fixtures.yml
fixtures:
    symlinks:
        nrpe: "#{source_dir}"
        base: "../../../../base"
        ferm: "../../../../ferm"
        monitoring: "../../../../monitoring"
        wmflib: "../../../../wmflib"

So if a patch touches wmflib, we have to recursively find each module that defines it as a dependency, which is slightly more complicated.
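
A rough sketch of how that reverse lookup could work, assuming the module layout shown above (illustrative only):

require 'yaml'

# Sketch: build a reverse dependency map from each module's .fixtures.yml,
# then expand a changed module into every module that transitively depends
# on it.
def fixture_dependencies
  deps = {}
  Dir.glob('modules/*/.fixtures.yml').each do |path|
    mod = File.basename(File.dirname(path))
    symlinks = (YAML.load_file(path) || {}).dig('fixtures', 'symlinks') || {}
    deps[mod] = symlinks.keys - [mod]  # everything linked in, minus the module itself
  end
  deps
end

def modules_affected_by(changed_module, deps = fixture_dependencies)
  affected = [changed_module]
  loop do
    more = deps.select { |mod, reqs| (reqs & affected).any? && !affected.include?(mod) }.keys
    break if more.empty?
    affected.concat(more)
  end
  affected
end

# e.g. a patch touching wmflib should also trigger the specs of nrpe (above):
# modules_affected_by('wmflib')  # => ["wmflib", "nrpe", ...]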

  • Create a new namespace "lazy", which would include a task each for the Puppet syntax check, Puppet linting, RuboCop etc., only for the changed files. Even the imports can be put under their respective tasks for faster runtime (and better isolation). CI would run a "lazylint" supertask that would DTRT and could be quick, even instantaneous, in some cases.

That is what I did originally for the tasks that took the longest:

rake lint_head               # Run all linting commands against HEAD
rake puppetlint_head         # Run puppet-lint
rake syntax:head             # Syntax checks against HEAD
rake syntax:manifests_head   # Syntax check Puppet manifests against HEAD

Which CI invokes from the test task:

rake test
    lint_head
        typos
        rubocop
        syntax:head
            syntax:manifests_head
            syntax:hiera
            syntax:templates
        puppetlint_head

There are some commands I have not optimized since they run in less than a couple of seconds, which is pretty minor compared to the rest of the overhead. I had a look at rubocop a while back; it is not trivial since it has an internal list of file extensions and I am not quite sure how to retrieve them to pass to git_changed_in_head(). Then again, even on the whole tree, rubocop takes less than a couple of seconds.

  • I haven't investigated this much, but I'm also wondering if we could make the Rakefile the main entry point for all linting, i.e. chainload the tox/Python linters from Rake, benefiting from the above categorization of files and making those similarly fast. We'd also get instant wins for changesets like this one, which increase our coverage.

That surely can be done. We can have rake invoke something like tox -e pep8 -- <list of files here> and, in tox.ini, pass the arguments through using: commands = flake8 {posargs}.

For https://gerrit.wikimedia.org/r/#/c/357197 , /utils/flake8_no_extension would have to be adjusted so it can be passed a list of files as parameters to bypass the tree traversal.
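
In sketch form, the rake side could look like this (assuming tox.ini gains commands = flake8 {posargs} as above, and assuming git_changed_in_head() accepts a list of extensions; both are assumptions):

# Sketch of a rake task that chainloads the Python linters, passing only the
# changed files (task name and helper signature are assumptions).
desc 'Run flake8 via tox against changed Python files only'
task :flake8_head do
  py_files = git_changed_in_head(%w[.py])
  if py_files.empty?
    puts 'flake8: no Python files changed, skipping'
  else
    sh 'tox', '-e', 'pep8', '--', *py_files
  end
end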

  • The rest of the tasks (RuboCop for all files, puppet-lint across the tree) can and probably should remain and we can run these locally against the whole repository and perhaps even from CI periodically.

That would be all about crafting a Jenkins job that invokes rake lint (instead of rake test), polls git, and notifies by IRC/email.

Change 357741 merged by jenkins-bot:
[integration/config@master] Docker for operations-puppet-tests

https://gerrit.wikimedia.org/r/357741

Change 360091 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Add non-voting puppet docker job

https://gerrit.wikimedia.org/r/360091

Change 360091 merged by jenkins-bot:
[integration/config@master] Add non-voting puppet docker job

https://gerrit.wikimedia.org/r/360091

Mentioned in SAL (#wikimedia-operations) [2017-06-19T22:14:36Z] <thcipriani|afk> Added non-voting operations-puppet-tests-docker job for operations/puppet repo, should (hopefully) be fast, and will timeout after 1 minute if it's not. More info https://gerrit.wikimedia.org/r/#/c/360091/ + T166888

Mentioned in SAL (#wikimedia-operations) [2017-06-19T22:14:36Z] <thcipriani|afk> Added non-voting operations-puppet-tests-docker job for operations/puppet repo, should (hopefully) be fast, and will timeout after 1 minute if it's not. More info https://gerrit.wikimedia.org/r/#/c/360091/ + T166888

I added the docker job as a non-voting job to the test-prio pipeline for the operations/puppet repo; the results can be viewed at: https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/

This job will run on 1 of 6 docker executors in the integration project and will run both rake and tox in parallel; in testing it takes ~30 seconds to run.

As I said in the SAL message, it should hopefully be fast, and will time out after 1 minute if it's not. If need be, the job can be removed by reverting https://gerrit.wikimedia.org/r/#/c/360091/ and redeploying the configuration (https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Deploy_configuration).

This should be a low-disruption deployment that will be used to determine whether this job is fast, maintainable, and generates comparable results to the current job https://integration.wikimedia.org/ci/job/operations-puppet-tests-jessie/

Change 360363 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] puppet docker: add missing wrappers

https://gerrit.wikimedia.org/r/360363

I have rebased the series of patches for operations/puppet.git that slightly enhance the Rakefile, based on Faidon's suggestions.

Change 360363 merged by jenkins-bot:
[integration/config@master] puppet docker: add missing wrappers

https://gerrit.wikimedia.org/r/360363

hashar removed hashar as the assignee of this task. Jul 7 2017, 3:13 PM

Status update

There are a few patches for puppet.git that are pending review/testing.

https://gerrit.wikimedia.org/r/#/c/359457/ (tests: disable ruby output buffering)
https://gerrit.wikimedia.org/r/#/c/359951/ (Rake: memoize git_changed_in_head())
https://gerrit.wikimedia.org/r/#/c/357804/ (Rake: optimize typos task for CI)

The last seems to nicely speed up the typo check.

For docker

The intent is to migrate the Jenkins job for operations/puppet.git to run on a Docker image. There are three instances provisioned on labs for now. There is some polish needed though:

  • use a Docker registry that is not the ci-staging
  • Review of https://gerrit.wikimedia.org/r/#/c/320942/ contint: New role for Docker based CI slave
  • mirror operations/puppet.git in the docker host instances to speed up clone

FTR, as I mentioned on IRC, these three changes are continuing down the path of accumulating these kinds of one-off hacks in the big ball of mud that the Rakefile is, rather than (re)writing it properly, e.g. in the way I described above. They also address only one of the many costly checks (the typos one), and not e.g. RuboCop or the pep8/flake8 tests that are even more expensive, so they're not really a fix for what I described.

RelEng plans to work on the Docker Jenkins job until completion in the short term. This in and of itself should get the job runtime down to about 30 seconds.

Antoine proposed those incremental speedup changes to the Rakefile for ops/puppet, but if Operations would like to rewrite the entire ops/puppet Rakefile we can assist in that endeavor when you all have the time.

Change 366726 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Dockerfiles use build container pattern

https://gerrit.wikimedia.org/r/366726

Change 366726 merged by jenkins-bot:
[integration/config@master] Dockerfiles use build container pattern

https://gerrit.wikimedia.org/r/366726

Change 366591 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] rake: new rakefile specifically for CI

https://gerrit.wikimedia.org/r/366591

Change 369870 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] operations/puppet rake now invokes tox

https://gerrit.wikimedia.org/r/369870

Change 366591 merged by Giuseppe Lavagetto:
[operations/puppet@production] rake: new rakefile tailored for CI

https://gerrit.wikimedia.org/r/366591

Change 369870 merged by jenkins-bot:
[integration/config@master] operations/puppet rake now invokes tox

https://gerrit.wikimedia.org/r/369870

I rewrote the Rakefile according to @faidon's suggestions and tweaked the Dockerfile/run environment a bit, and now an average job takes less than 20 seconds to execute, with simple changes taking less than 10 seconds. All this while running specs if needed.

There are still plenty of performance bottlenecks, including (I suspect) container I/O performance on the labs instances, that make jobs that run all the specs quite slow (e.g. https://gerrit.wikimedia.org/r/#/c/369923/), but that's an exceptional case and I think we can tolerate it for now. It's a direction that's worth investigating in the future though.

My next plan is to start adding rather simple specs to the role module, basically just to verify all roles compile, and then seek out how to run specs with the future parser as well.
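
Such a "does this role compile" spec would look roughly like this with rspec-puppet (role name and facts are placeholders; the repo's real spec layout may differ):

# Sketch of a minimal role compilation spec using rspec-puppet.
require 'spec_helper'

describe 'role::example' do
  let(:facts) { { osfamily: 'Debian', lsbdistrelease: '8.7' } }

  it { is_expected.to compile.with_all_deps }
end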

Change 359951 abandoned by Hashar:
Rake: memoize git_changed_in_head()

Reason:
Got aced differently by Giuseppe

https://gerrit.wikimedia.org/r/359951

Change 357804 abandoned by Hashar:
Rake: optimize typos task for CI

Reason:
Got aced differently by Giuseppe

https://gerrit.wikimedia.org/r/357804