
Sporadic puppet failures
Closed, Resolved (Public)

Description

Since moving to Puppet 7, puppet runs have started to have sporadic failures with error messages such as:

/Stage[main]/Base::Screenconfig/File[/root/.screenrc]

Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/base/screenrc

These errors happen at the time of puppet-merge and are a by-product of using g10k. When we perform a puppet-merge on the puppet servers, we update a repo in /srv/git/puppet and then use g10k to sync the data to /srv/puppet_code. During the sync there is a short period when there is no data in /srv/puppet_code, and as such we see these errors. We did not have this issue with the puppet masters, as puppet-merge updates the puppet code directory directly; this has its own issue, in that agents could get the wrong file or a partially written file.

We should investigate whether we can improve g10k to make the sync atomic. We could also consider removing g10k, as this is not currently used; however, I think that if we can keep it, it opens up some possibilities for future improvements.

Event Timeline

jbond triaged this task as Medium priority.Nov 8 2023, 4:38 PM
jbond created this task.
Restricted Application added a subscriber: Aklapper.

I took a brief look around to try to understand what other folks are doing, but documentation is surprisingly sparse.

Code Manager takes care of this in the enterprise version. They discussed implementing something similar in the open-source version but never did: https://puppet.atlassian.net/browse/SERVER-1234.

Looks like most people cobble together something like:

  1. Stage new code
  2. Sync code to puppetservers
  3. Symlink new code into place
  4. Send a HUP signal to the puppetserver
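Step 3 is the part that removes the window where agents see an empty code directory. A minimal sketch of an atomic symlink swap follows, assuming a POSIX filesystem; the function name and the "release directory" layout are hypothetical and not taken from puppet-merge or g10k:

```python
import os
import tempfile


def swap_symlink(new_target: str, link_path: str) -> None:
    """Atomically point the symlink at link_path to new_target.

    A temporary symlink is created in a private directory next to
    link_path and then renamed over it. rename(2) replaces the
    destination in a single step, so a reader resolving link_path sees
    either the old target or the new one, never a missing path.

    Assumption: link_path is absent or already a symlink. If it is a real
    directory, the rename fails instead of replacing it.
    """
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # Same parent directory means same filesystem, which rename requires.
    tmp_dir = tempfile.mkdtemp(prefix=".swap-", dir=link_dir)
    tmp_link = os.path.join(tmp_dir, "link")
    try:
        os.symlink(new_target, tmp_link)
        os.replace(tmp_link, link_path)
    finally:
        # Clean up if the rename did not happen.
        if os.path.lexists(tmp_link):
            os.unlink(tmp_link)
        os.rmdir(tmp_dir)
```

With this approach the new code is staged in full in its own directory first, and only the final rename is visible to puppetserver.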

The last step is also interesting, as it relates to puppetserver's caching ability. Evidently, in the enterprise version environment_timeout is set to unlimited and Code Manager takes care of hitting puppetserver's REST endpoint to tell it to drop its caches when new code is pushed. A similar result can be obtained by HUPing puppetserver, which is mentioned in the ticket. I tried that in the dcl lab environment and it significantly reduces successive puppet run times:

Puppet applies on pki1001:

  • With environment_timeout = 0 (default): 22 seconds
  • With environment_timeout = unlimited: 14 seconds
  • With environment_timeout = unlimited, followed by systemctl reload puppetserver: 22 seconds
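For reference, the REST-based cache eviction mentioned above can be sketched as follows. This assumes Puppet Server's admin API endpoint `DELETE /puppet-admin-api/v1/environment-cache` on the default port 8140, and a client certificate that the server is configured to authorize; the host name, function names and certificate paths are placeholders, not values from this task:

```python
import ssl
import urllib.parse
import urllib.request


def build_cache_flush_request(server, environment=None, port=8140):
    """Build (but do not send) the environment-cache eviction request.

    With environment=None the request asks for all environment caches to
    be dropped; otherwise only the named environment is targeted.
    """
    url = f"https://{server}:{port}/puppet-admin-api/v1/environment-cache"
    if environment is not None:
        url += "?" + urllib.parse.urlencode({"environment": environment})
    return urllib.request.Request(url, method="DELETE")


def flush_environment_cache(server, cert, key, ca, environment=None):
    """Send the eviction request using a client certificate.

    cert, key and ca are file paths (hypothetical here). Returns the HTTP
    status code of the response.
    """
    ctx = ssl.create_default_context(cafile=ca)
    ctx.load_cert_chain(certfile=cert, keyfile=key)
    req = build_cache_flush_request(server, environment)
    with urllib.request.urlopen(req, context=ctx) as resp:
        return resp.status
```

Compared with a reload, this would evict only the environment cache, but it needs the admin API to be reachable and authorized, which is an extra piece of configuration.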

@jhathaway thanks for investigating. By the sounds of it, we could probably have a bit of a win if we:

  • set environment_timeout = unlimited
  • update puppet-merge to do a systemctl reload puppetserver after g10k

I agree the best solution would be to make use of symlinks but by the sounds of it the above would still be an improvement?
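The ordering proposed above (sync first, reload only afterwards) could be sketched like this. It is an illustration under assumptions, not the actual puppet-merge change: the function name is hypothetical, the sync command is passed in by the caller, and only `systemctl reload puppetserver` comes from this discussion:

```python
import subprocess
from typing import Callable, Sequence


def _run_checked(cmd: Sequence[str]) -> None:
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(list(cmd), check=True)


def deploy_and_reload(
    sync_cmd: Sequence[str],
    reload_cmd: Sequence[str] = ("systemctl", "reload", "puppetserver"),
    run: Callable[[Sequence[str]], None] = _run_checked,
) -> None:
    """Run the code sync, then ask puppetserver to drop its caches.

    If the sync step raises, the reload is skipped, so the server keeps
    serving its cached copy of the previous code.
    """
    run(sync_cmd)
    run(reload_cmd)
```

A caller might invoke it as `deploy_and_reload(("g10k", "-config", "/path/to/g10k.yaml"))`, where the config path is a placeholder. With environment_timeout = unlimited, agents would only pick up the new code after the reload step completes.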

jbond raised the priority of this task from Medium to High.Nov 14 2023, 4:16 PM
jbond removed a project: Puppet CI.

Set priority to High as this is causing issues.

Change 974283 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] puppetserver: cache code

https://gerrit.wikimedia.org/r/974283

Change 974283 merged by Jbond:

[operations/puppet@production] puppetserver: cache code

https://gerrit.wikimedia.org/r/974283

jhathaway claimed this task.

Looking at Puppetboard, we are still having issues when we do a puppet-merge. The following are times in UTC when a puppet-merge occurred; at each of these times we had 8-10 puppet failures:

  • 10:55:46
  • 11:02:35
  • 11:07:32

Thanks for reopening, @jbond. I'll take a look at those.

Change 976857 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] puppetserver: use a symlink to swap in new code

https://gerrit.wikimedia.org/r/976857

Change 976857 merged by JHathaway:

[operations/puppet@production] puppetserver: use a symlink to swap in new code

https://gerrit.wikimedia.org/r/976857

Change 976878 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] puppet-merge: don't symlink environments

https://gerrit.wikimedia.org/r/976878

Change 976878 merged by JHathaway:

[operations/puppet@production] puppet-merge: don't symlink environments

https://gerrit.wikimedia.org/r/976878

Change 977185 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] puppet-merge: Fix up help message

https://gerrit.wikimedia.org/r/977185

Change 977184 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] puppet-merge: add prometheus metrics

https://gerrit.wikimedia.org/r/977184

Change 977185 merged by Jbond:

[operations/puppet@production] puppet-merge: Fix up help message

https://gerrit.wikimedia.org/r/977185

Change 977184 merged by Jbond:

[operations/puppet@production] puppet-merge: add prometheus metrics

https://gerrit.wikimedia.org/r/977184

Change 978017 had a related patch set uploaded (by Jbond; author: Jbond):

[operations/puppet@production] puppet-merge: add prometheus metrics

https://gerrit.wikimedia.org/r/978017

I looked at a few puppet-merge times in the puppetserver logs and compared those to Puppetboard. I am not seeing sporadic failures anymore, so resolving.