
Test restarting puppetmaster workers for every code deploy
Closed, Declined (Public)

Description

From the Puppet docs (https://docs.puppet.com/puppet/3.8/configuration.html#environmenttimeout):

We recommend setting environment_timeout to unlimited and explicitly refreshing your Puppet master as part of your code deployment process.

  • With Puppet Server, you should refresh environments by calling the environment-cache API endpoint. See the docs for the Puppet Server administrative API.
  • With a Rack Puppet master, you should restart the web server or the application server. Passenger lets you touch a restart.txt file to refresh an application without restarting Apache; see the Passenger docs for details.

We don’t recommend using any value other than 0 or unlimited, since most Puppet masters use a pool of Ruby interpreters which all have their own cache timers. When these timers drift out of sync, agents can be served inconsistent catalogs.

So we should look into the passenger restart mechanism (I vaguely remember it being half-finished in the free version) and possibly integrate it with puppet-merge.
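
For reference, a minimal sketch of the restart.txt mechanism quoted above. The `request_passenger_restart` helper and the application root passed to it are hypothetical; the only thing taken from the docs is that touching tmp/restart.txt under the application root asks Passenger to reload the app. How this would hook into puppet-merge is left open.

```python
from pathlib import Path


def request_passenger_restart(app_root):
    """Touch tmp/restart.txt under app_root and return the file's path.

    app_root is an assumption: the directory Passenger treats as the
    root of the puppetmaster Rack application.
    """
    restart_file = Path(app_root) / "tmp" / "restart.txt"
    # Make sure tmp/ exists; harmless if it is already there.
    restart_file.parent.mkdir(parents=True, exist_ok=True)
    # Creates the file, or bumps its modification time if it exists.
    restart_file.touch()
    return restart_file
```

Calling it with the real application root once the new code is in place would be the whole integration; the restart itself still happens inside Passenger.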

Event Timeline

Joe moved this task from Backlog to Doing on the User-Joe board.

The Passenger docs are pretty clear about this: in the FLOSS version of Passenger, a restart is blocking. That is, Passenger waits for all currently spawned workers to stop, and for at least one new worker to be spawned, before serving a new request. Requests arriving in the meantime are queued, but if some expensive catalog is being built, that means ~30 seconds of blocking.

I don't think this is feasible unless we do something along the following lines:

  • Make the puppetmaster-backend an LVS service
  • Have the puppetmaster frontends connect to that LVS service (with one twist: the frontends should not host a backend as well, since we use LVS-DR)
  • depool/restart/pool each backend individually
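
The depool/restart/pool step could be sketched roughly as below. Everything here is hypothetical: `rolling_restart`, the `depool`, `restart` and `pool` callables, and the `settle_seconds` pause stand in for whatever LVS pooling tooling and Passenger restart command would actually be used.

```python
import time


def rolling_restart(backends, depool, restart, pool,
                    settle_seconds=3.0, sleep=time.sleep):
    """Depool, restart and repool each backend, one at a time.

    depool, restart and pool are caller-supplied callables that take a
    host name; they are placeholders for the real tooling.
    Returns the list of hosts that were fully processed.
    """
    done = []
    for host in backends:
        depool(host)           # stop sending new requests to this backend
        restart(host)          # if this raises, the host stays depooled
        sleep(settle_seconds)  # give the new workers a moment to spawn
        pool(host)             # put the backend back into rotation
        done.append(host)
    return done
```

With only one backend out of rotation at a time, the frontends always have another backend to talk to, so no request has to wait for the blocking restart.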

I will nonetheless experiment with restarting Passenger on our backends to see what the effect is in our live environment.

Mentioned in SAL (#wikimedia-operations) [2017-07-03T08:58:56Z] <_joe_> restarting the passenger app on puppetmaster1002 for T169493

Mentioned in SAL (#wikimedia-operations) [2017-07-03T09:07:30Z] <_joe_> restarting the passenger app on puppetmasters in codfw serially with a sleep of 3 seconds for T169493

Luckily, none of my attempts at restarting one or more of the puppetmaster backends caused any puppet errors, but it takes up to 4 minutes for a restart to take effect. I'm not sure this is acceptable with our current practices.

Our puppetmasters are so underloaded that we can safely assume they won't take a perf hit unless we create very complex environments.

So I'll decline this ticket for now and get back to it only if performance hits are significant.