Test restarting puppetmaster workers for every code deploy
Closed, Declined (Public)

Assigned To

None

Authored By

	Joe
	Jul 3 2017, 8:42 AM

Description

From the Puppet docs (https://docs.puppet.com/puppet/3.8/configuration.html#environmenttimeout):

We recommend setting environment_timeout to unlimited and explicitly refreshing your Puppet master as part of your code deployment process.

With Puppet Server, you should refresh environments by calling the environment-cache API endpoint. See the docs for the Puppet Server administrative API.

With a Rack Puppet master, you should restart the web server or the application server. Passenger lets you touch a restart.txt file to refresh an application without restarting Apache; see the Passenger docs for details.

We don’t recommend using any value other than 0 or unlimited, since most Puppet masters use a pool of Ruby interpreters which all have their own cache timers. When these timers drift out of sync, agents can be served inconsistent catalogs.

So we should look into the passenger restart mechanism (I vaguely remember it being half-finished in the free version) and possibly integrate it with puppet-merge.
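For reference, a minimal sketch of what the restart.txt mechanism from the quoted docs could look like as a post-merge step. The function name and the application root passed to it are hypothetical; this is not how puppet-merge works today, only an illustration of touching tmp/restart.txt under a Passenger application root.

```python
from pathlib import Path


def request_passenger_restart(app_root: Path) -> Path:
    """Touch tmp/restart.txt under app_root and return the file's path.

    app_root is a placeholder for wherever the Rack puppetmaster
    application lives; the real path is not stated in this task.
    """
    restart_file = app_root / "tmp" / "restart.txt"
    # Create tmp/ if it is missing, then create the file or bump its mtime.
    restart_file.parent.mkdir(parents=True, exist_ok=True)
    restart_file.touch()
    return restart_file
```

A deploy hook would call this once per backend after the code is in place. Whether the restart is blocking depends on the Passenger edition in use.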

Related Objects

Status	Assigned	Task
Resolved	aborrero	T178717 Upgrade wmcs instances and masters to puppet 4.8
Resolved	None	T177254 Upgrade to puppet 4 (4.8 or newer)
Resolved	None	T169548 Prepare for Puppet 4
Resolved	Joe	T169485 Add support for directory environments to our puppet classes, production puppetmaster
Declined	None	T169493 Test restarting puppetmaster workers for every code deploy

Event Timeline

Joe created this task. Jul 3 2017, 8:42 AM

Joe moved this task from Backlog to Doing on the User-Joe board.

The Passenger docs are pretty clear about this: in the FLOSS version of Passenger, a restart is blocking. That is, Passenger will wait for all currently spawned workers to stop, and for at least one new worker to be spawned, before serving a new request. Incoming requests are queued in the meantime, but if an expensive catalog is being built, that means ~30 seconds of blocking.

I don't think this is feasible unless we do something along the following lines:

Make the puppetmaster-backend an LVS service
Have the puppetmaster frontends connect to that LVS service (with one twist: the frontends should not host a backend as well, since we use LVS-DR)
Depool/restart/pool the single backend
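The last step above can be sketched roughly as follows. The depool, restart and pool callables are hypothetical placeholders for whatever tooling would talk to LVS and Passenger; nothing here is an existing API.

```python
from typing import Callable


def restart_depooled(
    backend: str,
    depool: Callable[[str], None],
    restart: Callable[[str], None],
    pool: Callable[[str], None],
) -> None:
    """Take one backend out of rotation, restart it, then put it back.

    All three callables are assumptions supplied by the caller.
    If restart raises, the backend is deliberately left depooled so
    that a broken puppetmaster does not receive traffic.
    """
    depool(backend)
    restart(backend)
    pool(backend)
```

With the backend depooled, the blocking restart no longer queues agent requests, which is the whole point of putting the backends behind LVS.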

I will nonetheless experiment with restarting Passenger on our backends to see what the effect is in our live environment.

Mentioned in SAL (#wikimedia-operations) [2017-07-03T08:58:56Z] <_joe_> restarting the passenger app on puppetmaster1002 for T169493

Mentioned in SAL (#wikimedia-operations) [2017-07-03T09:07:30Z] <_joe_> restarting the passenger app on puppetmasters in codfw serially with a sleep of 3 seconds for T169493
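A rough sketch of the serial restart logged above, assuming a caller-supplied restart callable (hypothetical; the actual command used is not recorded in this task). The 3 second pause matches the SAL entry.

```python
import time
from typing import Callable, Iterable, List


def restart_serially(
    hosts: Iterable[str],
    restart: Callable[[str], None],
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart each host in order, pausing between consecutive hosts.

    restart is a placeholder for the real restart action; sleep is
    injectable so the loop can be exercised without actually waiting.
    Returns the hosts that were restarted, in order.
    """
    restarted: List[str] = []
    for index, host in enumerate(hosts):
        if index > 0:
            # Pause only between hosts, not before the first one.
            sleep(delay_seconds)
        restart(host)
        restarted.append(host)
    return restarted
```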

So all my attempts at restarting one or multiple instances of the puppetmaster backends luckily didn't cause any puppet errors, but it takes up to 4 minutes for a restart to take effect. I'm not sure this is acceptable with our current practices.

Our puppetmasters are so underloaded that we can safely assume they won't take a performance hit unless we create very complex environments.

So I'll decline this ticket for now and get back to it only if performance hits are significant.

Joe closed this task as Declined. Jul 3 2017, 9:51 AM
