
Build new tools puppetmaster
Closed, Resolved (Public)

Description

Thanks to a bad puppet patch (https://gerrit.wikimedia.org/r/#/c/361675/) I totally broke tools-puppetmaster-02. Various rescue attempts notwithstanding, I wound up stuck on

"Warning: Could not intern from text/plain: nested asn1 error"

After hours of failed attempts to fix this, I just built a new puppetmaster, tools-puppetmaster-01. It seems to work fine as a puppetmaster. Actually moving clush to the new box is blocked by T169099.

In the beginning there was

https://gerrit.wikimedia.org/r/#/c/361675/

Which was a no-op on the production puppetmasters but not a no-op on labs puppetmasters due to my over-pruning. So, I swiftly reverted with

https://gerrit.wikimedia.org/r/#/c/361710/

Which would've fixed things. Except, by that time, the apache config for all self-hosted puppetmasters was broken. And, most of those puppetmasters are their /own/ puppetmasters which meant they couldn't fix themselves... So, I went through the list of involved hosts (via watroles) and pasted in the missing bits in order to kickstart things and everything should be fine now.

Except... when I fixed the tools puppetmaster, it started saying

"Warning: Could not intern from text/plain: nested asn1 error"

I do not know what that is, and Google doesn't know what it is either. Google did provide a slight hint, though: someone at Puppet Labs responded to a bug report about that message with no explanation, just a terse 'this is fixed in the next version'.

So... since labs-puppetmaster-02 was running a 3.7-series package and we're using 3.8 packages elsewhere, I did an 'apt-get install puppetmaster' to move to 3.8. Except, there was some latent unpuppetized pinning in the apt config for that box (a different, also long story) which meant that instead of upgrading to 3.8 it upgraded to 4.something and after that I was well and truly doomed. No amount of cert regenerating or rebooting or de- and re-installing puppet packages would make that 'could not intern' error go away or anything run cleanly.

So, I built a new puppetmaster for tools, tools-puppetmaster-01, copied the custom private patches on -02 over to -01, clobbered all the certs on all the existing tools instances, and switched everything over to the new puppetmaster. The dance of getting certs cleaned and initial puppet runs started before an old cron'd puppet run could fire and create a new but broken cert turned out to be intricate and frustrating. It took hours and resulted in the many, many shinken emails that you all received.
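For anyone repeating that dance, here is a rough sketch of the per-instance ordering, not a record of the exact commands I ran. It assumes the stock Puppet 3 SSL directory (/var/lib/puppet/ssl), that disabling the agent is enough to keep the cron'd run from firing mid-reset, and that a one-off --server flag stands in for however the instance is really pointed at its master. The helper name cert_reset_commands is made up for illustration.

```python
import shlex


def cert_reset_commands(new_master, ssl_dir="/var/lib/puppet/ssl"):
    """Return, in order, the shell commands to run on one agent instance.

    Hypothetical helper: the ssl_dir default and the use of --server are
    assumptions about the instance, not something taken from this task.
    """
    return [
        # 1. Take the agent lock first so a cron'd run cannot start
        #    mid-reset and request a fresh (broken) cert.
        "puppet agent --disable " + shlex.quote("cert reset in progress"),
        # 2. Clobber the old client certs and keys.
        "rm -rf " + shlex.quote(ssl_dir),
        # 3. Release the lock.
        "puppet agent --enable",
        # 4. Immediately do the first run against the new master, which
        #    generates a new key and cert request.
        "puppet agent --test --server " + shlex.quote(new_master),
    ]


if __name__ == "__main__":
    for command in cert_reset_commands("tools-puppetmaster-01"):
        print(command)
```

The point of the ordering is that the lock is held for the whole window in which the instance has no certs. The new master still has to sign (or autosign) the resulting request, which this sketch does not cover.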

This was good practice for me since I'm going to have to migrate the rest of labs to a new puppetmaster soon anyway... but I apologize for all the racket.

Event Timeline

bd808 triaged this task as High priority. Jul 3 2017, 6:41 PM
bd808 subscribed.

Marking as high because we need to get clush running again somewhere sooner rather than later.

A local machine setup for using clush can be found at https://github.com/bd808/wikimedia-cloud-vps-hostgroup-generator and P5831. This can temporarily bridge over the lack of clush in Tools. I have mixed feelings about whether this method's reliance on the openstack-browser tool is acceptable for replacing the current, broken setup in the Tools project.

clush is back up and running on a new host. The only action-item remaining here is the deletion of the old tools-puppetmaster-02.