Historically, the Labs puppetmasters have been running in the production realm, for various legacy reasons. Early on, Labs (and now WMCS) gained support for self-hosted per-instance puppetmasters, and later for self-hosted per-project puppetmasters. Since then, the two (arguably) most important projects, deployment-prep and tools, have moved to self-hosted puppetmasters.
Having the WMCS "main" puppetmasters in the production realm is yet another labs->production realm bridge (or: a "labs-support" instance). It's especially iffy since Puppet is a complex codebase by itself, and complicated even further by the fact that it is essentially a compiler-on-demand for dynamic, living code. Such a jump has been exploited as a demonstration before and it wasn't that hard to achieve either (let's leave it at that :), so this isn't just hypothetical.
Puppet for WMCS instances doesn't need any kind of private data and there is really no particular reason other than legacy for why it runs in the production realm (as demonstrated by the various project puppetmasters too), so I'd like to discuss the path towards its eventual move to the labs realm. It's not super urgent or anything, but I've been thinking about this for a while and got reminded of it with the recent labspuppetmaster work -- and it turns out I never filed a task about it (that I could find) :)
So, I think a few different ideas have been mentioned on how to approach this (and feel free to adjust/correct):
- Move to "labs-support" (= production realm, public IP, but accessible only to Labs): would probably work and be an improvement over the current situation but not really moving to the labs realm and likely not enough.
- Deploy the puppetmasters in multiple VMs, perhaps even across multiple labvirts for increased reliability.
- Deploy a couple of puppetmasters VMs, perhaps allocated in a way that there's only 1 VM running in each bare metal server (I think this has been done already in a few other cases?).
- Wait until WMCS supports bare metal instances in the Labs realm, move the existing bare metal machines there (blocked on Neutron, I guess Ironic too?)
My inclination would be to just go with (2), which doesn't sound like a huge amount of work to me given that all the various parts are there, but I may be missing a lot of background.
How do you (cloud-services-team) folks feel about this? What pros/cons do you see in each and which one is your preferred solution?
TODO:
- Gather input on T220268 (Consider ways to make puppetmaster CA changes smoother on the puppet client end) and decide what to do - https://gerrit.wikimedia.org/r/506872 and https://gerrit.wikimedia.org/r/506873
- Sort out SSH access rules, keys, etc. from the designate hosts, for the SSH call made by modules/openstack/files/mitaka/designate/wmf_sink/base.py - https://gerrit.wikimedia.org/r/514454
- Figure out why I needed to edit two separate copies of the puppet.pem files in labs/private - changed profile::openstack::eqiad1::puppetmaster::cert_secret_path
- Test that encapi is working (a rough smoke-test sketch follows this list)
- Determine whether I should try rebuilding the new instances and write down all the steps
- Consider what changes we'll need to make to images, if any (none, we think)
- T219390: Have puppet-merge on puppetmaster1001 publish the official sha1 after merging
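For the "test that encapi is working" item, a quick smoke test along the lines below might do. This is only a sketch under assumptions: the service URL, the /v1/<project>/node/<fqdn> path and the expected top-level keys are guesses at the ENC API's shape, not taken from labspuppetbackend itself -- adjust to whatever the real reader endpoint is.

```
#!/usr/bin/env python3
"""Rough encapi smoke test -- endpoint URL/path and expected keys are assumptions."""
import sys

import requests
import yaml

# Placeholder reader endpoint and test instance; substitute the real values.
ENCAPI = "http://new-encapi.example.wmflabs:8100"
PROJECT = "testlabs"
FQDN = "encapi-test.testlabs.eqiad.wmflabs"


def main():
    # Fetch the ENC answer for one known instance and check it parses as YAML.
    resp = requests.get(f"{ENCAPI}/v1/{PROJECT}/node/{FQDN}", timeout=10)
    resp.raise_for_status()
    node = yaml.safe_load(resp.text)
    if not isinstance(node, dict):
        print("ENC output is not a YAML mapping")
        return 1
    # A puppet ENC answer normally carries 'classes' (roles) and/or 'parameters' (hiera).
    missing = [k for k in ("classes", "parameters") if k not in node]
    if missing:
        print(f"WARNING: ENC output missing keys: {missing}")
    print(yaml.safe_dump(node, default_flow_style=False))
    return 0


if __name__ == "__main__":
    sys.exit(main())
```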
And then finally, if everyone is happy to go ahead:
- Make the old encapi read-only on labpuppetmaster1001 (somehow -- kill the r/w endpoint: disable puppet, then edit the config in /etc/uwsgi/apps-enabled/labspuppetbackend.ini to set the allowed writers to an empty list or something invalid; a couple of post-cutover sanity checks are sketched after this list)
- Import encapi data
- Direct puppetmasters to the new encapi reader endpoint (patch)
- Move infrastructure (e.g. Horizon) over to talking to the new puppetmaster (patch)
- Verify that in-project puppetmasters still work properly (e.g. toolforge puppetmaster)
- Fix dns recursor hack to point 'puppet' domain to new puppetmaster (patch)
- Test that new VMs come up and work!
- Move existing instances over to using the new puppetmaster (patch)
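Once the cutover steps are done, a couple of the checks above (that 'puppet' resolves to the new puppetmaster, and that the old encapi really rejects writes) could be scripted roughly as below. Everything in it -- hostnames, IP, port and the write path -- is a placeholder/assumption, so treat it as a sketch of the idea rather than a ready-made check; it would need to run from a Labs instance so the bare 'puppet' name resolves via the instance's search domain.

```
#!/usr/bin/env python3
"""Post-cutover sanity checks -- all hostnames/endpoints below are placeholders."""
import socket

import requests

NEW_PUPPETMASTER_IP = "172.16.0.10"  # placeholder for the new puppetmaster VM
OLD_ENCAPI = "http://labpuppetmaster1001.example.org:8100"  # placeholder URL


def check_puppet_dns():
    # The DNS recursor hack should now point the bare 'puppet' name at the new master.
    resolved = socket.gethostbyname("puppet")
    print(f"'puppet' resolves to {resolved}")
    assert resolved == NEW_PUPPETMASTER_IP, "'puppet' still points at the old master"


def check_old_encapi_readonly():
    # A write against the old encapi should now fail (4xx/5xx), since the
    # allowed-writers list in its uwsgi config was emptied. The path below is
    # a made-up example of a write endpoint, not the real API.
    resp = requests.post(
        f"{OLD_ENCAPI}/v1/testlabs/prefix/encapi-test/roles",
        data="[]",
        timeout=10,
    )
    print(f"old encapi write attempt -> HTTP {resp.status_code}")
    assert resp.status_code >= 400, "old encapi still accepts writes!"


if __name__ == "__main__":
    check_puppet_dns()
    check_old_encapi_readonly()
```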