
Move the main WMCS puppetmaster into the Labs realm
Closed, Resolved (Public)

Description

Historically, the Labs puppetmasters have been running in the production realm, for various legacy reasons. Early on, Labs (and now WMCS) gained support for self-hosted per-instance puppetmasters, and later for self-hosted per-project puppetmasters. Since then, the two (arguably) most important projects, deployment-prep and tools, have moved to self-hosted puppetmasters.

Having the WMCS "main" puppetmasters in the production realm is yet another labs->production realm bridge (or: a "labs-support" instance). It's especially iffy since Puppet is a complex codebase by itself, and complicated even further by the fact that it is essentially a compiler-on-demand for dynamic, living code. Such a jump has been exploited as a demonstration before and it wasn't that hard to achieve either (let's leave it at that :), so this isn't just hypothetical.

Puppet for WMCS instances doesn't need any kind of private data and there is really no particular reason other than legacy for why it runs in the production realm (as demonstrated by the various project puppetmasters too), so I'd like to discuss the path towards its eventual move to the labs realm. It's not super urgent or anything, but I've been thinking about this for a while and got reminded of it with the recent labspuppetmaster work -- and it turns out I never filed a task about it (that I could find) :)

So, I think a few different ideas have been mentioned on how to approach this (and feel free to adjust/correct):

  1. Move to "labs-support" (= production realm, public IP, but accessible only to Labs): would probably work and be an improvement over the current situation but not really moving to the labs realm and likely not enough.
  2. Deploy the puppetmasters in multiple VMs, perhaps even across multiple labvirts for increased reliability.
  3. Deploy a couple of puppetmaster VMs, perhaps allocated in a way that there's only 1 VM running in each bare metal server (I think this has been done already in a few other cases?).
  4. Wait until WMCS supports bare metal instances in the Labs realm, move the existing bare metal machines there (blocked on Neutron, I guess Ironic too?)

My inclination would be to just go with (2), which doesn't sound like a huge amount of work to me given that all the various parts are there, but I may be missing a lot of background.

How do you (cloud-services-team) folks feel about this? What pros/cons do you see in each and which one is your preferred solution?


TODO:

And then finally if everyone is happy to go ahead:

  • make old encapi read-only on labpuppetmaster1001 (somehow -- kill the r/w endpoint) (disable puppet, edit config in /etc/uwsgi/apps-enabled/labspuppetbackend.ini to set allowed writers to empty list or something invalid)
  • Import encapi data
  • Direct puppetmasters to the new encapi reader endpoint (patch)
  • Move infrastructure over to talking to new puppetmaster - e.g. horizon (patch)
  • Verify that in-project puppetmasters still work properly (e.g. toolforge puppetmaster)
  • Fix dns recursor hack to point 'puppet' domain to new puppetmaster (patch)
  • test that new VMs come up and work!
  • Move instances over to using new puppetmaster (patch)

Details

Repo               Branch      Lines +/-
operations/puppet  production  +2 -2
operations/puppet  production  +10 -16
operations/puppet  production  +0 -33
operations/puppet  production  +35 -2
operations/puppet  production  +4 -3
operations/puppet  production  +4 -3
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +3 -3
operations/puppet  production  +5 -5
operations/puppet  production  +9 -3
operations/puppet  production  +14 -0
operations/puppet  production  +0 -2
operations/puppet  production  +6 -19
operations/puppet  production  +5 -0
operations/puppet  production  +2 -2
operations/puppet  production  +7 -3

Event Timeline

JFTR, I don't know what cloudinfra-puppetmaster-01 is. Maybe @Krenair or someone else set up that?

I don't have access to do that. I assume this is a project puppetmaster for either the MX-out or NTP servers that exist in that project.

According to horizon cloudinfra-puppetmaster-01 was created by Andrew in September. It has no signed puppet certs currently, so safe to say nothing is using it as of yet. @Andrew is that something we still need/plan to use, or could we turn down the instance?

As arturo suggests, cloudinfra-puppetmaster-01 is meant to be the puppetmaster for things inside the cloudinfra project. I anticipated us needing that for project-local secrets -- I'm surprised that e.g. mx-out01 doesn't need it... it certainly will if we add DKIM keys.

So, we can delete the VM if it's confusing people but I'll just need to rebuild it sometime soon :)

Nothing has been done regarding the actual topic of this bug. It's a perfectly reasonable idea but not on the top of the priority list and work there is pending some decisions in other areas.

Ok, clarified then:

  • cloudinfra-puppetmaster-01 is a puppetmaster server just for the cloudinfra project. Still not in use though.
  • we will have to discuss if we create a cloudvps-wide puppetmaster inside cloudvps (i.e, a VM).

#2 is almost certainly the way to go, as it avoids the weird chicken-egg issue of "we need a labs puppetmaster to build a labs puppetmaster" -- currently I can't even log into a new VM /at all/ until it's properly puppetized. So to move forward on this we would need some way of accessing an unpuppetized VM.

For the record -- T215211 is largely resolved, and with that I'm no longer nearly as worried about worst-case 'we locked ourselves out of everything' scenarios.

I'm planning to have a go at this soon.

Change 501581 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack::puppet::master::encapi: work on stretch with python3.5

https://gerrit.wikimedia.org/r/501581

Change 501581 merged by Andrew Bogott:
[operations/puppet@production] openstack::puppet::master::encapi: work on stretch with python3.5

https://gerrit.wikimedia.org/r/501581

Change 501587 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack::puppet::master::encapi: Avoid nginx-apache conflict

https://gerrit.wikimedia.org/r/501587

I've got puppetmaster set up on puppetmaster.cloudinfra.wmflabs.org now, hosted at cloud-puppetmaster-01 with a backend of cloud-puppetmaster-02. A test client on krenair-t171188-test.testlabs.eqiad.wmflabs is working.
Still a load of stuff to do though. It doesn't have a floating IP or any way for the OpenStack hosts to contact it yet. Had to do some manual actions to avoid apache-nginx conflicts, deal with package problems around cergen's dependencies against the openstack-mitaka-jessie repo, work around ferm's AAAA handling bugs, among other things.

Change 502235 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] profile::puppetmaster::frontend: Allow getting allow_from from hiera

https://gerrit.wikimedia.org/r/502235

Change 502235 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::puppetmaster::frontend: Allow getting allow_from from hiera

https://gerrit.wikimedia.org/r/502235

Change 501587 merged by Andrew Bogott:
[operations/puppet@production] openstack::puppet::master::encapi: Avoid nginx-apache conflict

https://gerrit.wikimedia.org/r/501587

The number of puppet.git cherry-picks on cloudinfra-internal-puppetmaster is now 0; there are just the two secret commits to labs/private that are pretty much the purpose of that instance.

TODO:

And then finally if everyone is happy to go ahead:

  • Import encapi data
  • Move infrastructure over to talking to new puppetmaster - e.g. horizon
  • Move instances over to using new puppetmaster

Andrew closed subtask Restricted Task as Resolved. (May 7 2019, 8:25 PM)

Change 509915 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Replace git-sync-upstream on labspuppetmasters, remove from puppet-merge

https://gerrit.wikimedia.org/r/509915

Change 509915 merged by Andrew Bogott:
[operations/puppet@production] Replace git-sync-upstream on labspuppetmasters, remove from puppet-merge

https://gerrit.wikimedia.org/r/509915

Looks like we regressed here while I was busy - logged onto the new puppetmasters to find puppet has been broken for weeks. Seems to be related to the clientpackages changes.

Change 511875 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack puppetmaster profiles: don't include clientpackages

https://gerrit.wikimedia.org/r/511875

Change 511877 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack puppetmaster roles: duplicate for set of profiles to be used in labs

https://gerrit.wikimedia.org/r/511877

Change 511875 merged by Andrew Bogott:
[operations/puppet@production] openstack puppetmaster profiles: don't include clientpackages

https://gerrit.wikimedia.org/r/511875

Change 511877 merged by Andrew Bogott:
[operations/puppet@production] openstack puppetmaster roles: duplicate for set of profiles to be used in labs

https://gerrit.wikimedia.org/r/511877

Change 514454 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] certmanager: Set up config for running inside labs realm

https://gerrit.wikimedia.org/r/514454

Change 514454 merged by Andrew Bogott:
[operations/puppet@production] certmanager: Set up config for running inside labs realm

https://gerrit.wikimedia.org/r/514454

encapi works:

diff --git a/modules/openstack/manifests/puppet/master/encapi.pp b/modules/openstack/manifests/puppet/master/encapi.pp
index 509af5e7f8..261e0046e9 100644
--- a/modules/openstack/manifests/puppet/master/encapi.pp
+++ b/modules/openstack/manifests/puppet/master/encapi.pp
@@ -47,7 +47,8 @@ class openstack::puppet::master::encapi(
         ipresolve($designate_host, 4),
         ipresolve($designate_host, 6),
         ipresolve($designate_host_standby, 4),
-        ipresolve($designate_host_standby, 6)]),',')
+        ipresolve($designate_host_standby, 6),
+        '127.0.0.1']),',')
 
     # We override service_settings because the default includes autoload
     #  which insists on using python2
krenair@cloud-puppetmaster-01:~$ curl "http://localhost:8101/v1/test/node/test.asd.codfw.labtest"
hiera: {}
roles: []
krenair@cloud-puppetmaster-01:~$ curl "http://localhost:8101/v1/test/prefix/test.asd.codfw.labtest/hiera" --data '{a: b}' -H 'Content-Type: application/x-yaml'
{status: ok}
krenair@cloud-puppetmaster-01:~$ curl "http://localhost:8101/v1/test/node/test.asd.codfw.labtest"
hiera: {a: b}
roles: []
krenair@cloud-puppetmaster-01:~$ curl "http://localhost:8101/v1/test/prefix/test.asd.codfw.labtest" -X DELETE
{status: ok}
krenair@cloud-puppetmaster-01:~$ curl "http://localhost:8101/v1/test/node/test.asd.codfw.labtest"
hiera: {}
roles: []
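
The same round trip can be sketched in Python. This is only an illustration of the requests made in the curl session above, not the encapi client used anywhere in production: the helper names are hypothetical, and reading the `test` path segment as a project name is an assumption. The functions only build `urllib.request.Request` objects; nothing is sent until a caller passes one to `urllib.request.urlopen`.

```python
import urllib.request

# Base URL taken from the curl session above (encapi on localhost:8101).
BASE = "http://localhost:8101/v1"


def node_request(project: str, fqdn: str) -> urllib.request.Request:
    """GET the merged roles/hiera for one node (hypothetical helper)."""
    return urllib.request.Request(f"{BASE}/{project}/node/{fqdn}", method="GET")


def set_hiera_request(project: str, prefix: str, yaml_body: str) -> urllib.request.Request:
    """POST a YAML hiera document for a prefix, as `curl --data` does."""
    return urllib.request.Request(
        f"{BASE}/{project}/prefix/{prefix}/hiera",
        data=yaml_body.encode("utf-8"),
        headers={"Content-Type": "application/x-yaml"},
        method="POST",
    )


def delete_prefix_request(project: str, prefix: str) -> urllib.request.Request:
    """DELETE a prefix, mirroring `curl -X DELETE`."""
    return urllib.request.Request(f"{BASE}/{project}/prefix/{prefix}", method="DELETE")


# Assumption: "test" is the project segment, as in the session above.
fqdn = "test.asd.codfw.labtest"
requests_in_order = [
    node_request("test", fqdn),
    set_hiera_request("test", fqdn, "{a: b}"),
    node_request("test", fqdn),
    delete_prefix_request("test", fqdn),
    node_request("test", fqdn),
]
```

To actually run it on the puppetmaster host, each request would be passed to `urllib.request.urlopen(req)` and the response body read and printed.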

Andrew updated the task description.
Krenair updated the task description.

Change 530340 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] cloud: Switch encapi calls to new puppetmaster

https://gerrit.wikimedia.org/r/530340

Change 530341 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud recursors: alias 'puppet' to the new in-labs puppetmaster

https://gerrit.wikimedia.org/r/530341

Change 530382 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud

https://gerrit.wikimedia.org/r/530382

Change 533758 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Make puppetmaster CA content key be a hash keyed by puppetmaster

https://gerrit.wikimedia.org/r/533758

Change 530344 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] cloud: Change monitoring things to look at new pupeptmaster

https://gerrit.wikimedia.org/r/530344

Change 530371 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] cloud: Move instances to use new puppetmaster

https://gerrit.wikimedia.org/r/530371

At Wikimania, Andrew and I discussed what else needs doing before we push the button here. We realised there was a slight problem with the actual hiera change of puppetmaster plus the profile::base::certificates::puppet_ca_content addition: some instances have overridden puppetmaster but won't have overrides for profile::base::certificates::puppet_ca_content. (The whole idea of that key is that it is kept in sync with puppetmaster; the two should be changed at the same time so puppet can fiddle about with certificates as appropriate.) We decided to make it a hash keyed by puppetmaster, which is ignored if the relevant key is not set. I've uploaded that as https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/533758/ and updated https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/530371/ for it.
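
The lookup described here can be illustrated with a short Python sketch. It is not the Puppet code from the patches; it only models the idea that the CA content is chosen by the instance's puppetmaster value and skipped when the hash has no entry for it. The function name, hostnames and certificate strings are hypothetical.

```python
from typing import Mapping, Optional


def select_ca_content(puppetmaster: str,
                      ca_content_by_master: Mapping[str, str]) -> Optional[str]:
    """Return the CA content for this instance's puppetmaster.

    Returns None when the hash has no entry for that puppetmaster, which
    models "ignored if the relevant key is not set".
    """
    return ca_content_by_master.get(puppetmaster)


# Hypothetical global default: only the new central puppetmaster has an entry.
ca_content_by_master = {
    "puppetmaster.cloudinfra.wmflabs.org": "-----BEGIN CERTIFICATE-----(placeholder)",
}

# An instance following the global default gets the matching CA content.
central = select_ca_content("puppetmaster.cloudinfra.wmflabs.org", ca_content_by_master)

# An instance that overrode `puppetmaster` to a project-local master
# (hypothetical hostname) has no entry, so nothing is installed for it.
project_local = select_ca_content("some-project-puppetmaster.example", ca_content_by_master)
```

Keying by puppetmaster means an instance with an overridden puppetmaster never picks up CA content meant for a different master, without needing a second per-instance override.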

Mentioned in SAL (#wikimedia-operations) [2019-09-09T17:56:24Z] <andrewbogott> disabling puppet on labpuppetmaster1001 as part of T171188

Change 530340 merged by Andrew Bogott:
[operations/puppet@production] cloud: Switch encapi calls to new puppetmaster

https://gerrit.wikimedia.org/r/530340

Change 530344 merged by Andrew Bogott:
[operations/puppet@production] cloud: Change monitoring things to look at new puppetmaster

https://gerrit.wikimedia.org/r/530344

Change 530341 merged by Andrew Bogott:
[operations/puppet@production] cloud recursors: alias 'puppet' to the new in-labs puppetmaster

https://gerrit.wikimedia.org/r/530341

Change 535275 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud recursors: alias 'puppet' to the cloud-internal puppetmaster IP

https://gerrit.wikimedia.org/r/535275

Change 535275 merged by Andrew Bogott:
[operations/puppet@production] cloud recursors: alias 'puppet' to the cloud-internal puppetmaster IP

https://gerrit.wikimedia.org/r/535275

Change 533758 merged by Andrew Bogott:
[operations/puppet@production] Make puppetmaster CA content key be a hash keyed by puppetmaster

https://gerrit.wikimedia.org/r/533758

Change 535305 had a related patch set uploaded (by Andrew Bogott; owner: Alex Monk):
[operations/puppet@production] Make puppetmaster CA content key be a hash keyed by puppetmaster

https://gerrit.wikimedia.org/r/535305

Change 535305 merged by Andrew Bogott:
[operations/puppet@production] Make puppetmaster CA content key be a hash keyed by puppetmaster

https://gerrit.wikimedia.org/r/535305

Change 530371 merged by Andrew Bogott:
[operations/puppet@production] cloud: Move instances to use new puppetmaster

https://gerrit.wikimedia.org/r/530371

Change 535329 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labs.yaml: remove profile::base::certificates::puppet_ca_content

https://gerrit.wikimedia.org/r/535329

Change 535329 merged by Andrew Bogott:
[operations/puppet@production] labs.yaml: remove profile::base::certificates::puppet_ca_content

https://gerrit.wikimedia.org/r/535329

Change 530382 merged by Andrew Bogott:
[operations/puppet@production] labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud

https://gerrit.wikimedia.org/r/530382

Mentioned in SAL (#wikimedia-operations) [2019-09-11T12:40:42Z] <moritzm> removing now puppet/puppetdb packages from labpuppetmaster* T171188

Mentioned in SAL (#wikimedia-operations) [2019-09-11T12:40:51Z] <moritzm> removing now obsolete puppet/puppetdb packages from labpuppetmaster* T171188

Change 537111 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Move labpuppetmaster1001 and 1002 to role::spare

https://gerrit.wikimedia.org/r/537111

Change 537111 merged by Andrew Bogott:
[operations/puppet@production] Move labpuppetmaster1001 and 1002 to role::spare

https://gerrit.wikimedia.org/r/537111