
Move the main WMCS puppetmaster into the Labs realm
Open, Normal, Public

Description

Historically, the Labs puppetmasters have been running in the production realm, for various legacy reasons. Early on, Labs (and now WMCS) gained support for self-hosted per-instance puppetmasters, and later for self-hosted per-project puppetmasters. Since then, the two (arguably) most important projects, deployment-prep and tools, have moved to self-hosted puppetmasters.

Having the WMCS "main" puppetmasters in the production realm is yet another labs->production realm bridge (or: a "labs-support" instance). It's especially iffy since Puppet is a complex codebase by itself, and complicated even further by the fact that it is essentially a compiler-on-demand for dynamic, living code. Such a jump has been exploited as a demonstration before and it wasn't that hard to achieve either (let's leave it at that :), so this isn't just hypothetical.

Puppet for WMCS instances doesn't need any kind of private data, and there is really no particular reason other than legacy why it runs in the production realm (as demonstrated by the various project puppetmasters too), so I'd like to discuss the path towards its eventual move to the labs realm. It's not super urgent or anything, but I've been thinking about this for a while and got reminded of it with the recent labspuppetmaster work -- and it turns out I never filed a task about it (that I could find) :)

So, I think a few different ideas have been mentioned on how to approach this (and feel free to adjust/correct):

  1. Move to "labs-support" (= production realm, public IP, but accessible only to Labs): would probably work and be an improvement over the current situation, but it's not really a move to the labs realm and likely not enough.
  2. Deploy the puppetmasters in multiple VMs, perhaps even across multiple labvirts for increased reliability.
  3. Deploy a couple of puppetmaster VMs, perhaps allocated in a way that there's only 1 VM running on each bare metal server (I think this has been done already in a few other cases?).
  4. Wait until WMCS supports bare metal instances in the Labs realm, move the existing bare metal machines there (blocked on Neutron, I guess Ironic too?)

My inclination would be to just go with (2), which doesn't sound like a huge amount of work to me given that all the various parts are there, but I may be missing a lot of background.

How do you (cloud-services-team) folks feel about this? What pros/cons do you see in each and which one is your preferred solution?

Related Objects

Event Timeline

faidon created this task. · Jul 20 2017, 4:15 PM
bd808 edited projects, added Cloud-VPS, Puppet; removed Cloud-Services. · Jul 24 2017, 3:49 PM
Andrew added a subscriber: Andrew. · Jul 24 2017, 3:56 PM

I'm pretty sure that #1 is moot -- at least, anytime we discuss it we conclude that the 'labs-support' vlan isn't really a useful concept and should be eliminated.

#2 is almost certainly the way to go, as it avoids the weird chicken-and-egg issue of "we need a labs puppetmaster to build a labs puppetmaster" -- currently I can't even log into a new VM /at all/ until it's properly puppetized. So to move forward on this we would need some way of accessing an unpuppetized VM.

#3 isn't obviously more useful to me than #2 -- as long as we have redundant puppetmasters and ensure they're not on the same physical host, we would get the same value (as I understand it).

#4 is appealing in part because it avoids chicken-and-egg, but unless we wind up having multiple use cases for bare-metal-in-instance-vlan it's probably not worth the additional complexity over #2.

Here are some things that need to be thought about/figured out before we can go forward:

  • Security model: Having a labs VM that is Ops-only and critical to the whole cluster is not unheard of, but it is difficult, and isn't currently solved to my satisfaction
  • Chicken-and-egg: We need a way to build a fresh puppetmaster in the event that we have no current working puppetmaster
  • Stability: Historically puppetmasters on labs have been flaky and crashed a lot. Why?
  • Performance: We haven't ever had a VM puppetmaster support more than a few dozen clients. I can't think of any reason why this would be an issue but it needs testing.
chasemp added a subscriber: chasemp. · Edited · Jul 24 2017, 5:32 PM

My understanding is that we are looking at putting this in the public VLAN as the current compromise, short of moving services into the Labs realm directly, though I believe in this case making the masters themselves instances is the best eventual outcome. A few decent-sized unknowns for me: we have one base image that expects an external puppetmaster (even for project masters), so we would need to figure out some special bootstrap process for the masters themselves (and feel really sure it's not going to be broken in the long intervals before we come back around to it), and we haven't thought through managing this puppetmaster within the context of an instance at all. I don't feel like we have the bandwidth to bite this off directly right now. My vote is pursuing the course of action already in flight: decouple the puppetmaster from labcontrol, put it in the public VLAN with the new hardware, firewall it off from non-instances, and make notes for the portions of this process that would affect a next stage of converting to an instance. I think most of the in-progress work here needs to be done for either outcome.

MoritzMuehlenhoff triaged this task as Normal priority. · Jul 25 2017, 12:40 PM
  • Performance: We haven't ever had a VM puppetmaster support more than a few dozen clients. I can't think of any reason why this would be an issue but it needs testing.

Thought I'd do some numbers on this.

In deployment-prep we run happily with an m1.medium puppetmaster serving around 75 hosts. It looks like tools is up to around 150 hosts on an m1.medium.

Labs currently has 777 instances according to http://tools.wmflabs.org/openstack-browser/ - if we exclude the projects with their own puppetmasters (tools, deployment-prep, integration, automation-framework, bstorm-test, gerrit, git, hound, phabricator, puppet, striker, testlabs, thumbor, toolsbeta, traffic, wikifactmine) that loses 353 of them.

So that leaves roughly 424 instances likely to be using the central puppetmaster. If we assume tools' current puppetmaster is at maximum load, then you'd want something at least 3-4 times the size of an m1.medium in terms of VCPUs. An m1.xlarge might be capable of the job, unless you want to introduce a custom flavour?
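
For concreteness, here's that arithmetic as a quick sketch (the only assumptions are that required capacity scales roughly linearly with client count and that tools' m1.medium is near its limit):

```
# Back-of-the-envelope check of the sizing above.
total_instances = 777           # from openstack-browser
own_puppetmaster_instances = 353  # projects with their own puppetmasters
clients_per_m1_medium = 150     # tools' puppetmaster, assumed near capacity

central_clients = total_instances - own_puppetmaster_instances   # 424
m1_medium_equivalents = central_clients / clients_per_m1_medium  # ~2.8

# With some headroom, ~2.8x an m1.medium lines up with the "at least 3-4x /
# maybe an m1.xlarge" estimate above.
print(f"{central_clients} clients ≈ {m1_medium_equivalents:.1f}x an m1.medium")
```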

Ping? Could we set up a couple of puppetmasters in the new "cloudinfra" project and see where that leads us? I was previously told that this is probably a 1-2 week project; is that still the current assessment, and if so, do you have an estimate on when this could be scheduled?

In terms of concerns raised before in this task:

  • Security model: I suppose that's cloudinfra, right? We need to address that regardless, as we move more services within WMCS. In any case, anything would be better than the current security model...
  • The chicken-and-egg problem is the same as for production's puppetmasters -- they are in the exact same spot (clients to themselves). The solution there, I think, is to set up multiple ones across different cloudvirts and hope they won't all die together. If the worst happens, we can always intervene manually and set something up.
  • Stability: are puppetmasters on labs still crashing? They run the exact same distro and software, so I'm not sure why it would be the case, but we can help debug if that happens!
  • Performance: I'll defer to @Krenair's analysis as he seems to know more than me about this :) It sounds to me like we already have half of the fleet pointed at their own puppetmasters, so it doesn't sound like a different order of magnitude?
faidon edited projects, added Cloud-Services; removed Cloud-VPS. · Oct 20 2018, 9:41 AM

I imagine we'd need to issue every instance being moved a new puppet cert, as we presumably wouldn't want to hand the current labs puppetmaster CA over to the new instance? That might be fairly easy due to autosigning though.
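
As a rough illustration only (the master name below is a placeholder, and the reliance on autosigning is an assumption, not a tested procedure), the per-instance re-keying could look something like:

```
#!/usr/bin/env python3
# Sketch only: drop the client certs issued by the old CA so the instance
# requests a fresh certificate from the new puppetmaster (autosign assumed).
import shutil
import subprocess

NEW_MASTER = "puppetmaster.example.wmflabs.org"  # hypothetical placeholder

# Ask puppet where its SSL directory lives rather than hardcoding the path.
ssldir = subprocess.check_output(
    ["puppet", "config", "print", "ssldir"], text=True).strip()
shutil.rmtree(ssldir, ignore_errors=True)

# The first run against the new master submits a CSR; autosigning would approve it.
subprocess.run(["puppet", "agent", "--test", "--server", NEW_MASTER], check=False)
```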

  • Stability: are puppetmasters on labs still crashing? They run the exact same distro and software, so I'm not sure why it would be the case, but we can help debug if that happens!

I don't remember seeing deployment-prep's one crashing.

Krenair added a comment. · Edited · Oct 20 2018, 4:09 PM
  • Security model: I suppose that's cloudinfra, right? We need to address that regardless, as we move more services within WMCS. In any case, anything would be better than the current security model...

I think what Andrew was referring to was the historical issues around access control on restricted bastions. AFAIK that isn't an issue with the model around the cloudinfra project as it just uses the standard project group mechanism. As long as the group only contains authorised people you should be okay without relying on anything particularly obscure like profile::ldap::client::labs::restricted_to.

  • Performance: I'll defer to @Krenair's analysis as he seems to know more than me about this :)

I just looked at current numbers and assumed the number of VCPUs required is proportional to the number of puppet clients. I also assumed it'd work like a current project puppetmaster, with just a single instance serving all clients, but actually I imagine it should work more like prod, with multiple hosts? I think I also made assumptions about instance roles based on names. I didn't do anything particularly special.

It sounds to me like we already have half of the fleet pointed at their own puppetmasters, so it doesn't sound like a different order of magnitude?

Numbers update: Labs has 728 instances. I took a list of all instances with 'puppet' in the name; here are the ones I either know or suspect to be puppetmasters:

automation-framework af-puppetmaster02
bstorm-test bstorm-puppet-01
cloudinfra cloudinfra-puppetmaster-01
deployment-prep deployment-dumps-puppetmaster02
deployment-prep deployment-puppetmaster03
gerrit puppet-gerrit
git puppet-paladox
hound hound-puppet-02
integration integration-puppetmaster01
phabricator puppet-phabricator
puppet keith-puppetmaster
puppet puppet-jmm-pmaster
striker striker-puppet01
testlabs gtirloni-puppetmaster-01
testlabs abogott-puppetmaster
thumbor puppetmaster
tools tools-puppetmaster-01
toolsbeta toolsbeta-puppetmaster-02
toolsbeta toolsbeta-puppetmaster-01
traffic traffic-puppetmaster
wikifactmine puppetmaster-01
wmcs-nfs nfs-puppetmaster-01

I manually filtered these out because I either know or suspect they're not puppetmasters:

automation-framework af-puppetdb01
automation-framework af-puppetdb02
deployment-prep deployment-puppetdb02
puppet puppet-jmm-kernel-stretch2
puppet puppet-jmm-kernel-stretch
puppet puppet-ema-2
testlabs puppet-compiler-v4-tools
testlabs puppet-compiler-v4-other
testlabs puppet-compiler-other
testlabs puppet-compiler-tools
toolsbeta toolsbeta-puppetdb-01

So anyway, that's 22 puppetmasters. The projects containing the puppetmasters listed above contain 329 instances, averaging 15 instances per project puppetmaster. The projects not containing puppetmasters have 399 instances.
So it's about 45% of labs probably using one of 22 project puppetmasters, and we're talking about moving the remaining 55% to some new puppetmaster instances. The question is how many and how big should each new puppetmaster be.
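
For what it's worth, the classification above boils down to something like this sketch; the instance list itself would come from openstack-browser / the OpenStack APIs, which isn't shown here:

```
# Sketch of the counting above. `instances` is a list of (project, name) pairs;
# fetching it from openstack-browser / the OpenStack APIs is left out.
NOT_PUPPETMASTERS = {
    "af-puppetdb01", "af-puppetdb02", "deployment-puppetdb02",
    "puppet-jmm-kernel-stretch2", "puppet-jmm-kernel-stretch", "puppet-ema-2",
    "puppet-compiler-v4-tools", "puppet-compiler-v4-other",
    "puppet-compiler-other", "puppet-compiler-tools", "toolsbeta-puppetdb-01",
}

def count_coverage(instances):
    masters = {(proj, name) for proj, name in instances
               if "puppet" in name and name not in NOT_PUPPETMASTERS}
    projects_with_master = {proj for proj, _ in masters}
    covered = sum(1 for proj, _ in instances if proj in projects_with_master)
    total = len(instances)
    return {
        "puppetmasters": len(masters),              # 22 above
        "covered_instances": covered,               # 329 above
        "central_instances": total - covered,       # 399 above
        "central_share": (total - covered) / total, # ~0.55 above
    }
```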

aborrero added a subscriber: aborrero.

I will try to discuss this in our next team meeting.

Noob question: I understand that cloudinfra-puppetmaster-01 is a puppetmaster just for the cloudinfra project, right?
Are we talking about another server, let's say cloudvps-puppetmaster-01, also in the cloudinfra project?

JFTR, I don't know what cloudinfra-puppetmaster-01 is. Maybe @Krenair or someone else set that up?

More broadly, the concept behind this task is to set up "default" puppetmasters for all of the projects, i.e. all instances that don't use some special kind of puppetmaster. Whether those default/central/last-resort puppetmasters should be hosted within the cloudinfra project or some other WMCS project… is an implementation detail I think.

JFTR, I don't know what cloudinfra-puppetmaster-01 is. Maybe @Krenair or someone else set that up?

I don't have access to do that. I assume this is a project puppetmaster for either the MX-out or NTP servers that exist in that project.

herron added a subscriber: herron. · Nov 19 2018, 3:05 PM

JFTR, I don't know what cloudinfra-puppetmaster-01 is. Maybe @Krenair or someone else set that up?

I don't have access to do that. I assume this is a project puppetmaster for either the MX-out or NTP servers that exist in that project.

According to Horizon, cloudinfra-puppetmaster-01 was created by Andrew in September. It has no signed puppet certs currently, so it's safe to say nothing is using it as of yet. @Andrew is that something we still need/plan to use, or could we turn down the instance?

As arturo suggests, cloudinfra-puppetmaster-01 is meant to be the puppetmaster for things inside the cloudinfra project. I anticipated us needing that for project-local secrets -- I'm surprised that e.g. mx-out01 doesn't need it... it certainly will if we add DKIM keys.

So, we can delete the VM if it's confusing people, but I'll just need to rebuild it sometime soon :)

Nothing has been done regarding the actual topic of this bug. It's a perfectly reasonable idea but not on the top of the priority list and work there is pending some decisions in other areas.

Ok, clarified then:

  • cloudinfra-puppetmaster-01 is a puppetmaster server just for the cloudinfra project. Still not in use though.
  • we will have to discuss whether we create a cloudvps-wide puppetmaster inside cloudvps (i.e., a VM).

#2 is almost certainly the way to go, as it avoids the weird chicken-and-egg issue of "we need a labs puppetmaster to build a labs puppetmaster" -- currently I can't even log into a new VM /at all/ until it's properly puppetized. So to move forward on this we would need some way of accessing an unpuppetized VM.

For the record -- T215211 is largely resolved, and with that I'm no longer nearly as worried about worst-case 'we locked ourselves out of everything' scenarios.

Krenair claimed this task. · Mar 26 2019, 5:45 PM

I'm planning to have a go at this soon.

Change 501581 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack::puppet::master::encapi: work on stretch with python3.5

https://gerrit.wikimedia.org/r/501581

Change 501581 merged by Andrew Bogott:
[operations/puppet@production] openstack::puppet::master::encapi: work on stretch with python3.5

https://gerrit.wikimedia.org/r/501581

Change 501587 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack::puppet::master::encapi: Avoid nginx-apache conflict

https://gerrit.wikimedia.org/r/501587

Krenair added a comment. · Edited · Apr 6 2019, 1:35 AM

I've got a puppetmaster set up on puppetmaster.cloudinfra.wmflabs.org now, hosted on cloud-puppetmaster-01 with a backend of cloud-puppetmaster-02. A test client on krenair-t171188-test.testlabs.eqiad.wmflabs is working.
Still a load of stuff to do though. It doesn't have a floating IP or any way for the OpenStack hosts to contact it yet. I had to do some manual work to avoid apache-nginx conflicts, deal with package problems around cergen's dependencies against the openstack-mitaka-jessie repo, and work around ferm's AAAA handling bugs, among other things.

Change 502235 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] profile::puppetmaster::frontend: Allow getting allow_from from hiera

https://gerrit.wikimedia.org/r/502235

Change 502235 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::puppetmaster::frontend: Allow getting allow_from from hiera

https://gerrit.wikimedia.org/r/502235

Change 501587 merged by Andrew Bogott:
[operations/puppet@production] openstack::puppet::master::encapi: Avoid nginx-apache conflict

https://gerrit.wikimedia.org/r/501587

Krenair added a comment. · Edited · Apr 19 2019, 4:56 AM

The number of puppet.git cherry-picks on cloudinfra-internal-puppetmaster is now 0; there are just the two secret commits to labs/private, which are pretty much the purpose of that instance.

TODO:

And then finally, if everyone is happy to go ahead:

  • Import encapi data
  • Move infrastructure over to talking to new puppetmaster - e.g. horizon
  • Move instances over to using new puppetmaster
Andrew closed subtask Restricted Task as Resolved. · Tue, May 7, 8:25 PM

Change 509915 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Replace git-sync-upstream on labspuppetmasters, remove from puppet-merge

https://gerrit.wikimedia.org/r/509915

Change 509915 merged by Andrew Bogott:
[operations/puppet@production] Replace git-sync-upstream on labspuppetmasters, remove from puppet-merge

https://gerrit.wikimedia.org/r/509915

Looks like we regressed here while I was busy - I logged onto the new puppetmasters to find puppet has been broken for weeks. It seems to be related to the clientpackages changes.

Change 511875 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack puppetmaster profiles: don't include clientpackages

https://gerrit.wikimedia.org/r/511875

Change 511877 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack puppetmaster roles: duplicate for set of profiles to be used in labs

https://gerrit.wikimedia.org/r/511877

Change 511875 merged by Andrew Bogott:
[operations/puppet@production] openstack puppetmaster profiles: don't include clientpackages

https://gerrit.wikimedia.org/r/511875

Change 511877 merged by Andrew Bogott:
[operations/puppet@production] openstack puppetmaster roles: duplicate for set of profiles to be used in labs

https://gerrit.wikimedia.org/r/511877