Re-think puppet management for deployment-prep
Open, Medium, Public

Description

Deployment-prep is thought of as a "staging" area for production, so it includes a lot of the production puppet roles.

This means that anyone wanting to change something in puppet (even simple refactors) will face the following issues:

  • There is no greppable way to find out where that role is applied within deployment-prep
  • Hiera is distributed between on-disk hieradata, the horizon UI (in multiple places), and wikitech.

This results in a lot of time lost chasing things around all those places, and a general difficulty in finding out which classes are actually applied there, and how their data is organized (in a different model than in production).

So my proposal would be to make deployment-prep work as similarly to production as possible, while allowing changes to deployment-prep to be made by non-opsens (i.e. people who don't have +2 on ops/puppet).

My idea is as follows (10k ft view):

  • on the deployment-prep puppetmaster, configure a 'staging' environment for puppet, with its own site.pp that will include deployment-prep node declarations instead of production ones. Those will come from a separate repository with its own merge/commit rights, separate from ops/puppet
  • on the deployment-prep puppetmaster, define a disk-based hiera hierarchy to mimic 1:1 what we have in production, as far as the hierarchy goes. This hiera hierarchy will also be in the new repository, thus freed from the ops +2 requirement (a sketch of both points follows this list)
  • We might want to have all nodes in this environment derive from a base node that includes all the labs boilerplate that is needed for openstack/ldap/basic system.
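A minimal sketch of what the first two points could look like on the puppetmaster; all paths, file names, and hierarchy levels below are illustrative assumptions, not decisions:

/etc/puppet/environments/staging/environment.conf
# Hypothetical directory environment: modules still come from ops/puppet,
# while site.pp and hieradata come from the new, separately-ACLed repository.
modulepath = /etc/puppet/environments/staging/modules:/etc/puppet/modules
manifest = /etc/puppet/environments/staging/manifests/site.pp

/etc/puppet/environments/staging/hiera.yaml
# Hypothetical per-environment hierarchy (Hiera 5 syntax, so this assumes a
# recent enough puppet) mirroring the *shape* of production's hierarchy.
---
version: 5
defaults:
  datadir: hieradata
  data_hash: yaml_data
hierarchy:
  - name: "Per host"
    path: "hosts/%{trusted.certname}.yaml"
  - name: "Per role"
    # assumes a top-scope $role variable is set for each node
    path: "role/%{::role}.yaml"
  - name: "Common"
    path: "common.yaml"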

This would allow me, as an opsen working on continuous refactors of our production code, to quickly know where to modify deployment-prep in order not to break it, and would bring greater logic and order to the hiera definitions for deployment-prep, which are currently out of control.

I think this would be a mid-level effort (the largest task being reorganizing that unruly mess of hiera I just described) with real benefits in both the short and the long term.

Event Timeline

Joe added a subscriber: BBlack.

We might want to have all nodes in this environment derive from a base node that includes all the labs boilerplate that is needed for openstack/ldap/basic system.

That part is covered in site.pp by having labs instances include role::labs::instance:

manifests/site.pp
node default {
    if $::realm == 'production' {
        include ::standard
    } else {
        # Require instead of include so we get NFS and other
        # base things setup properly
        require ::role::labs::instance
    }
}
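Under the proposal, a node declaration in the staging site.pp could then build on that same boilerplate; a minimal sketch, with a hypothetical hostname and an example production role:

staging manifests/site.pp
node 'deployment-mediawiki01.deployment-prep.eqiad.wmflabs' {
    # labs boilerplate (NFS, LDAP, base system), as in the default node above
    require ::role::labs::instance
    # the production role this instance is staging
    include ::role::mediawiki::appserver
}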

I do think the multiple Hiera sources are confusing. Originally OpenStackManager had an interface to let one define classes and variables, then apply a role/variables to a node.

The Hiera namespace on wikitech was introduced to offer self-service for non-ops, or to save one the trouble of crafting a puppet patch and cherry-picking it on a standalone puppet master. That is now better served by the Horizon UI.

An immediate low-hanging fruit would probably be to move all the content from https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep to puppet.git /hieradata/labs/deployment-prep/common.yaml and leave a comment about using Horizon from now on.

In Horizon, we should probably make people aware that changes there should be transient and eventually land in puppet.git /hieradata, similar to how puppetmaster cherry-picks eventually get merged into puppet.git once they are ready.

Puppet environments in labs would be quite a great thing. I can imagine project admins having commit rights on something like labs/puppet/deployment-prep.git, which would be a standalone puppet hierarchy fetched on the puppetmaster under /environments.
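For example (sketch only: the repository does not exist yet, and the target path is illustrative), the fetch on the puppetmaster could be as simple as:

# hypothetical repo, illustrative target path
git clone https://gerrit.wikimedia.org/r/labs/puppet/deployment-prep \
    /etc/puppet/environments/deployment-prep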

@hashar the point is to have something that resembles production, including the role-based hiera lookup.

It is also pretty important for the role/profile paradigm we're trying to enforce.

At the moment the only convenient way to do that is the puppet-prefix function of Horizon.

Honestly, this ticket is not about low-hanging fruit, but about an idea to make that environment manageable without draining people's energy keeping it unbroken during puppet refactors/changes aimed at production.

Honestly, if avoiding breaking beta could be done with one commit shadowing the hieradata commit in production, plus a couple of greps, it would be easier to get people to do it, versus the current process, which is, more or less:

  • Go to Horizon and cycle through all the VMs in deployment-prep that, according to your memory/experience, can have the class applied
  • Cycle through all the puppet prefixes defined there too
  • Search for hiera data in the four places I named above

which takes an experienced opsen around 2-3 hours even for a pretty simple refactoring, and of course most people don't even get to the next objective, which would be:

  • remove from hiera all variables that are now unneeded

so, all in all, the deployment-prep hiera is thousands of lines of sometimes-repeated variables.

So yeah, removing one source of hiera would help a bit, but not that much, to be honest. We need a radically different approach: accept that deployment-prep is qualitatively different from every other labs project, and treat it accordingly.

on the deployment-prep puppetmaster, define a disk-based hiera hierarchy to mimic 1:1 what we have in production

I think this would need to be a git repo that is tracked in gerrit or diffusion primarily because we can't guarantee that any particular OpenStack VM will be durable. If something happens to the disk image itself or to the labvirt host that currently holds the image we would need to be able to rebuild the puppetmaster and put all the custom config back.


At a high level I like the idea. The use of a project-local puppetmaster gives deployment-prep some flexibility that would be difficult to replicate across all of the OpenStack projects via our central puppetmaster. Ideally this 'prod-like' management setup would become a Puppet role/profile that any project could use if its use case was sufficiently production-like to warrant the additional complexity of managing a project puppetmaster and a dedicated hiera config repo.

Live hacking deployment-prep to implement the concept is likely to be frustrating for both the users of the beta cluster and the engineers working on fine tuning the implementation. I think I would suggest making a new project request for a place to develop the proof of concept for this setup. Once the big kinks are worked out and the parts that can be upstreamed to ops/puppet.git have been merged it would be easier to convert deployment-prep in place.

on the deployment-prep puppetmaster, define a disk-based hiera hierarchy to mimic 1:1 what we have in production

I think this would need to be a git repo that is tracked in gerrit or diffusion primarily because we can't guarantee that any particular OpenStack VM will be durable. If something happens to the disk image itself or to the labvirt host that currently holds the image we would need to be able to rebuild the puppetmaster and put all the custom config back.

That is exactly the plan. Deployment-prep would have its own puppet repo with site.pp and hieradata.
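A possible layout for that repository (purely illustrative):

site.pp            # deployment-prep node declarations
hieradata/
    common.yaml    # project-wide defaults, incl. content migrated from wikitech
    hosts/         # per-instance overrides
    role/          # per-role data, mirroring the shape of production's hierarchy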

  • on the deployment-prep puppetmaster, configure a 'staging' environment for puppet, with its own site.pp that will include deployment-prep node declarations instead of production ones. Those will come from a separate repository with its own merge/commit rights, separate from ops/puppet

+1 — finding node definitions has become tricky since they are no longer stored in LDAP, and I just spent a bit of time this morning digging for this exact information in various dashboards.

Will this repo just be a different site.pp for beta node definitions + hieradata? Or was something broader envisioned?

  • on the deployment-prep puppetmaster, define a disk-based hiera hierarchy to mimic 1:1 what we have in production, as far as the hierarchy goes. This hiera hierarchy will also be in the new repository, thus freed from the ops +2 requirement

+1 — continually re-familiarizing myself with the hiera hierarchy is no fun. I often answer questions about why adding a value to hieradata/role/common/deployment/server.yaml doesn't affect deployment-tin.

We discussed the role hierarchy on T120165: Implement role based hiera lookups for labs. I'm not clear on whether the objections there are still valid, or ever were valid for just beta (labs labs labs).

For clarification's sake, it sounds like this hierarchy would be a repo to which admins of the beta cluster would have +2, and which would be completely separate and mostly divergent from operations/puppet. It also sounds like this proposal would necessitate that we no longer use horizon or wikitech for hieradata lookups.

I'm good with all of this. The divergence of hierarchies coupled with the use of online + on-disk hieradata sources is confusing.

  • We might want to have all nodes in this environment derive from a base node that includes all the labs boilerplate that is needed for openstack/ldap/basic system.

Agree with @hashar -- I think we're OK here.

One thing this proposal does not address (which I assume means that we'd be maintaining the status quo) is deployment-prep puppetmaster cherry-picks. I took a stab at this problem in T135427: Beta puppetmaster cherry-pick process, but I back-burnered that project long ago.

I think I would suggest making a new project request for a place to develop the proof of concept for this setup. Once the big kinks are worked out and the parts that can be upstreamed to ops/puppet.git have been merged it would be easier to convert deployment-prep in place.

I think that if we use a new project for this work we may end up with a weird new third thing that has different problems from both beta and operations puppet but is incompatible with both. The temptation to start with a clean slate is strong, but divergent efforts in this area rarely seem to converge; examples are the staging project from two years ago and mediawiki-vagrant, which has become its own thing.

I think I would suggest making a new project request for a place to develop the proof of concept for this setup. Once the big kinks are worked out and the parts that can be upstreamed to ops/puppet.git have been merged it would be easier to convert deployment-prep in place.

I think that if we use a new project for this work we may end up with a weird new third thing that has different problems from both beta and operations puppet but is incompatible with both. The temptation to start with a clean slate is strong, but divergent efforts in this area rarely seem to converge; examples are the staging project from two years ago and mediawiki-vagrant, which has become its own thing.

I only meant it for figuring out how the puppetmaster changes work and puppetizing that upstream; I'm not suggesting building a full environment at all. As you say, that way lies madness.

<threadjack>mediawiki-vagrant's puppet was always meant to be its own thing. The use cases for provisioning a full-stack development environment have very little overlap with production operations. It would be nice to reuse more common bits from ops/puppet.git, but currently there is no appetite from the maintainers of that repo to produce content that is shareable in any way other than point-in-time copies. I totally understand that, as making a modular and reusable collection of Puppet resources is a non-trivial undertaking and would return very little, if any, benefit to production operations.</threadjack>

Will this repo just be a different site.pp for beta node definitions + hieradata? Or was something broader envisioned?

Mostly, yes.

  • on the deployment-prep puppetmaster, define a disk-based hiera hierarchy to mimic 1:1 what we have in production, as far as the hierarchy goes. This hiera hierarchy will also be in the new repository, thus freed from the ops +2 requirement

+1 — continually re-familiarizing myself with the hiera hierarchy is no fun. I often answer questions about why adding a value to hieradata/role/common/deployment/server.yaml doesn't affect deployment-tin.

Well, that's expected: that is the production hierarchy, and I don't think we should use the same files in prod and labs.

We discussed the role hierarchy on T120165: Implement role based hiera lookups for labs. I'm not clear on if objections there are still valid or ever were valid for just beta (labs labs labs).

Those objections are still valid if we want to use an ENC. I'm proposing to completely switch the model here.

One thing this proposal does not address (which I assume means that we'd be maintaining the status quo) is deployment-prep puppetmaster cherry-picks. I took a stab at this problem in T135427: Beta puppetmaster cherry-pick process, but I back-burnered that project long ago.

Yeah, I wouldn't try to solve that here either; environments are not a silver bullet for it anyway.

I think I would suggest making a new project request for a place to develop the proof of concept for this setup. Once the big kinks are worked out and the parts that can be upstreamed to ops/puppet.git have been merged it would be easier to convert deployment-prep in place.

I think that if we use a new project for this work we may end up with a weird new third thing that has different problems from both beta and operations puppet but is incompatible with both. The temptation to start with a clean slate is strong, but divergent efforts in this area rarely seem to converge; examples are the staging project from two years ago and mediawiki-vagrant, which has become its own thing.

I think Bryan is proposing to test just the puppetmaster setup in a different project, then move those changes to deployment-prep.

I agree with that idea.

Also, just to clarify: this is me expressing a real, everyday problem and a proposal for a possible solution.

I don't think it is my responsibility to work on such a project. I am willing to help whoever is up to the task, though.

chasemp subscribed.

@Joe I support the idea of not sharing role-related hieradata between prod and deployment-prep, if it means making everything a lot simpler and more centralised. However, we still need a strategy for things that can and should be shared between prod and deployment-prep.

As a starting point and example use case, I would like us to figure out how and where to define the scap::sources variable. This is a simple map of repo names that is stateless and not in any way specific to production or an individual cluster. It should apply everywhere by default, without the need to duplicate it. We may want to allow overriding or merging in some way at run-time (like we do with various HHVM/PHP setting arrays), but the bulk of it should be shared by default.
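For illustration, the shared definition could then live in common hieradata along these lines (the repo names and the exact value format are assumptions to be checked against the scap module):

hieradata/common.yaml
# sketch: keys are repo names, values hold optional per-repo settings
scap::sources:
  mediawiki/core: {}
  mediawiki/vendor: {}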

Change 436581 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] Move scap::sources from role::deployment_server to common

https://gerrit.wikimedia.org/r/436581

Change 436581 abandoned by Krinkle:
Move scap::sources from role::deployment_server to common

https://gerrit.wikimedia.org/r/436581

Intermediate proposal: can we give +2 rights on labs/private to everyone with root in deployment-prep?

Waiting for ops to +2 labs/private changes adds unnecessary friction to testing Puppet changes. The alternatives (putting secrets in horizon and/or local-only patches on the puppetmaster) add to the mess.

Intermediate proposal: can we give +2 rights on labs/private to everyone with root in deployment-prep?

For anyone wondering who this is, see the "Administrators" section of https://openstack-browser.toolforge.org/project/deployment-prep.

Waiting for ops to +2 labs/private changes adds unnecessary friction to testing Puppet changes. The alternatives (putting secrets in horizon and/or local-only patches on the puppetmaster) add to the mess.

The third option is posting the change to Gerrit and then cherry-picking it locally on the deployment-prep puppetmaster. This is not the most beautiful workflow, but it worked pretty well back when I was doing lots and lots of work in deployment-prep. https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Cherry-picking_a_patch_from_gerrit
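The workflow on that page boils down to fetching the change from Gerrit and cherry-picking it locally; roughly (the checkout path is an assumption, and the change ref just reuses the change number mentioned above as an example):

# on the deployment-prep puppetmaster
cd /var/lib/git/operations/puppet
git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/81/436581/1
git cherry-pick FETCH_HEAD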

For anyone wondering who this is, see the "Administrators" section of https://openstack-browser.toolforge.org/project/deployment-prep.

Is there a way for Gerrit to query this via LDAP? (If not, I guess we document adding someone in both places as part of the procedure when adding them in deployment-prep.)

The third option is posting the change to Gerrit and then cherry-picking it locally on the deployment-prep puppetmaster. This is not the most beautiful workflow, but it worked pretty well back when I was doing lots and lots of work in deployment-prep. https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Cherry-picking_a_patch_from_gerrit

+1 to this being a nicer option.

For anyone wondering who this is, see the "Administrators" section of https://openstack-browser.toolforge.org/project/deployment-prep.

Is there a way for Gerrit to query this via LDAP? (If not, I guess we document adding someone in both places as part of the procedure when adding them in deployment-prep.)

Not today, no. We do mirror project-level membership into LDAP (cn=project-deployment-prep,ou=groups,dc=wikimedia,dc=org), but not the roles that are held by individual accounts. That state data is only available in the OpenStack Keystone database.
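The mirrored group itself is queryable, though; for example (assuming anonymous read access and groupOfNames-style entries, both assumptions):

ldapsearch -x -b 'ou=groups,dc=wikimedia,dc=org' \
    '(cn=project-deployment-prep)' member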

Intermediate proposal: can we give +2 rights on labs/private to everyone with root in deployment-prep?

Changes to labs/private need to be puppet-merge'd on the production puppetmaster, which is why (AIUI) it's limited to just "ops". (I don't know why this is the case, just that it's what we currently do.)

Intermediate proposal: can we give +2 rights on labs/private to everyone with root in deployment-prep?

Changes to labs/private need to be puppet-merge'd on the production puppetmaster, which is why (AIUI) it's limited to just "ops". (I don't know why this is the case, just that it's what we currently do.)

It was made automatic behavior by T228443: Help people remember to merge labs/private git, which was a step toward T227029: Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters; that effort stalled out as likely infeasible. Prior to that it was a separate manual step on the prod puppetmasters. I think @Andrew may be able to better explain the whys, and whether this is a decision that can be reexamined today.

Deployment-prep-specific Hiera keys are also kept in the production operations/puppet repository (T277680) for some reason unknown to me.

Intermediate proposal: can we give +2 rights on labs/private to everyone with root in deployment-prep?

For anyone wondering who this is, see the "Administrators" section of https://openstack-browser.toolforge.org/project/deployment-prep.

Nope, in T71269 every member of the project was granted full sudo access.