
Easing pain points caused by divergence between cloud services and production puppet use cases
Open, Medium, Public

Description

(suggestions for a better title most welcome)

Currently the operations puppet repository is used by both cloud services and the production network. However, the use of puppet has some subtle differences; the main ones are the ENC and the hiera structure. Over the years this has caused a number of issues where, for instance, a member of the SRE team (most often the OP) introduces a change which is tested with production in mind and is green in CI and PCC. However, when the change is merged and deployed it causes issues with cloud services, often due to missing hiera defaults but sometimes triggered by some of the other subtle differences.

Other than day-to-day issues, the use of a shared repo also means that the velocity of change is pinned to the slowest-moving part. Cloud services has to deal with far more nuanced installations with many varying factors. This means that certain things, such as dropping support for an old version of puppet, are often more difficult and slower to achieve, as they require more coordination and communication with a wider audience.

Over the years there have been discussions on how we could ease this pain, with many solutions proposed, and although some small changes have been achieved the underlying issues still persist. This task is an effort to re-invigorate those discussions and try to resolve these issues once and for all. At the very least I think it would be useful to link all of the various efforts that have been proposed over time and document the differences and nuances between the cloud services and production puppet environments.

Divergences

Hiera structure

One of the main differences between production and the Cloud environment is the structure used by hiera. This has been documented in T255787 but is included here as well.

Production
  • makes use of wmflib::expand_path for common and site expansion
  • makes use of a $_role variable created with the role function (see below)
Cloud
  • makes use of cloudlib::httpyaml to fetch data from openstack.
  • has some additional parts to the hierarchy, e.g.
    • "cloud/%{::wmcs_deployment}/%{::labsproject}/hosts/%{::hostname}.yaml"
    • "cloud/%{::wmcs_deployment}/%{::labsproject}/common.yaml"
    • "cloud/%{::wmcs_deployment}.yaml"
    • "cloud.yaml"
  • can also use a secret repo outside of git (on the puppetmaster FS)
    • /etc/puppet/secret/hieradata/%{::labsproject}.yaml
  • uses a different hierarchy in the private repo
    • "labs/%{::labsproject}/common.yaml"
    • "%{::labsproject}.yaml"
    • "labs.yaml"

Node classifier

The node classifier is essentially used to provide the list of classes that should be applied to a node, as well as some additional parameters (aka hiera keys).

Production

In the production environment we use the site.pp manifest along with a custom role function. When called with e.g. role(foo::bar) it does two things:

  • load the class role::foo::bar
  • inject a global variable (node parameter) $_role = foo/bar into the manifest. The main use case for this is to look up role-specific parameters in hiera (as noted above)
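
A minimal sketch of what this looks like in site.pp (the hostname and role name below are made up):

```puppet
# site.pp (sketch) -- hostname and role are hypothetical.
node 'example1001.eqiad.wmnet' {
    # Loads role::foo::bar and injects $_role = 'foo/bar', which hiera
    # then uses for role-specific lookup paths.
    role(foo::bar)
}
```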

Recently there has been some effort to add the role variable to the cloud node classifier; however, it is currently stalled (see comments on the change).

Cloud

The cloud environment uses a custom script which queries the OpenStack API to produce a list of classes and additional hiera keys to apply to a node. This functionality enables community members to easily test out classes from the puppet repo, swap hiera values and pair different profile classes without the need to make a commit to the puppet repository.
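
For reference, an ENC returns YAML in the standard Puppet ENC format; a hypothetical output for a cloud instance could look like the following (the extra class and the parameter values are invented for illustration, not taken from the real script):

```yaml
# Hypothetical ENC output in the standard classes/parameters format.
classes:
  - role::wmcs::instance
  - profile::foo            # hypothetical extra class applied via Horizon
parameters:
  labsproject: examples     # hypothetical project name
  wmcs_deployment: eqiad1
```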

PuppetDB

The cloud environment doesn't have a puppetdb installation (although some individual projects may). This means that any use of either exported resources or functions which rely on puppetdb, e.g. puppetdb_query, won't work on an arbitrary cloud instance. The lack of puppetdb also means that the cumin puppetdb backend does not function in the cloud environment; however, I think that issue is out of scope for this task.
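
For example, both of the following common patterns depend on PuppetDB and so fail (or find nothing) on an arbitrary cloud instance; the resource type and role name are just illustrations:

```puppet
# Collecting exported resources requires PuppetDB:
Sshkey <<| |>>

# So does puppetdb_query, e.g. a PQL query for every node with a
# (hypothetical) role class applied:
$nodes = puppetdb_query(
  'resources[certname] { type = "Class" and title = "Role::Foo::Bar" }'
).map |$row| { $row['certname'] }
```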

Global Variables

The puppet repo configures a number of global variables via the realm.pp manifest. Some of these variables are the same in both environments, some are different, and some only exist in the cloud environment. The two variables which differ are:

  • $realm: points to either 'labs' or 'production', depending on which DNS domain a node is in.
  • $nameservers: in production this points to the production anycast service. The cloud environment sets this to the cloud services DNS servers, which have logic to auto-populate entries for nodes created in openstack.

The $realm variable is also used extensively throughout the puppet codebase.
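
The derivation and branching pattern is roughly the following (a sketch of the idea only; the domain regexes are illustrative, not the actual realm.pp logic):

```puppet
# Derive $realm from the node's DNS domain (illustrative patterns).
$realm = $trusted['certname'] ? {
    /\.(wmflabs|wikimedia\.cloud)$/ => 'labs',
    default                         => 'production',
}

# $realm then selects divergent code paths throughout the repo, e.g.:
if $realm == 'labs' {
    notice('applying cloud-specific logic')
}
```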

Only in cloud

Most of these variables are used to provide additional lookup paths in hiera (see above).

  • $labsproject: points to the openstack/horizon project of the node
  • $wmcs_deployment: indicates the openstack deployment, which today is either the cloud production (eqiad1) or development (codfw1dev) environment
  • $projectgroup: equal to "project-${labsproject}" (not sure of the use case; hoping cloud services can clarify)

In order to produce the variables above, some temporary variables were also used; however, as they are defined in realm.pp they will also be injected into node scope, so we list them here to be explicit:

  • $pieces this equals $_trusted_certname.split('[.]')
  • $dnsconfig used to populate the $nameservers variable

Possible ways forward

  • It should be possible to completely drop the realm variable and rely instead on hiera to control different logic paths. This will likely require a lot of refactoring; however, it should reduce the number of code paths which differ, moving the majority of the problem to hiera (see the sketch after this list).
  • Add puppetdb to cloud services. I suspect this has been investigated many times and is likely difficult to support, e.g. for projects with their own puppetmasters.
    • (@bd808) puppetdb is not multi-tenant safe/aware which is the blocker to attaching a puppetdb instance to the shared puppetmaster used by the majority of Cloud VPS projects
  • Explore the possibility of adding wmflib::expand_path to the cloud services hiera. This could be an additional level with the lowest hiera priority. This is one of the areas that causes the most day-to-day pain and feels like it could be a quick win.
  • Inject role and use the role variable in the cloud services hiera structure (change 680266).
  • Separate puppet repos (in some form). This is something that has been discussed a few times with many different proposals; I think it would be useful to try and resurrect some of those discussions/ideas.
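
As a sketch of the first bullet, dropping the realm variable would turn realm branching into plain data lookups; the key name and addresses below are hypothetical:

```puppet
# Before: the divergence is hard-coded in the manifest.
$nameservers = $::realm ? {
    'labs'  => ['172.20.255.1'],   # hypothetical cloud resolver
    default => ['10.3.0.1'],       # hypothetical production anycast IP
}

# After: a single code path; production and cloud each set the key at
# the appropriate level of their hiera hierarchy.
$nameservers = lookup('profile::example::nameservers')
```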


Event Timeline

jbond triaged this task as Medium priority. Jun 25 2021, 11:35 AM
jbond created this task.

FYI I have attempted to add related tasks as children to provide some background, but please remove them if they are invalid; further, please add any which may be missing.

bd808 added a subscriber: bd808.

Not sure if it's in scope for this task, but the lack of PuppetDB or any central tracking system makes it hard to see usage of specific Puppet classes/modules, so you're kind of running blind when refactoring or removing stuff.

> Not sure if it's in scope for this task, but the lack of PuppetDB or any central tracking system makes it hard to see usage of specific Puppet classes/modules, so you're kind of running blind when refactoring or removing stuff.

https://openstack-browser.toolforge.org/puppetclass/ is far from the features of a full PuppetDB catalog, but it is more than nothing.

In the past I've stated that a shared PuppetDB is not possible due to the lack of tenant isolation abilities in PuppetDB. Thinking about this a bit more today, I'm wondering exactly how true that statement is. I'm not questioning if PuppetDB has suddenly grown multi-tenancy, but I am wondering what exactly would be disclosed across projects by a shared PuppetDB collecting catalogs from the projects attached to the shared puppetmasters. In the shared puppetmaster environment we have already conceded that there are no secrets which can only be seen by a single project. The things in the (horribly named) labs/private.git repo which substitute for the manifests in the puppetmaster-local production 'private' repo are public to all tenants, so there would be no new leak there.

Things pulling from PuppetDB for config, however, probably would end up with weird results due to a lack of partitioning. If you did a query as part of a manifest for something like "ip of all nodes with role::foo" to configure a ferm rule, I think you would get cross-tenant results. In the case of something like an instance ferm config that wouldn't necessarily be a problem, as there is also a project-wide firewall in the form of OpenStack config that would block cross-tenant traffic by default. If the query results were used for something like configuring an elasticsearch deployment to know all of its expected peers, however, that might be problematic if the cluster thought it should be able to contact ES servers in other network isolation zones.

Maybe this is a silly question, but why does PuppetDB need to be shared? We already have the self-hosted puppetmaster thing, can we tack on PuppetDB to that?

> Maybe this is a silly question, but why does PuppetDB need to be shared? We already have the self-hosted puppetmaster thing, can we tack on PuppetDB to that?

A self-hosted puppetdb is possible today. Deployment-prep, toolsbeta, and puppet-dev all have them. The vast majority of projects however do not operate their own puppetmaster. https://openstack-browser.toolforge.org/puppetclass/role::puppetmaster::standalone shows 26 projects doing so out of 163 total projects. We have talked about the possibility of making a puppetmaster a default part of a new project if that would solve some of the other issues, but doing so would add complexity for users and compute load for the cluster so as a solution it needs to be pretty compelling.

Thank you @jbond for raising this topic!

To no one's surprise, over the (past many) years, I've wondered about the tight coupling between production host management and cloud VPS host management. The coupling is near-realtime and cross-realm; in other words, SREs continuously deploy code with root-level changes in N environments they have no control or monitoring over -- often resulting in the symptoms you eloquently articulated @jbond. On the flip side, N environments operated by their own systems administrators have Wikimedia-production-specific base system changes centrally deployed, often in ways that feel superfluous, unnecessarily limiting, or overriding choices that the sysadmin may want to make (such as deploying their own config management with Ansible or cloud-init, or their own AMI, etc.). This feels to me, conceptually, like the root of the problem here.

In 2011 our OpenStack VM setup, then called Labs, was envisioned as a replicable production environment to be used for development and staging (cf. Wikimedia Labs), as well as the eventual home for production itself. Ten years later, the latter use case proved unviable and is not pursued anymore, and the former two are only secondary or tertiary use cases with limited usefulness -- with the notable exception of a handful of projects, such as deployment-prep. In the 2021 Cloud Services survey, 72% of users reported that they use VPSes for "Hosting one or more tools or other public services such as a web app; bot; dashboard; API; etc.", while fewer than one in two (44%) reported "Testing and experimenting with software". In the 2021 Developer Satisfaction survey, using cloud VMs for local development was the lowest-scored (14%) method. In other words, 2021's use cases have little overlap with 2011/2012's use cases, when the decision to use production's Puppet code was made.

Furthermore, the limited number of non-WMCS-specific Puppet profiles/roles in the OpenStack browser is perhaps evidence that "testing and experimenting with software" is typically not for software or stacks that Puppet assists in deploying. Even for the true "production staging ground" use case, a use case that could be helpful to the maintainers of operations/puppet (SRE), the current design has proven to be too limiting and too divergent from actual production code, and resulted in the creation of the Pontoon project by the Observability team, taking a different approach (Cc @fgiunchedi).

Finally, from a forward-looking organizational direction perspective, the cloud services infrastructure primarily exists to serve our technical community, sustain it and assist with its further growth, with a primary focus on the tools community. @bd808 can correct me if I'm wrong, but I don't think that the SRE-maintained operations/puppet code tree is the right tool to advance TCB, or envisioned to be used for that.

In other words, this practice and resulting shared responsibility is of little benefit to our organizational goals and stakeholders -- whether that's to the SRE team, the MW developer community, the Tools community etc. So, with that in mind, I'd like to question the usefulness of tight coupling, and specifically using the production Puppet code deployed by default in all projects. I believe the current design to be an artifact of a much older design that is now not aligned to either infrastructure's strategic direction or organizational structure. I would like to see us revisit/reevaluate that design, i.e. I'd like to consider the task description's last bullet point ("separate puppet repos (in some form)") as the primary alternative here, and to carefully and critically consider the benefits, alongside the costs/pitfalls that have already been described. SRE I/F and @jbond can continue to take the lead here as the primary maintainers of our production Puppet infrastructure and repository; I'd be happy to help and especially review this from a longer-term infrastructure direction perspective. Thank you all!

As someone who works regularly with deployment-prep I've found that pulling hiera from both operations/puppet and openstack/horizon is easily the most annoying pain point. I imagine the situation isn't as bad in other projects, but in deployment-prep the hiera is essentially split (and in some cases duplicated, not necessarily with the same values) between ops/puppet (which at least some SREs prefer when poking around deployment-prep) and openstack (which doesn't require +2 on ops/puppet so most others prefer it).

> In other words, 2021's use cases have little overlap with 2011/2012's use cases, when the decision to use production's Puppet code was made.

+1

> Furthermore, the limited number of non-WMCS-specific Puppet profiles/roles in the OpenStack browser is perhaps evidence that "testing and experimenting with software" is typically not for software or stacks that Puppet assists in deploying.

I agree.

We tend to assume that the reason cloud is so sensitive to breakage by upstream production changes is because of divergence in the setup and therefore we must remove the divergence. The reality is that we create that divergence all the time on purpose.

  • There are often missing bits of hiera, for instance, on purpose. We certainly would not want production credentials from hiera available on cloud VMs.
  • There is no puppetdb in cloud on purpose because it is not multi-tenant.
  • We tend not to sweat the fact that hiera doesn't work the same most days in cloud because volunteer contributors to cloud projects are unlikely to have +2 on puppet and even less likely to have access to the private credentials.
  • Finally cloud-specific classes tend to be multi-tenant and key off things like the keystone project that cannot possibly exist in production, on purpose.

The fact that WMCS SREs can commit to puppet with +2 creates the peculiar state that WMCS-maintained cloud projects tend to be highly puppetized compared to anything else in cloud besides peculiar cloud-wide mechanisms like the nfs mounts and things of that nature. Recent projects have been less puppetized even so, such as kubeadm, kustomize and helm-based work in paws and Toolforge.

We discuss this matter when puppet breaks because well-meaning people do good things to puppet and the cloud blows up because there was an odd bit that nobody (probably including me) knew about that was different, possibly in a surprising way. I think it is natural to assume the broken thing is that they are different or divergent. I think that we can reduce the difference between them, but we cannot eliminate it. I don't imagine anyone would want to, for that matter, given enough thought and discussion. It cannot hurt to eliminate a few surprise traps, though, you know?

I think it would make excellent sense if the "labs" realm in puppet were actually either a different puppet system or something entirely different that was a bit more accessible for cloud users to configure their own systems with -- including the option to set up an entirely production-like, actual standalone puppet master if people wanted to test puppet modules. I suspect that my team, myself included, generally dislikes this notion not on theoretical grounds but practical ones. It's a big project to commit to, and it is difficult to be sure how much of a totally separate system for configuring VMs we are already practically operating. We are definitely already operating at least part of such a system with our own cumin, puppetmaster hierarchy, puppet repo prefixes, Horizon puppet ENC, etc., and the lack of multi-tenancy in that build chafes regularly.

Thanks for all the responses

> The reality is that we create that divergence all the time on purpose.

I completely agree, cloud is a very different environment compared to the production environment, and trying to shoehorn in a solution that works for both is doomed to fail.

> It cannot hurt to eliminate a few surprise traps

From my initial list, I think adding wmflib::expand_path to the cloud services hiera feels like it could be a simple quick win. The risk here is that some of the defaults may change; however, they could easily be fixed by adding the correct value to cloud.yaml (which would take precedence). Further, I think having a stretch goal to remove the realm variable feels useful.
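
For example, if a production default surfaced by expand_path turned out to be wrong for cloud, pinning the correct value at the cloud level would look something like this (the key and value are hypothetical):

```yaml
# cloud.yaml (hypothetical key/value): overrides a production-oriented
# default that the new expand_path levels would otherwise supply.
profile::example::some_setting: cloud_specific_value
```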

> I think it would make excellent sense if the "labs" realm in puppet were actually either a different puppet system or something entirely different

From previous discussions I have had, this seems to be the general view, with at least the following caveats:

  • It's useful for production modules to work in cloud for testing reasons (this now seems less of a priority)
  • There are some base classes which are used by cloud instances and it would be useful to not have to duplicate effort.

In relation to this latter point, I think it's worth identifying which classes are used. A very quick traversal of role::wmcs::instance suggests at least the following (I have omitted glue-like modules, i.e. profile::*, and modules which are installed but not strictly needed, e.g. profile::base::certificates, profile::pki::client):

  • sudo
  • security
  • debian
  • exim4
  • ldap (cloud is probably the primary user of this module)
  • base::syslog (this is 5 lines of puppet code)
  • base::service_auto_restart (probably not needed or easily replaced)
  • ok just realised the whole of profile::base is included

If these are the only modules used by the base instance then I think we could:

  • simply fork.
  • potentially use a module from the Forge which does the same thing
  • manage these modules as external modules (debian, although needing an update, already is).
  • a mix of the above

Managing modules as external modules has always been met with scepticism, or at least there are split opinions within the SRE team on whether it's a good idea or just creating work. However, I think that the modules above are all fairly stable and it would not add much overhead to spin these out. Being more creative, we might be able to do something in CI which auto-publishes modules from the operations/puppet repo into their own individual repos. I wrote the preceding sentence before realising that profile::base is included; however, I'm leaving it here as I think it could be a useful way forward.

The inclusion of profile::base also looks like an area which could be a quick win. profile::base is very much targeted at production and includes a lot of stuff which I don't think is needed in cloud's base policy, e.g. debdeploy, pki::client, logoutd, profile::rsyslog::kafka_shipper, nrpe. Perhaps we could split this out and create a profile::base::wmcs containing only the base modules actually needed?
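
A sketch of what that split could look like; the class name and exact contents are hypothetical, based only on the module list above:

```puppet
# Hypothetical slim base profile for cloud instances.
class profile::base::wmcs {
    include sudo
    include security
    include exim4
    include base::syslog
    # Deliberately omitted: debdeploy, profile::pki::client, logoutd,
    # profile::rsyslog::kafka_shipper, nrpe and the other
    # production-only parts of profile::base.
}
```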

Once we have cleaned up profile::base, it may be that the set of shared modules is stable enough that managing them externally or adding some magic to CI doesn't look so painful.

Another class of instances which I have not considered are the cloud*/labs* hosts which are in the production network and likely use more of the production code base. To me it feels like these should be treated separately; I generally consider them production boxes in this context. However, that view may not be shared.

Change 702325 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] cloud dev - hiera: add wmflib::expand_path to codfw1dev hiera

https://gerrit.wikimedia.org/r/702325

Change 702326 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] cloud - hiera: add wmflib::expand_path to hiera

https://gerrit.wikimedia.org/r/702326

> Thank you @jbond for raising this topic!

Indeed, thank you!

> To no one's surprise, over the (past many) years, I've wondered about the tight coupling between production host management and cloud VPS host management. The coupling is near-realtime and cross-realm; in other words, SREs continuously deploy code with root-level changes in N environments they have no control or monitoring over -- often resulting in the symptoms you eloquently articulated @jbond. On the flip side, N environments operated by their own systems administrators have Wikimedia-production-specific base system changes centrally deployed, often in ways that feel superfluous, unnecessarily limiting, or overriding choices that the sysadmin may want to make (such as deploying their own config management with Ansible or cloud-init, or their own AMI, etc.). This feels to me, conceptually, like the root of the problem here.
>
> In 2011 our OpenStack VM setup, then called Labs, was envisioned as a replicable production environment to be used for development and staging (cf. Wikimedia Labs), as well as the eventual home for production itself. Ten years later, the latter use case proved unviable and is not pursued anymore, and the former two are only secondary or tertiary use cases with limited usefulness -- with the notable exception of a handful of projects, such as deployment-prep. In the 2021 Cloud Services survey, 72% of users reported that they use VPSes for "Hosting one or more tools or other public services such as a web app; bot; dashboard; API; etc.", while fewer than one in two (44%) reported "Testing and experimenting with software". In the 2021 Developer Satisfaction survey, using cloud VMs for local development was the lowest-scored (14%) method. In other words, 2021's use cases have little overlap with 2011/2012's use cases, when the decision to use production's Puppet code was made.
>
> Furthermore, the limited number of non-WMCS-specific Puppet profiles/roles in the OpenStack browser is perhaps evidence that "testing and experimenting with software" is typically not for software or stacks that Puppet assists in deploying. Even for the true "production staging ground" use case, a use case that could be helpful to the maintainers of operations/puppet (SRE), the current design has proven to be too limiting and too divergent from actual production code, and resulted in the creation of the Pontoon project by the Observability team, taking a different approach (Cc @fgiunchedi).

Also to no one's surprise, I fully agree with the points and very thoughtful considerations above!

To add more context behind the drive for Pontoon: one of my mantras is that a Debian base OS install plus puppet.git must be enough to get a host and its roles close to what we have in production. We're not there yet (think e.g. credentials in public-private.git, which could be auto-generated on demand instead) but we're not that far off either, and iterations are quick. The need to resemble production, although with different reliability requirements, allows us to stub systems out as needed while keeping the same functionality. The fact that puppet.git is now aware of Cloud VPS facilities (e.g. reading users/keys from LDAP) is a useful bonus although not strictly required (e.g. if we're not in realm labs then SRE users and public keys get provisioned instead).

> Finally, from a forward-looking organizational direction perspective, the cloud services infrastructure primarily exists to serve our technical community, sustain it and assist with its further growth, with a primary focus on the tools community. @bd808 can correct me if I'm wrong, but I don't think that the SRE-maintained operations/puppet code tree is the right tool to advance TCB, or envisioned to be used for that.
>
> In other words, this practice and resulting shared responsibility is of little benefit to our organizational goals and stakeholders -- whether that's to the SRE team, the MW developer community, the Tools community etc. So, with that in mind, I'd like to question the usefulness of tight coupling, and specifically using the production Puppet code deployed by default in all projects. I believe the current design to be an artifact of a much older design that is now not aligned to either infrastructure's strategic direction or organizational structure. I would like to see us revisit/reevaluate that design, i.e. I'd like to consider the task description's last bullet point ("separate puppet repos (in some form)") as the primary alternative here, and to carefully and critically consider the benefits, alongside the costs/pitfalls that have already been described. SRE I/F and @jbond can continue to take the lead here as the primary maintainers of our production Puppet infrastructure and repository; I'd be happy to help and especially review this from a longer-term infrastructure direction perspective. Thank you all!

FWIW I'm also +1 on considering revisiting/reevaluating whether separate puppet repos would serve us (SRE, WMCS, Cloud VPS users) better. Thank you all for working on this!

Change 704333 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] profile: create in module data for profile

https://gerrit.wikimedia.org/r/704333

Since I had a random conversation about this in IRC today with @nskaggs, I thought I should record what I found here. I was curious about the number of cloud users who are actively doing something with puppet.

Of 168 projects active today, I found that 110 projects have actually applied puppet classes using horizon to something besides what hits the entire cloud. I got this by checking the instance-puppet git repo with: ls -l */*.roles | awk '($5 > 3) {print $9}' | awk -F '/' '{print $1}' | uniq | wc -l (for those checking my method). Validating that the size was bigger than 3 was to filter out the empty files with just brackets in them.

I did not evaluate how many project owners are using custom puppet code vs. "cloud utility" roles like role::labs::mediawiki_vagrant, labs_lvm and cinderutils. I can confirm that nobody is using mariadb modules outside of staff-involved projects.

> Since I had a random conversation about this in IRC today with @nskaggs, I thought I should record what I found here

Thanks @Bstorm, I have added this to T289658.

joanna_borun changed the status of subtask T289665: Normalise hiera default values from Open to In Progress.
joanna_borun changed the status of subtask T289666: Improve PCC support for cloud VPS environments from Open to In Progress.

As part of T288618, we started using production puppet to manage a WMCS environment in response to the problems described above. This new environment is for testing logging cluster changes and drilling update procedures on a cluster with "live" data.

All that to say, please be in contact with the Observability folks as changes are implemented. I don't mean to say the current pattern of management needs to be maintained, but we are interested in keeping this environment functional for deployment-prep users as well as our needs through the transition period.