
Consider alternative configuration management tooling
Open, LowPublic

Description

I'm creating this task based on a discussion that started on the ops mailing list. I'll try to keep context as best I can, but feel free to update where appropriate.

I wonder if our energies might be better spent searching for alternatives to Puppet? We have some of the foremost Puppet experts in the world (including you), but we're unable to migrate off a version that has been EOL for nearly 2 years without external help. Puppet's ecosystem is in decline, and we will eventually have to move off it anyway. To me, the only question is how much effort we want to spend on Puppet before that time comes?

Ansible and Terraform are newer tools with much healthier ecosystems. They're built on technologies (Golang/Python, Jinja, YAML, HCL) that have broad support, so it's considerably easier to get help from the community. (Bonus: Ansible's core developers are active in #ansible, and you can usually get an answer to your question immediately during North American working hours).

Let me know if you think this is worth pursuing.

Event Timeline

jbond triaged this task as Low priority.Oct 28 2022, 8:40 AM
jbond created this task.
Restricted Application added a subscriber: Aklapper.

Thanks Brian for bringing up some alternative ideas!

I wonder if our energies might be better spent searching for
alternatives to Puppet? We have some of the foremost Puppet experts in
the world (including you), but we're unable to migrate off a version
that has been EOL for nearly 2 years without external help. Puppet's
ecosystem is in decline, and we will eventually have to move off it
anyway. To me, the only question is how much effort we want to spend on
Puppet before that time comes?

I agree with you that Puppet's ecosystem seems to be in decline.
However, my understanding is that some big companies, e.g. github still
use it, so I don't think it will disappear for some time, but there is
definitely less tooling and growth in the open source community around
Puppet.

Ansible and Terraform are newer tools with much healthier ecosystems.
They're built on technologies (Golang/Python, Jinja, YAML, HCL) that
have broad support, so it's considerably easier to get help from the
community. (Bonus: Ansible's core developers are active in #ansible, and
you can usually get an answer to your question immediately during North
American working hours).

@jhathaway
Besides, the whole problem of rewrites and migrations is extremely hard!
How do you envision Terraform and Ansible replacing our Puppet
infrastructure?

Terraform focuses primarily on creating components of an infrastructure,
say a server instance and a load balancer, but doesn't offer much in the
way of server configuration, e.g. install this package and config file,
which is the bulk of our Puppet configuration.

Ansible does provide patterns for server configuration, but it lacks
Puppet's and Terraform's concept of current state and desired state,
which makes long lived servers difficult to maintain in Ansible. Most
folks who use Ansible build up golden images and rebuild servers when
they change configuration. I think moving our infrastructure towards a
rebuild from scratch model would be extremely difficult.

Thanks! Your perspective as both a Puppet expert and relative n00b like me is very much appreciated. I hope you (and everyone else) will allow me to plead my case at a future SRE meeting!

@jhathaway
I agree with you that Puppet's ecosystem seems to be in decline.
However, my understanding is that some big companies, e.g. github still
use it, so I don't think it will disappear for some time, but there is
definitely less tooling and growth in the open source community around
Puppet.
Besides, the whole problem of rewrites and migrations is extremely hard!

@bking
I agree, it will be very time-consuming and painful to move off Puppet. But the current situation also seems painful and untenable.

If we switched, we could gain the help of a large, vibrant community. And they'd certainly benefit from our work as well: here's an awesome sysctl config that pretty much the entire Internet should be using*. If that were part of an ansible role on Galaxy, we'd probably help a lot of people improve their security and performance.

On the negative side, it's hard to imagine a situation in which the typical problems associated with a declining product (lack of community knowledge, difficulty hiring etc) get better for Puppet. They'll only get worse, especially considering that config management itself is becoming less important in most new devops environments (k8s, "managed" services, etc). I feel certain this migration will have to happen at some point, just as WMF migrated off (chef? CFEngine? Excel?) onto Puppet in the first place.

I do have some ideas about what a phased migration would look like, but I'd have to convince y'all first ;) .

@jhathaway
Terraform focuses primarily on creating components of an infrastructure,
say a server instance and a load balancer, but doesn't offer much in the
way of server configuration

@bking
Sorry, I don't want to derail this too much as it's mostly about Puppet/Ansible, but I will say that we managed baremetal servers with Terraform at a former job. My team (Search) is still in charge of hundreds of physical servers. So that type of lifecycle management would be nice, but orthogonal to this conversation.

@jhathaway
Ansible does provide patterns for server configuration, but it lacks
Puppet's and Terraform's concept of current state and desired state,
which makes long lived servers difficult to maintain in Ansible. Most
folks who use Ansible build up golden images and rebuild servers when
they change configuration. I think moving our infrastructure towards a
rebuild from scratch model would be extremely difficult.

@bking
I've been part of SRE teams (and had plenty of customers) that did use ansible-pull successfully for state management of long-lived servers, some in larger environments than WMF.

There were definitely config drift problems, but to the best of my knowledge, they were more about bad server hygiene (making one-off changes, not updating on a regular basis) than Ansible itself. I don't know if the way we use Puppet (periodic runs via systemd timer as opposed to continually monitoring state) would have prevented these issues.

Please let me know if you have any further feedback or questions on using ansible for state management, would like to try ansible, etc... Unless someone thinks this is a really bad idea, I'm going to reach out to Mark and Lukasz to present this at a future SRE meeting.

Thanks for your time,

I agree, it will be very time-consuming and painful to move off Puppet. But the current situation also seems painful and untenable.

Could you please elaborate a bit more on what is painful and untenable about our current situation, besides the fact we're trying to package puppet-server "properly"[1]?

That doesn't seem enough, by far, to justify spending thousands and thousands of man-hours moving from puppet to another configuration management system, instead of doing work to improve our infrastructure and make it progressively less reliant on puppet and change management[2], for instance.

Before we have a deeper discussion on this topic, which shouldn't hijack this thread, I would like to see a more thorough justification. In any case, I'd move this discussion to a separate thread or (better) on phabricator, where we can include more people.

[1] If we were ok with "dirty" debian packages, which I think we should be in this case, we wouldn't have all these issues. We just have tried to contribute puppet-server to the larger Debian community. I think 99.999% of puppet-using shops wouldn't think of this as an issue and just moved on and used the packaging offered by puppetlabs.
[2] Almost all of the application layer now operates on kubernetes. Soon more infrastructural support applications will run in containers too. Apart from where it makes no sense (datastores like mysql, cassandra, elasticsearch, and the edge caches) we are moving collectively in that direction.

[I'm not saying we should move to Ansible necessarily, but wanted to respond to something said up-thread :)]

I've used Ansible a fair amount in previous places, and I think it is entirely possible to have long-running servers managed thus (I've done so!); one might typically run Ansible out of cron and look for unexpected changes, for example. Yes you have the issue of un-Ansibleized changes, but you can get that with puppet too, and that's a matter of having good practices in place.

An advantage of Ansible's usual push model is that it makes transitions easier - say I want to move which host is running swiftrepl, I make the change in inventory and then run Ansible against the swift nodes, and it'll remove swiftrepl from one node and add it to another, and it's easier to do that within Ansible than with Puppet's "it'll happen next time puppet runs on a node" model. Obviously we have cumin and our cookbooks to handle this sort of thing, but in an Ansible model it's easier to do this out of your config management setup rather than needing separate infrastructure.

but we're unable to migrate off a version that has been EOL for nearly 2 years without external help.

Let me first start by saying that if there was some pressing reason to migrate, we could definitely do so; there are many options other than trying to stay compliant with Debian. Further, let's not conflate Debian's ability to package $software as an indicator of the health of that $software. If we do, we should note that Ansible doesn't come out too well either: the current package in stable is also EOL, and there are no packages for ansible-tower/controller.

Ansible and Terraform are newer tools with much healthier ecosystems.

Leaving aside the newer == better argument, can you point to more than https://galaxy.ansible.com/ to justify this statement? If I look at Galaxy there are definitely more contributors and modules; however, that could be a UX artefact: if we limit the search results to only include "puppet 7" and "ansible deprecated == false", the results are very similar. Further, from a very practical PoV, searching for something simple like apache, bind or exim gives very different results, with, IMO, Ansible not coming up with any good results except for exim. This surfaces the fact that code maturity and feature coverage should be the mark we use to measure, not the number of modules. I also compared the Ansible IRC/Matrix channel to Puppet's Slack/IRC channel, and the user counts are vastly different (~10k for Puppet vs 1k for Ansible), so I remain unconvinced as to which has the stronger community and ecosystem.

They're built on technologies (Golang/Python, Jinja, YAML, HCL) that have broad support

I'm going to leave arguing Ruby vs Go/Python to someone else :P although I would say Ruby has broad support. Clojure+JRuby: completely agree, someone must have come back from an Ayahuasca retreat when they suggested that.

Bonus: Ansible's core developers are active in #ansible

This is also true of the Puppet Slack channel, for what it's worth.

I agree with you that Puppet's ecosystem seems to be in decline.

I think managing bare metal is in decline and Puppet Labs missed the mark with k8s, and although I watch with bated breath to see the full impact of the Perforce acquisition, Puppet still has some very big payers using it, so I don't think it's going away anytime soon.

But the current situation also seems painful and untenable.

Could you please elaborate a bit more on what is painful and untenable about our current situation, besides the fact we're trying to package puppet-server "properly"

+1 any decision to either switch or stick with puppet needs to be informed so we should definitely document our puppet pain points and more importantly how they would be solved by $somethingelse

If we switched, we could gain the help of a large, vibrant community.

Again, I think this needs more justification. I'll also point out that at the Foundation we have resisted using Puppet community modules for $reasons (which could perhaps be reconsidered in a different task), and I'm not sure this would change much with Ansible.

And they'd certainly benefit from our work as well: here's an awesome sysctl config that pretty much the entire Internet should be using*

One of the $reasons is that making something customised for our environment is easier than creating a generic module that works on every flavour of Linux + Windows. This also means our modules have not become generally usable and released on Forge (some of them probably could and should, but again, different task), and switching tools will not fix this issue.

I do have some ideas about what a phased migration would look like, but I'd have to convince y'all first ;) .

I'd definitely be curious; however, I'd say that facter is not really the thing I'd be concerned about. PuppetDB is, IMO, the thing Puppet brings to the table that Ansible is missing: specifically, exported resources and puppetdb_query, which allow individual nodes to know about the state of the entire infrastructure (perhaps ansible-tower has this now?). PuppetDB also enables reporting capability with a rich API, which we use to drive things like cumin, spicerack and pcc node section; this is something that ansible-tower may well allow for, but it would require a lot of retooling.
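For anyone who has not used it, this is roughly what "knowing about the state of the entire infrastructure" looks like from the outside. The Python sketch below builds a PQL query for every node whose catalog contains a given class and turns it into a URL for PuppetDB's query endpoint. The host name, port and class title are made-up placeholders, and the helper names are hypothetical; it only constructs the request and does not reflect how cumin or spicerack actually talk to PuppetDB.

```python
from urllib.parse import urlencode


def hosts_with_class_query(class_title: str) -> str:
    """PQL asking for the certname of every node whose catalog has the class.

    Sketch only: class_title is interpolated as-is, so it must not contain quotes.
    """
    return f'resources[certname] {{ type = "Class" and title = "{class_title}" }}'


def puppetdb_url(base_url: str, pql: str) -> str:
    """URL for PuppetDB's root query endpoint with the PQL string as `query`."""
    return f"{base_url.rstrip('/')}/pdb/query/v4?{urlencode({'query': pql})}"


if __name__ == "__main__":
    # Placeholder host and class title, for illustration only.
    pql = hosts_with_class_query("Profile::Ntp")
    print(puppetdb_url("http://puppetdb.example.org:8080", pql))
```

An Ansible replacement would need an equivalent source of truth (a dynamic inventory, a fact cache, or AWX's database) before the tooling that depends on these queries could be ported.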

I've been part of SRE teams (and had plenty of customers) that did use ansible-pull successfully for state management of long-lived servers, some in larger environments than WMF.

Honestly, I don't doubt that we could make Ansible work; however, I don't think that the size of the environment is the proof we need, but rather how diverse and complex the systems are, i.e. it's easier to manage a 10k-node Hadoop cluster than it is to manage 2k nodes spanning ~250 different roles.

An advantage of Ansible's usual push model is that it makes transitions easier - say I want to move which host is running swiftrepl, I make the change in inventory and then run Ansible against the swift nodes, and it'll remove swiftrepl from one node and add it to another, and it's easier to do that within Ansible than with Puppet's "it'll happen next time puppet runs on a node" model. Obviously we have cumin and our cookbooks to handle this sort of thing, but in an Ansible model it's easier to do this out of your config management setup rather than needing separate infrastructure.

I think I must be missing something, because I honestly don't see a big difference here.

In Puppet you commit the change to the repo and either wait for Puppet to run or manually run Puppet via cumin or something else.

In Ansible you commit the change to the repo and either wait for Ansible to be run via cron or manually run Ansible.

Unless someone thinks this is a really bad idea, I'm going to reach out to Mark and Lukasz to present this at a future SRE meeting.

I think that considering how we manage servers going forward is a healthy conversation to have; however, I think we need to document the issues we have, the outcomes we desire, and think about how our infrastructure will change over the next 5 years. I don't think it's a good idea to have a solution in mind (Ansible) and to then try and build a case for it. If you are to ask me to consider Ansible specifically, my view is that, like operating systems, configuration management tools have their strengths and weaknesses, and at our complexity and scale we will meet those complexities and limits with whatever tool we choose. As such, I feel moving to Ansible would mean a multi-year transition to convert all our roles and, more importantly, our tooling (e.g. PCC, cumin, spicerack), and at the end of it we would just have a different set of problems and frustrations. Any proposal to change (whether to Ansible or something else) therefore needs to clearly highlight the current issues we face and justify how some $newtool will solve those problems.

Considering how we look to the future, I think the following comment is the one that resonates with me the most:

They'll only get worse, especially considering that config management itself is becoming less important in most new devops environments (k8s, "managed" services, etc).

We will never get rid of bare metal and the need to manage it; however, if we look to a future 5 years away and assume the current trends continue, it seems reasonable to expect that the majority of our infrastructure will move to k8s managed services, and the ~250 roles we currently have could be whittled down to a handful, e.g. dbs, k8s infrastructure, openstack servers, hadoop?, cassandra?, cache? lvs? As such, the majority of the complexity will move away from Puppet (naturally, as they don't have a solution in the container space) towards something like Helm, with a simple bare-metal infrastructure consisting of a few clusters. Considering this, I'm not convinced Ansible helps us any more than Puppet, but I suspect there probably is something better than Puppet.

jbond renamed this task from Consider migrating alternative configuration managment tooling to Consider alternative configuration managment tooling.Oct 28 2022, 1:29 PM

Thanks jbond, these are all legitimate points and must be addressed before we start to consider Ansible. Here's what I have so far:

let's not conflate Debian's ability to package $software as an indicator of the health of that $software,

ACK

if we do, we should note that Ansible doesn't come out too well either: the current package in stable is also EOL, and there are no packages for ansible-tower/controller

A few points on that:

Code maturity and feature coverage should be the mark we use to measure and not number of modules

Excellent point.

I also compared the Ansible IRC/Matrix channel to Puppet's Slack/IRC channel, and the user counts are vastly different (~10k for Puppet vs 1k for Ansible), so I remain unconvinced as to which has the stronger community and ecosystem.

It's true that measuring vitality in an open-source project is difficult. Finding someone who believes that Puppet has an equal or stronger community to Ansible in 2022 would be even more difficult.

The company behind Puppet (also confusingly called Puppet) was acquired by a private equity group, and laid off 15% of its staff, including technical people.

This happened to a previous employer (major backer of Openstack) and I can tell you the next steps:

  1. The best technical employees flee the company, afraid of more layoffs or demoralized by the failure of the product.
  2. The focus shifts to wringing cash out of locked-in Enterprise customers.
  3. Due to 1) and 2), open source culture within the company slowly dries up.

That approach is extremely profitable in the short term, but it's ultimately poison to the ecosystem. If we don't move, it will eventually be us propping up Puppet without help. That's why I'm bringing this up now.

They're built on technologies (Golang/Python, Jinja, YAML, HCL) that have broad support
I'm going to leave arguing Ruby vs Go/Python to someone else :P although I would say Ruby has broad support. Clojure+JRuby: completely agree, someone must have come back from an Ayahuasca retreat when they suggested that.

LOL @Ayahuasca retreat! I'm not arguing personal preference (who actually likes YAML?), just that the foundational technologies of Ansible have strong and vibrant communities. Not many people are learning Puppet DSL and ERB in 2022.

Bonus: Ansible's core developers are active in #ansible

This is also true of the Puppet Slack channel, for what it's worth.

To clarify, IRC is a required part of the job for Red Hat engineers. The core Ansible devs are active during NA working hours and I typically get a response within a few minutes. (You can imagine how useful that would be for learning). If Puppet is that way too, great! I'll check out their Slack channel and add it to our onboarding.

I think managing bare metal is in decline and Puppet Labs missed the mark with k8s, and although I watch with bated breath to see the full impact of the Perforce acquisition, Puppet still has some very big payers using it, so I don't think it's going away anytime soon.

In addition to the stuff I posted above, the Glassdoor page for Perforce doesn't look too positive.

But the current situation also seems painful and untenable.

Could you please elaborate a bit more on what is painful and untenable about our current situation, besides the fact we're trying to package puppet-server "properly"

+1 any decision to either switch or stick with puppet needs to be informed so we should definitely document our puppet pain points and more importantly how they would be solved by $somethingelse
The Disposable Development Environment page begins with a summary, "In order to maintain a good software development cycle for the WMF infrastructure we need have a robust development environment which is easy to set up, configure, adapt, and ideally disposable so that we are not consuming too many resources. " Reading the entire page, it's very hard not to conclude that Puppet is the biggest obstacle to that goal. So if resources are to be committed for DDE, let's consider migrating off Puppet as part of that.

Also, we build/maintain a lot of tools to address Puppet's deficiencies, most of which aren't used outside the Foundation (PCC, Pontoon, Puppet ENC API, Cumin/spicerack, probably more). These have overlap or would not be necessary with Ansible. This might not be as visible to those with years of experience at the Foundation and/or with Puppet.

Here's my outsider experience: I struggled for a couple of weeks trying to get PKI working in deployment-prep, and despite consulting with multiple SREs and our volunteer deployment-prep SME, no one's advice got me there. I eventually figured it out, but it was lots of time away from the primary goal of testing Elasticsearch upgrades.

Our modules have not become generally useable and released on forge (some of them probably could and should but again, different task) and switching tools will not fix this issue.

Sorry, to be clear I'm talking about roles and templates. If you're a young technology enthusiast moving towards a career as a sysad/SRE, which one of these tasks is easier to reason out? Technically, we share our battle-tested config with the world, but its impact is limited because it's written in Puppet DSL and ERB. If our mission is to spread free knowledge, this should certainly be a consideration.

PuppetDB is the thing puppet brings to the table that ansible is missing. Specifically exported resources and puppetdb_query... PuppetDB also enables reporting capability with a rich API which we use to drive things like cumin, spicerack and pcc node section, this is something that ansible-tower may well allow for but it would require a lot of retooling.

We do need to look more closely at our requirements there. Tower has become Ansible Automation Platform (community version is called "AWX"), but the old version definitely held state in an RDB. Retooling will definitely be necessary, but I'd argue that using popular tools is healthier in the long run than open-ended commitments to tools that don't see use outside the Foundation.

An advantage of Ansible's usual push model is that it makes transitions easier - say I want to move which host is running swiftrepl, I make the change in inventory and then run Ansible against the swift nodes, and it'll remove swiftrepl from one node and add it to another, and it's easier to do that within Ansible than with Puppet's "it'll happen next time puppet runs on a node" model. Obviously we have cumin and our cookbooks to handle this sort of thing, but in an Ansible model it's easier to do this out of your config management setup rather than needing separate infrastructure.
I think I must be missing something, because I honestly don't see a big difference here.

In Puppet you commit the change to the repo and either wait for Puppet to run or manually run Puppet via cumin or something else.

In Ansible you commit the change to the repo and either wait for Ansible to be run via cron or manually run Ansible.

I don't want to speak for Matthew, but I think his point is that you can use the same tool for state management and operations.

I don't think it's a good idea to have a solution in mind (Ansible) and to then try and build a case for it.

If this was a crowded space, I'd agree with you. But there's not much out there in config management these days besides Ansible. Bolt, with its agentless model and YAML-based plans, is a tacit admission that Puppet's agent approach is too complicated and rapidly losing popularity. I had great hopes for it, but unfortunately it seems to be mostly a tech demo (see note about using Puppet Plans instead of YAML for anything complicated). Considering the shake-ups in corporate Puppet, I doubt this tool will ever be mature enough to use in production environments.

If you are to ask me to consider Ansible specifically, my view is that, like operating systems, configuration management tools have their strengths and weaknesses, and at our complexity and scale we will meet those complexities and limits with whatever tool we choose.

I'm sure it's not your intention, but this reads as "we are already doing the best we possibly can." Based on my previous experience, I'd say we definitely can do better.

if we look to a future 5 years away...the majority of our infrastructure will move to k8s managed services and the ~250 roles we currently have could be whittled down to a handful e.g. dbs, k8s infrastructure, openstack servers, hadoop?, cassandra?, cache? lvs? As such the majority of the complexity will move away from Puppet...towards something like Helm with a simple bare-metal infrastructure consisting of a few clusters. Considering this, I'm not convinced Ansible helps us any more than Puppet, but I suspect there probably is something better than Puppet.

Couple of points on this:

  • I believe Puppet is an obstacle to running a stateless infrastructure. Whereas Ansible integrates neatly with Terraform, Puppet is built around the mindset of long-lived servers. The total number of roles might decrease, but still we'd be juggling two paradigms. As we gaze 5 years into the future, we have to consider turnover as well. This probably means fewer and fewer SREs with Puppet experience propping up what will still be a critical piece of our infrastructure, in a shrinking Puppet ecosystem. The maintainers of other Foundation-only tools may be gone too. I'm not sure that would make for a happy future jbond ;) .
  • Immediately prior to working at WMF, I worked at Q2, a company that is far further down the stateless infrastructure journey. We still used plenty of Ansible for configuring stateful services, building container images with Packer, etc. We should not assume that stateless infrastructure means no config management/operations, particularly at WMF where we will presumably be maintaining our own bare-metal infra.

Thanks again for reading my novel and if you or anyone else has feedback, please add it here!

Thanks for the response, Brian. In general I think that Ansible could be better, and I think some of the points around Puppet dying and the differing strength of the communities are valid; however, I'm not sure the cost is justified. I would say, though, that when comparing Puppet to Ansible we need to make sure we are making valid comparisons. For instance, I think that whether we were to use Ansible or Puppet, we would always want to have some agent running periodically. As such, we can't justify Ansible based on the fact that it is agentless (as we would need to install it as an agent).

I'd also say that when comparing something like PCC with Ansible's offering, we need to consider the complex cases and how they are displayed in the UX. One of the reasons I have not looked at making the jump to octocatalog-diff is that the output is very verbose and not easy to parse. We should also not underestimate PCC's ability to know which roles/hosts it should actually test. These are things I suspect we would need to add to Ansible's testing infrastructure, but I would be happy to be proved wrong.

The company behind Puppet (also confusingly called Puppet)

Minor correction: they are called Perforce (used to be puppetlabs).

But the current situation also seems painful and untenable.

Could you please elaborate a bit more on what is painful and untenable about our current situation, besides the fact we're trying to package puppet-server "properly"

+1 any decision to either switch or stick with puppet needs to be informed so we should definitely document our puppet pain points and more importantly how they would be solved by $somethingelse
The Disposable Development Environment page begins with a summary, "In order to maintain a good software development cycle for the WMF infrastructure we need have a robust development environment which is easy to set up, configure, adapt, and ideally disposable so that we are not consuming too many resources. " Reading the entire page, it's very hard not to conclude that Puppet is the biggest obstacle to that goal. So if resources are to be committed for DDE, let's consider migrating off Puppet as part of that.

Personally, I'm not convinced that Puppet is the main issue here. I think the biggest issue is not having a set of shared services that machines can use to test against; this would be the same whether we use Ansible or Puppet. E.g. to add cfssl/PKI support to a new role, you need a PKI/cfssl server that will respond to API requests and provide a certificate for the application's tests, and perhaps mocking it for the unit tests and working with the community.

Also, we build/maintain a lot of tools to address Puppet's deficiencies, most of which aren't used outside the Foundation (PCC, Pontoon, Puppet ENC API, Cumin/spicerack, probably more). These have overlap or would not be necessary with Ansible. This might not be as visible to those with years of experience at the Foundation and/or with Puppet.

It's worth noting that the use of many of these tools is more due to the fact that the Foundation was an early adopter of Puppet; the tools we use have equivalents in the Puppet world now, and migrating to them is also an option.

Here's my outsider experience: I struggled for a couple of weeks trying to get PKI working in deployment-prep, and despite consulting with multiple SREs/ and our volunteer deployment-prep SME, no one's advice got me there. I eventually figured it out, but it was lots of time away from the primary goal of testing Elasticsearch upgrades.

See above; I remain unconvinced that these trickier problems will be solved by simply switching to Ansible.

Our modules have not become generally useable and released on forge (some of them probably could and should but again, different task) and switching tools will not fix this issue.

Sorry, to be clear I'm talking about roles and templates. If you're a young technology enthusiast moving towards a career as a sysad/SRE, which one of these tasks

This seems like a very convoluted example; I think the following would be a fairer comparison:

# Puppet
ensure_packages(['ntp', 'ntpdate'])
service { 'ntp':
  ensure  => running,
  enable  => true,
  require => Package['ntp'],
}

# Ansible (service name kept the same as in the Puppet example)
- name: install ntp
  package:
    name: "{{ item }}"
    state: present
  with_items:
    - ntp
    - ntpdate
- name: manage service ntp
  service:
    name: ntp
    state: started
    enabled: yes

I'm not arguing Puppet DSL is easier, but we should at least compare similar code. I would be curious to see what a more complicated example (e.g. cfssl::cert) would look like in Ansible.

Bolt , with its agentless model and YAML-based plans, is a tacit admission that Puppet's agent approach is too complicated and rapidly losing popularity.

I disagree with this. Bolt tasks, at least when they were initially introduced, are for performing one-off tasks or some type of service orchestration (plans); they are more similar to Spicerack cookbooks in design than a replacement for the Puppet agent. It is also worth noting that Puppet has, from very early on, had puppet apply, which is Puppet's decentralised offering.

If you are to ask me to consider Ansible specifically, my view is that, like operating systems, configuration management tools have their strengths and weaknesses, and at our complexity and scale we will meet those complexities and limits with whatever tool we choose.

I'm sure it's not your intention, but this reads as "we are already doing the best we possibly can." Based on my previous experience, I'd say we definitely can do better.

No, this was not my intention; I think we could do much, much better in this space. My point is that we have a complex environment here, and in my view switching to Ansible will not solve some of the underlying issues we have with testing.

Thanks again for your response.

I'd also say that when comparing something like PCC with Ansible's offering we need to consider the complex cases... we should also not underestimate the ability of PCC to know which roles/hosts it should actually test. These are things I suspect we would need to add to Ansible's testing infrastructure, but I would be happy to be proved wrong.

I think we are talking about different things when we talk about testing. You and your team, as the owners of PCC and config management, would need to build the CI pipeline for Ansible. This will take some time and I don't mean to hand-wave that away. But, you can ask the core Ansible devs questions on IRC. Also, I don't want to dox anybody, but there are probably SREs here (raise your hand if you're brave) that have built an Ansible testing pipeline before ;) . Do note this key difference between Puppet and Ansible:

Ansible believes you should not need another framework to validate basic things of your infrastructure is true. This is the case because Ansible is an order-based system that will fail immediately on unhandled errors for a host, and prevent further configuration of that host. This forces errors to the top and shows them in a summary at the end of the Ansible run.
[...]
Obviously at the development stage, unit tests are great too. But don’t unit test your playbook. Ansible describes states of resources declaratively, so you don’t have to. If there are cases where you want to be sure of something though, that’s great, and things like stat/assert are great go-to modules for that purpose.
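
To make the quoted behaviour concrete, here is a minimal Python sketch of an order-based run that stops configuring a host at its first unhandled error and reports a summary at the end. It is only an illustration of the idea, not Ansible's implementation: the names `run_play`, `HostResult`, the task functions, and the host names are all hypothetical, and hosts are processed one at a time here for simplicity, whereas Ansible's default strategy walks through tasks across hosts.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Task = Tuple[str, Callable[[str], None]]


@dataclass
class HostResult:
    ok: List[str] = field(default_factory=list)       # tasks that completed
    failed: Optional[str] = None                      # first task that failed, if any
    skipped: List[str] = field(default_factory=list)  # tasks never attempted


def run_play(hosts: List[str], tasks: List[Task]) -> Dict[str, HostResult]:
    """Run tasks in order on each host; stop a host at its first failure."""
    results: Dict[str, HostResult] = {}
    for host in hosts:
        result = HostResult()
        for index, (name, action) in enumerate(tasks):
            try:
                action(host)
            except Exception:
                result.failed = name
                # Everything after the failing task is left untouched on this host.
                result.skipped = [later for later, _ in tasks[index + 1:]]
                break
            result.ok.append(name)
        results[host] = result
    return results


def summarize(results: Dict[str, HostResult]) -> List[str]:
    """One recap line per host, loosely modelled on an end-of-run summary."""
    return [
        f"{host}: ok={len(r.ok)} failed={1 if r.failed else 0} skipped={len(r.skipped)}"
        for host, r in results.items()
    ]


# Hypothetical tasks: the config step fails on one host only.
def install_package(host: str) -> None:
    pass


def write_config(host: str) -> None:
    if host == "host2.example":
        raise RuntimeError("undefined template variable")


def restart_service(host: str) -> None:
    pass


TASKS: List[Task] = [
    ("install package", install_package),
    ("write config", write_config),
    ("restart service", restart_service),
]

if __name__ == "__main__":
    for line in summarize(run_play(["host1.example", "host2.example"], TASKS)):
        print(line)
```

In this sketch the service on `host2.example` is never restarted, because the failing config step halts that host before the restart task is reached.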

For the rest of us, SREs (and SWEs) who are just using Puppet as a means to an end (making it easier to operate servers), the current workflow of code change -> run thru CI -> deploy canary -> rollback -> start again could be significantly faster with Ansible.

With Ansible, we can test an individual role or task from our desktops and have a high degree of confidence that it will work BEFORE we submit it to CI. Similar to PCC, we can see potential changes with ansible --check and fully resolve variables and templates. We commit to the Puppet repo ~100 times per week, anything we could do to speed up this process could pay off in the long term.

If there is a way to fully resolve templates and variables without running through CI, let me know. I've asked multiple people and so far the only answer I've gotten is "no."

Personally I'm not convinced that Puppet is the main issue here; I think the biggest issue is not having a set of shared services that machines can use to test against,

Admittedly, I don't know a lot about the shared service APIs we need to mock or build outside of prod, but if making Puppet happy in deployment-prep is a prerequisite for creating those tests, it definitely could be considered an obstacle.

Here's my outsider experience: I struggled for a couple of weeks trying to get PKI working in deployment-prep, and despite consulting with multiple SREs and our volunteer deployment-prep SME, no one's advice got me there. I eventually figured it out, but it was lots of time away from the primary goal of testing Elasticsearch upgrades.

See above; I remain unconvinced that these trickier problems will be solved by simply switching to Ansible.

Let me elaborate a little more on my experience in deployment-prep:

  • I created a cloud server with cloud-init and my cloud public key, but was permanently locked out after a few minutes (Puppet did this, despite the fact that I was a project admin).
  • I was (mistakenly) advised that I could fix the problem by changing deployment-prep's hieradata, a complex data structure which (to the best of my knowledge) has no validation.
  • Next, I was pointed to puppet-ecdsacert. I consulted with several people, including one of the original script authors, but no one could get it to generate valid certs for my hosts' Elastic service.
  • I tried Pontoon, but Puppet broke it too.
  • I eventually got our cluster certs to validate with some dirty hacks. Then and only then was I able to work on my primary goal of upgrading Elastic.
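
As an illustration of what validating a hieradata-like structure could mean in practice, here is a small Python sketch that checks a nested mapping against an expected shape and reports every mismatch with its location. It is a sketch under stated assumptions, not how Hiera or any Foundation tooling works: the `validate` helper, the schema format, and the `profile::example::*` keys are all hypothetical.

```python
from typing import Any, Dict, List


def validate(data: Dict[str, Any], schema: Dict[str, Any], path: str = "") -> List[str]:
    """Return a list of human-readable problems; an empty list means the data fits."""
    errors: List[str] = []
    for key, expected in schema.items():
        location = f"{path}{key}"
        if key not in data:
            errors.append(f"{location}: missing")
            continue
        value = data[key]
        if isinstance(expected, dict):
            # A nested schema means the value itself must be a mapping.
            if isinstance(value, dict):
                errors.extend(validate(value, expected, location + "."))
            else:
                errors.append(f"{location}: expected mapping, got {type(value).__name__}")
        elif not isinstance(value, expected):
            errors.append(
                f"{location}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


# Hypothetical schema and data, standing in for a project's hieradata.
SCHEMA = {
    "profile::example::admins": list,
    "profile::example::tls": {"enabled": bool, "cert_path": str},
}

GOOD = {
    "profile::example::admins": ["alice", "bob"],
    "profile::example::tls": {"enabled": True, "cert_path": "/etc/ssl/example.pem"},
}

BAD = {
    "profile::example::admins": "alice",          # a string where a list is expected
    "profile::example::tls": {"enabled": "yes"},  # wrong type, and cert_path is absent
}

if __name__ == "__main__":
    for problem in validate(BAD, SCHEMA):
        print(problem)
```

A check of this kind would run before the data reaches any host, so a mistake such as a string in place of a list is reported with the offending key rather than surfacing as a lockout later.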

I don't know if this is a typical experience, but there are enough complaints to know it's not an isolated incident. Maybe there are other, better test environments, or there's something particularly bad about the Search platform setup. If you're reading this, feel free to add your experience with our test environments, good or bad.

From my outsider perspective, it seems like we:

  • Build self-maintained tools to mimic the complexity of production, binding us ever tighter to a dying technology and paradigm (Puppet/long-lived servers).
  • Don't always do a great job of maintaining those tools.
  • Wonder why we don't have good test environments.

This is by no means an attack on any tool maintainer. Everyone is using their considerable talents to keep our users happy. But we do have to consider why test environments are so difficult to make and maintain at the Foundation, and (to me anyway) Puppet is the main explanation.

You and others have expressed that migrating off Puppet is a bad investment, considering that most services will eventually be managed through K8s/Helm. That's certainly a reasonable perspective, and if anyone feels strongly either way, I urge them to respond here.

My opinion is that if we are serious about moving to an immutable infrastructure, we should be using tools that fit that paradigm instead of work against it. I think using Ansible for both state management and operations will significantly reduce the amount of code the SRE orgs have to own and maintain. So we wouldn't be just retooling, but exchanging self-maintained tools used by a handful of people for a mature open-source tool with a large, vibrant community.

Our modules have not become generally useable and released on forge (some of them probably could and should but again, different task) and switching tools will not fix this issue.

Sorry, to be clear I'm talking about roles and templates. If you're a young technology enthusiast moving towards a career as a sysad/SRE, which one of these tasks

This seems like a very convoluted example i think the following would be a fairer comparison

Agreed, that is a more fair comparison and apologies for that. I'll try and whip up a playbook for cfssl::cert; if you or anyone else reading this wants to help, hit me up in IRC or email.

and (to me anyway) Puppet is the main explanation.

The problems of deployment-prep are a matter of resourcing, (lack of) team ownership, processes and prioritization, not the tooling.

Let me elaborate a little more on my experience in deployment-prep:

  • I created a cloud server with cloud-init and my cloud public key, but was permanently locked out after a few minutes (Puppet did this, despite the fact that I was a project admin).
  • I was (mistakenly) advised that I could fix the problem by changing deployment-prep's hieradata, a complex data structure which (to the best of my knowledge) has no validation.
  • Next, I was pointed to puppet-ecdsacert. I consulted with several people, including one of the original script authors, but no one could get it to generate valid certs for my hosts' Elastic service.
  • I tried Pontoon, but Puppet broke it too.
  • I eventually got our cluster certs to validate with some dirty hacks. Then and only then was I able to work on my primary goal of upgrading Elastic.

I have felt the same type of pain trying to test pieces of Puppet code
here at the foundation. The creation of PCC and Pontoon both attest to
the need for better Puppet testing strategies. At my last job, we had a
pretty high fidelity staging environment and we also used a masterless
Puppet setup. This allowed for pretty fast development iterations by
developing and testing code in our staging environment. You could change
a piece of code and run the dry run or noop against our staging servers
without committing. Then once you were happy, commit the result. We
still had pain around vetting that our changes in our staging
environment would create the desired effect with different state, facts,
and hiera data in our production environment. We even developed a tool
similar to PCC, https://github.com/braintree/heckler, to vet our changes
in production. I agree with you that Puppet Labs failed to build a good
testing story, at least in the opensource space, which is why shops have
rolled their own solutions, such as GitHub's
https://github.com/github/octocatalog-diff. I would love to see
Puppet's opensource community come together and build tooling as
@MoritzMuehlenhoff mentioned. I'm not sure if you saw @cmooney's
lightning talk at the SRE summit on using https://containerlab.dev to
create network development environments, but I think he showed one
possible avenue for better testing environments. Namely, using our
existing infrastructure code to build high fidelity lab environments
locally which we could then use to develop and test our Puppet code
against. We could perhaps iterate on Pontoon to create the environments.
If we were able to create this tooling it would also be useful for
application testing and even exploring other config tooling such as
Ansible.

The problems of deployment-prep are a matter of resourcing, (lack of) team ownership, processes and prioritization, not the tooling.

I would be tempted to respond with "Why do our test environments take so many resources? Why doesn't anyone want to own them?" But jbond gave me the history of deployment-prep yesterday and it sounds like you're spot on.

I'm not sure if you saw @cmooney's
lightning talk at the SRE summit on using https://containerlab.dev to
create network development environments, but I think he showed one
possible avenue for better testing environments.

I missed that one, but suffice to say I'm a huge cmooney fan ;) . The stakes are always higher with network devices, so it would be great to further minimize risk there.

We even developed a tool
similar to PCC, https://github.com/braintree/heckler, to vet our changes
in production.
I agree with you that Puppet Labs failed to build a good
testing story, at least in the opensource space, which is why shops have
rolled their own solutions, such as GitHub's
https://github.com/github/octocatalog-diff. I would love to see
Puppet's opensource community come together and build tooling as
@MoritzMuehlenhoff mentioned.

Awesome! It is a huge help to see that other organizations have similar problems with Puppet, and how they address them (and how they have better naming conventions ;) ). However, a less charitable reading might be that these problems are endemic to Puppet and that a lot of different organizations have spent a lot of time addressing them.

In the meantime, in the absence of real testing environments, what do we (SREs who don't own Puppet/config mgmt testing, and are just using Puppet as a means to an end) do when we need to make changes in production? Here's what I do; I imagine most people's workflow is probably the same, but please respond if not.

My current change workflow

code change -> run thru CI -> deploy canary -> rollback -> start again

Here's an example from yesterday:

  • 2 PM CST: submit initial patch (Simple change to remove extraneous config)
  • 2:06 PM: deploy (and break) canary
  • 2:20 PM: Start downloading operations/puppet docker image so I can run rspec
  • 2:28 PM: Start writing tests (redownloading Ruby code…each failure/change cycle takes 5m)
  • 3:04 PM: deploy fix
  • 3:07 PM: verify the fix and restart services

How would this be different under Ansible?

  • I could render the template live on the server before committing changes, so I wouldn't make the mistake in the first place.
  • I could more easily reason out variable values. Puppet's immutable variables lead to gross patterns like `$_var = $var`, which leads to the headache of "plumbing" values through different variables, in different plans. (Mutable variables bring other problems, but I'd argue that clarity and immediate feedback work better for humans, not to mention we are hacking around the spirit of immutability when we do this).
  • I could use my Python knowledge to write a unit test. Even though I'm working with a brilliant SWE and long-time Wikimedian, neither of us touch Ruby enough to get good at it. Correct me if I'm wrong, but the only place we really use it at the Foundation is for Puppet-related things. This is an expensive context switch.

Add all these things up and it takes an hour to do something with Puppet that would take minutes with Ansible. Multiply that across every commit that has to be rolled back, and it's clear that we have a significant opportunity to reduce toil. You might argue that it has more to do with our implementation of Puppet than Puppet itself. Fair point, but cleaning up the existing repo just so we can stay on a dying product seems like a bad investment. We're not going to get significantly better at Puppet DSL, and neither is the rest of the world. Ansible also has the benefits of a latecomer, similar to how Prometheus was able to take lessons learned from previous monitoring solutions and simplify for dynamic environments.

Why not just migrate to Kubernetes?

Migrating to Kubernetes would help, but my prior employer was running 7,000 jobs in an orchestrator, in an environment of equal or greater complexity, and we still had plenty of Ansible code (a few dozen roles, plus operational playbooks...operations are something we can't do with Puppet, so we maintain our own custom solution, Spicerack/Cumin).

Not to mention, rewriting an application to run in Kubernetes requires extensive coordination with SWEs and ServiceOps. It is considerably more resource-intensive than translating your current Puppet roles into Ansible playbooks. It's also considerably riskier: if not done right, it could reduce stability for your applications. (This is not to hand-wave away the work needed to build a new testing pipeline, write integrations, etc. for Ansible, which will be considerable; I'm just comparing the work required for an individual role to move to Ansible vs. Kubernetes.)

Conclusion

If you understand YAML, Python, and general config management concepts, you're 90% of the way to understanding Ansible. It has its own problems, including variable code quality between modules, even core modules (I have heard some bad stuff about their apt module specifically). It also has too many ways to do the same thing, but overall I'd say it's a better product than Puppet. Even if it were worse, it has a much brighter future than Puppet.

There was a Wikipedia before Puppet, and there will be a Wikipedia after Puppet. Even if we don't end up changing anything, I think it's healthy to separate the tool from the history, and see what else is out there.

How would this be different under Ansible?

  • I could render the template live on the server before committing changes, so I wouldn't make the mistake in the first place.

I have found this extremely helpful as well. At my last job we regularly ran Puppet noops in production, which allowed us to render a template with full production data. I have made some effort to make that possible here at the foundation with bolt, but it has significant gaps due to our use of PuppetDB and having private data that is only available to the puppet masters.

How would this be different under Ansible?

  • I could render the template live on the server before committing changes, so I wouldn't make the mistake in the first place.

I have found this extremely helpful as well. At my last job we regularly ran Puppet noops in production, which allowed us to render a template with full production data. I have made some effort to make that possible here at the foundation with bolt, but it has significant gaps due to our use of PuppetDB and having private data that is only available to the puppet masters.

Oh wow! I remember us talking bolt a few months ago. How possible is it to feed it arbitrary hieradata + Puppet plan + template and have it do its thing? Secrets and production data aren't that important to me personally (others may disagree), but just being able to see a fully-rendered template and compare to what's already in prod before committing would be huge. It looks like rspec does show you rendered templates, but unless I'm wrong, they only show up if the template fails.

I'll be very straightforward and say that while I think ansible has some merits (and some drawbacks) compared to puppet, there is really no justification for the huge cost of a transition right now for our production environment. I would be quite interested, OTOH, to explore alternative configuration management strategies if we are to start managing assets on public clouds. I would personally try to go for immutable infra there, so terraform + VM or docker images, for instance.

I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in the discussion. I would suggest we close it as declined.

@bking I've seen others already pointed you to our tools to help your puppet workflow, but if you want to enhance your productivity using puppet, I'm sure both me and any other engineer with long familiarity with our environment can show you how we've maximized it.

As a general note, I'll say that productivity arguments when choosing a programming language / programming tool leave me perplexed, often - I find that I spend most time thinking about what to do and how to do it, and if I struggle with the correctness of the code, I should probably have written tests to help me.

Oh wow! I remember us talking bolt a few months ago. How possible is it
to feed it arbitrary hieradata + Puppet plan + template and have it do
its thing? Secrets and production data aren't that important to me
personally (others may disagree), but just being able to see a
fully-rendered template and compare to what's already in prod before
committing would be huge. It looks like rspec does show you rendered
templates, but unless I'm wrong, they only show up if the template
fails.

If the piece of code you are testing doesn't fall into any of the
caveat buckets I mentioned, then it works pretty well, e.g.:

$ cd puppet
$ sed -i 's/30G/50G/' hieradata/role/common/elasticsearch/cirrus.yaml
$ bolt-wmf -t elastic1060.eqiad.wmnet -- apply --noop
<snip>
Notice: /Stage[main]/Elasticsearch/Elasticsearch::Instance[production-search-eqiad]/File[/etc/elasticsearch/production-search-eqiad/jvm.options]/content:
--- /etc/elasticsearch/production-search-eqiad/jvm.options    2022-11-15 21:20:38.961089912 +0000
+++ /tmp/puppet-file20221117-2319842-1jl4tml  2022-11-17 21:40:20.664669985 +0000
@@ -19,8 +19,8 @@
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

--Xms30G
--Xmx30G
+-Xms50G
+-Xmx50G

################################################################
## Numa Awareness
<snip>

I have some ideas on how to make bolt-like functionality work better
with our infrastructure, e.g. compile the catalog on the puppet master,
but I haven't found the time to explore them as of yet.
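
The kind of preview shown in the noop output above (render a template with changed data, then diff it against what is deployed) can be approximated with a few lines of Python. This is a rough sketch, not what bolt, Puppet, or Ansible do internally: `string.Template` stands in for a real template engine such as ERB or Jinja2, and the `render` and `preview_change` helpers, the `heap` variable, and the two-line template are hypothetical.

```python
import difflib
from string import Template
from typing import Dict


def render(template_text: str, variables: Dict[str, str]) -> str:
    """Fill in the template; substitute() raises KeyError if a variable is missing."""
    return Template(template_text).substitute(variables)


def preview_change(deployed_text: str, template_text: str,
                   variables: Dict[str, str], path: str = "jvm.options") -> str:
    """Return a unified diff between the deployed file and the freshly rendered one."""
    rendered = render(template_text, variables)
    diff = difflib.unified_diff(
        deployed_text.splitlines(keepends=True),
        rendered.splitlines(keepends=True),
        fromfile=f"{path} (deployed)",
        tofile=f"{path} (rendered)",
    )
    return "".join(diff)


# Illustrative inputs only: a two-line template and the content currently on disk.
TEMPLATE = "-Xms${heap}\n-Xmx${heap}\n"
DEPLOYED = "-Xms30G\n-Xmx30G\n"

if __name__ == "__main__":
    # An empty string would mean the change is a no-op for this file.
    print(preview_change(DEPLOYED, TEMPLATE, {"heap": "50G"}), end="")
```

Because a missing variable raises an error instead of rendering an empty value, a typo in the data is caught at preview time rather than after a canary deploy.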

In T321874#8401334, @Joe wrote: I’ll be very straightforward and say that while I think ansible has some merits (and some drawbacks) compared to puppet, there is really no justification for the huge cost of a transition right now for our production environment. I would be quite interested, OTOH, to explore alternative configuration management strategies if we are to start managing assets on public clouds. I would personally try to go for immutable infra there, so terraform + VM or docker images, for instance.

I agree that a wholesale migration would be extremely difficult, for marginal value.

I don’t think there is a productive and actionable outcome of the discussion in this task, nor that we’ve made progress in the discussion. I would suggest we close it as declined.

That said, I found this conversation quite helpful. @bking’s description of his workflow and struggles with developing on our infrastructure mirror my own and provide validation of where our infrastructure development workflow could improve. At the very least I would love to see better documentation on how other folks develop Puppet code, because there may be quite a bit of variation and some folks may have developed superior methods. Ideally, I would like to spend time developing better tools to improve our workflows.

I can definitely relate with the long (and stressful!) cycles of Puppet patches you mention @bking and that was one of my main motivations for starting Pontoon almost three years ago now.

My workflow nowadays looks like the following:

  1. Write puppet.git change(s) locally
  2. git push to my team's (o11y) Pontoon stack on Cloud VPS
  3. run-puppet-agent on the relevant/affected Pontoon hosts
  4. GOTO 1 until me and Puppet are happy with the result
  5. utils/run_ci_locally.sh (or just send the patch for review and let CI run)
  6. Merge/deploy the patch in production as usual after review

Needless to say, I'm a fan of the peace of mind that having a sandbox similar to production has provided me. Hope that helps!

I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in the discussion. I would suggest we close it as declined.

Agreed. In retrospect, convincing someone of the value of a tool on a message board is NOT a winning strategy. But, I would caution against reading too much into people's responses (or lack thereof). I have had people tell me privately, "Thanks for bringing this up, I wouldn't have the energy," and "From reading the rest of the thread I think it might make sense for any net-new projects to try out Ansible and/or Terraform. The Puppet vs Ansible argument can become really religious. I would also say that the companies that are behind the projects and products are just as important as the community around them."

As a general note, I'll say that productivity arguments when choosing a programming language / programming tool leave me perplexed, often - I find that I spend most time thinking about what to do and how to do it, and if I struggle with the correctness of the code, I should probably have written tests to help me.

When it comes to config management, most people prefer to write markup and avoid thinking about it much, if at all. Puppet realizes their heavyweight approach is losing; that's why Bolt is a feature-by-feature copy of Ansible (agentless, YAML plans etc). It's also why a lot of people who used to get paid to write Puppet are now looking for a job.

Expressing My Gratitude

Thanks to everyone who suggested productivity improvements, jbond, jhathaway and fgiunchedi in particular. Jessie, I found your bolt-wmf repo ;) happy to help work on that anytime. All of you have already made a huge difference in my day-to-day productivity, and I am very much in y'all's debt.

What's Next? Ansible Demo Opportunities

If you've never really used Ansible and would like a demo, respond here or hit me up on email or IRC. I have friends at Red Hat and they've offered to present Ansible Automation Platform, which does config management, operations, and ad-hoc commands (we use Puppet, Spicerack, and Cumin). We'd use the upstream community version, AWX, the open-source upstream of the product formerly known as Ansible Tower. They are aware of our open-source policy, so there won't be any expectations that we buy the commercial version. (Or if you just want a quick primer on how it works and what's different, ping me anytime).

Since the holidays are upon us, I'd like to keep the ticket open until mid-January or so if that is OK with everyone else. I'll hit the SRE email list after we return from break, and if at least 10 people are interested, I'll schedule the demo.

If not enough people are interested by then, I'll close the ticket and, in accordance with Mad Ned's Principles of "Do It Anyway" and "Two-and-Done," never mention Ansible for WMF again.

I can definitely relate with the long (and stressful!) cycles of Puppet patches you mention @bking and that was one of my main motivations for starting Pontoon almost three years ago now.

Needless to say, I'm a fan of the peace of mind that having a sandbox similar to production has provided me. Hope that helps!

Thanks, I will definitely give Pontoon another go. I'm also extremely excited that Terraform is now an option for WMCS. I've built out a skeleton for using it on deployment-prep; maybe there is some opportunity for integrations there? No idea, but do hit me up if that interests you.

I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in the discussion. I would suggest we close it as declined.

Agreed. In retrospect, convincing someone of the value of a tool on a message board is NOT a winning strategy. But, I would caution against reading too much into people's responses (or lack thereof). I have had people tell me privately, "Thanks for bringing this up, I wouldn't have the energy," and "this tends to be a religious war, you should only suggest Ansible for net-new environments."

I think describing this discussion as a religious war is quite unfair. Attributing the comment to someone else isn't going to make it less disrespectful of the people who dedicated time and effort to discussing this topic with you.

Also, given your comment is attributing some form of emotional attachment to Puppet to anyone who's disagreeing with you, let me state it clearly: I just consider the value of rewriting an extremely large code repository to a new language expensive and unjustifiable unless there is a compelling reason to do so. I didn't see any presented here. In a void without prior art, I'd probably pick ansible today just because I prefer writing python to writing ruby.

Aklapper renamed this task from Consider alternative configuration managment tooling to Consider alternative configuration management tooling.Nov 22 2022, 4:51 PM

I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in the discussion. I would suggest we close it as declined.

Agreed. In retrospect, convincing someone of the value of a tool on a message board is NOT a winning strategy. But, I would caution against reading too much into people's responses (or lack thereof). I have had people tell me privately, "Thanks for bringing this up, I wouldn't have the energy," and "this tends to be a religious war, you should only suggest Ansible for net-new environments."

I think describing this discussion as a religious war is quite unfair. Attributing the comment to someone else isn't going to make it less disrespectful of the people who dedicated time and effort to discussing this topic with you.

I've updated my post to the exact quote. I don't think this has offended anyone who did engage, but if it did please let me know and I will remove it from this page entirely. I don't know how to read "attributing the comment to someone else," can you clarify? It sounds like you are accusing me of sockpuppeting. If you don't believe me, I can ask the person if they are OK with me revealing their personal details.

Also, given your comment is attributing some form of emotional attachment to Puppet to anyone who's disagreeing with you,

I apologize if that was the impression you got. I mentioned this primarily due to an offline discussion with Mark about this topic, specifically about why not many people were engaging. People might be afraid to broach the topic, too busy, never saw the ticket, etc. The main point being that engagement on Phab tickets != interest in a particular topic.

let me state it clearly: I just consider the value of rewriting an extremely large code repository to a new language expensive and unjustifiable unless there is a compelling reason to do so. I didn't see any presented here. In a void without prior art, I'd probably pick ansible today just because I prefer writing python to writing ruby.

Yes, but when has anyone ever decided to replace a tool they've been using for years with one they've barely used, based entirely on a message board discussion? In retrospect, this is not the appropriate venue and I was extremely lucky to get anything beyond "it's too much work." Which sounds like what you're saying. Given my approach, yours is a perfectly reasonable response.

Your input as a WMF veteran and exemplary SRE is extremely important, and I don't take that lightly. But as you stated earlier, there is probably no productive and actionable outcome here in this thread. Bringing us back to the ticket topic, "Consider alternative configuration management tooling," I've outlined what I think should be the next steps a few posts above this one, notably without a further commitment to change.

At the very least, I think it would be healthy to see what has changed in the config management world in the ~14 years since WMF adopted Puppet, to monitor the ecosystem of one of our most important applications, etc. If you have any objections or modifications to this plan, please let me know.

Hey everyone,

I think this discussion would benefit greatly from a higher bandwidth venue than Phabricator. It's quite clear there are pain points regarding the current status quo and I'd like us to see what we can do to make it better. I'll take an action item to set something up. Let's give it a shot and see if it fares better than Phabricator.