(suggestions for a better title most welcome)
Currently the operations puppet repository is used by both cloud services and the production network. However, the two environments use puppet in subtly different ways; the main differences are the ENC and the hiera structure. Over the years this has caused a number of issues. For instance, a member of the SRE team (most often the OP) introduces a change which is tested with production in mind and is green in CI and PCC. However, when the change is merged and deployed it causes issues in cloud services, often due to missing hiera defaults but sometimes triggered by one of the other subtle differences.
Other than day-to-day issues, the use of a shared repo also means that the velocity of change is pinned to the slowest moving part. Cloud services has to deal with a lot more nuanced installations with many varying factors. This means that certain things, such as dropping support for an old version of puppet, are often more difficult and slower to achieve, as they require more coordination and communication with a wider audience.
Over the years there have been discussions on how we could ease this pain, with many solutions proposed. Although some small changes have been achieved, the underlying issues still persist. This task is an effort to re-invigorate those discussions and try to resolve these issues once and for all. At the very least I think it would be useful to link all of the various efforts that have been proposed over time and to document the differences and nuances between the cloud services and production puppet environments.
== Divergent ==
=== Hiera structure ===
One of the main differences between production and the cloud environment is the structure used by hiera. This has been documented in T255787 but the key points are repeated here.
==== Production ====
* makes use of the wmflib::expand_paths for common and site expansion
* makes use of a `$_role` variable created with the role function (see below)
==== Cloud ====
* makes use of cloudlib::httpyaml to fetch data from openstack.
* has some additional parts to the hierarchy, e.g.
** "cloud/%{::wmcs_deployment}/%{::labsproject}/hosts/%{::hostname}.yaml"
** "cloud/%{::wmcs_deployment}/%{::labsproject}/common.yaml"
** "cloud/%{::wmcs_deployment}.yaml"
** "cloud.yaml"
* can also use a secret repo outside of git (on the puppetmaster FS)
** /etc/puppet/secret/hieradata/%{::labsproject}.yaml
* uses a different hierarchy in the private repo
** "labs/%{::labsproject}/common.yaml"
** "%{::labsproject}.yaml"
** "labs.yaml"
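To make the cloud hierarchy above more concrete, here is a minimal Python sketch of how the `%{::name}` tokens in those paths resolve for a single node. It is only an illustration and not how hiera is implemented; the function name `expand_hierarchy` and the example values (`testproject`, `web01`) are made up for this example.

```python
import re

# Cloud-specific hierarchy levels as listed above, most specific first.
CLOUD_HIERARCHY = [
    "cloud/%{::wmcs_deployment}/%{::labsproject}/hosts/%{::hostname}.yaml",
    "cloud/%{::wmcs_deployment}/%{::labsproject}/common.yaml",
    "cloud/%{::wmcs_deployment}.yaml",
    "cloud.yaml",
]

# Matches hiera-style top-scope interpolation tokens such as %{::labsproject}.
_TOKEN = re.compile(r"%\{::(\w+)\}")


def expand_hierarchy(paths, variables):
    """Replace each %{::name} token with the matching value from variables."""
    return [_TOKEN.sub(lambda m: variables[m.group(1)], p) for p in paths]


# Hypothetical node values, for illustration only.
example_variables = {
    "wmcs_deployment": "eqiad1",
    "labsproject": "testproject",
    "hostname": "web01",
}

for path in expand_hierarchy(CLOUD_HIERARCHY, example_variables):
    print(path)
```

A production node has none of these variables set, which is one reason a default that only exists in a production lookup path can be missing for a cloud node.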
=== Node classifier ===
The node classifier is essentially used to provide a list of classes that should be applied to a node, as well as some additional parameters (aka hiera keys).
==== Production ====
In the production environment we use the [[ https://github.com/wikimedia/puppet/blob/production/manifests/site.pp | site.pp ]] manifest along with a custom [[ https://github.com/wikimedia/puppet/blob/production/manifests/site.pp | role function ]]. When called with e.g. `role(foo::bar)` it does two things:
* loads the class `role::foo::bar`
* injects a global variable (node parameter) `$_role = foo/bar` into the manifest. The main use case for this is to look up role-specific parameters in hiera (as noted above).
Recently there has been [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/680266 | some effort ]] to add the role variable to the cloud node classifier, however it is currently stalled (see comments on the change).
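The mapping performed by the role function can be sketched in a few lines of Python. This is only a model of the two effects described above, not the actual Ruby implementation, and the Python function name `role` is simply borrowed for the illustration.

```python
def role(name):
    """Model the two effects of calling role(name) in site.pp.

    Returns a tuple of:
      - the class that gets loaded, e.g. role::foo::bar
      - the value injected as $_role, e.g. foo/bar
    """
    class_name = f"role::{name}"
    role_variable = name.replace("::", "/")
    return class_name, role_variable


class_name, role_variable = role("foo::bar")
print(class_name)     # role::foo::bar
print(role_variable)  # foo/bar
```

Any cloud-side equivalent would need to produce the same `$_role` value for hiera paths keyed on it to resolve identically in both environments.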
==== Cloud ====
The cloud environment uses a [[ https://github.com/wikimedia/puppet/blob/production/modules/openstack/files/puppet/master/labs-puppet-enc.py | custom script ]] which queries the openstack API to produce a list of classes and additional hiera keys to apply to a node. This functionality enables community members to easily test out classes from the puppet repo, swapping hiera values and pairing different profile classes, without the need to make a commit to the puppet repository.
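For readers less familiar with external node classifiers, the general shape of what such a script returns can be sketched as follows. This is not the code of labs-puppet-enc.py: the helper `build_enc_output`, the class names and the hiera key are hypothetical, and the openstack API call is replaced with hard-coded data. The only thing taken from puppet itself is the convention that an ENC returns a document with `classes` and `parameters` keys.

```python
import json


def build_enc_output(classes, parameters):
    """Build the data structure an ENC hands back to the puppet master.

    classes:    iterable of class names to apply to the node
    parameters: dict of additional node parameters
    """
    return {
        "classes": sorted(classes),
        "parameters": dict(parameters),
    }


# Hypothetical data standing in for the result of an openstack API query.
example_classes = ["profile::example::web", "profile::example::base"]
example_parameters = {"profile::example::web::port": 8080}

output = build_enc_output(example_classes, example_parameters)
# Printed as JSON here purely for readability.
print(json.dumps(output, indent=2))
```

Because this data comes from an API rather than from git, a project member can change it without a commit, which is exactly the flexibility that site.pp does not offer.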
=== PuppetDB ===
The cloud environment doesn't have a puppetdb installation (although some individual projects may). This means that any use of either [[ https://puppet.com/docs/puppet/7/lang_exported.html | exported resources ]] or functions which rely on puppetdb, e.g. `puppetdb_query`, won't work in the cloud environment. The lack of puppetdb also means that the cumin puppetdb backend does not function in the cloud environment; however, I think that issue is out of scope for this task.
=== Global Variables ===
The puppet repo configures a number of global variables via the [[ https://github.com/wikimedia/puppet/blob/production/manifests/realm.pp | realm.pp ]] manifest. Some of these variables are the same in both environments, some are different, and some only exist in the cloud environment. The two variables which differ are:
* `$realm`: this is set to either 'labs' or 'production' depending on which DNS domain a node is in.
* `$nameservers`: in production this points to the production anycast service. The cloud environment sets this to the cloud services DNS servers, which have logic to auto-populate entries for nodes created in openstack.
The `$realm` variable is also used extensively in the puppet policy.
==== Only in cloud ====
Most of these variables are used to provide additional lookup paths in hiera (see above).
* `$labsproject` this points to the openstack/horizon project of the node
* `$wmcs_deployment` This indicates either the cloud production (eqiad1) or development environment (codfw1dev)
* `$projectgroup` this is equal to `"project-${labsproject}"` (not sure of the use case, hoping cloud services can clarify)
In order to produce the variables above, some temporary variables were also used. As they are defined in realm.pp they are also injected into node scope, so we list them here to be explicit:
* `$pieces` this equals `$_trusted_certname.split('[.]')`
* `$dnsconfig` used to populate the `$nameservers` variable
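As a rough illustration of how the cloud-only variables relate to each other, the sketch below derives them from a certname. The split on `.` and the `project-` prefix come from the list above. The positions of the project and deployment within the certname are an assumption made for this example (a name of the form `host.project.deployment.…`); realm.pp should be checked for the real logic.

```python
def cloud_variables(certname):
    """Derive cloud-only variables from a node certname (illustrative only)."""
    # Equivalent of $pieces = $_trusted_certname.split('[.]')
    pieces = certname.split(".")

    # ASSUMPTION: the second label is the project and the third is the
    # deployment. This layout is not confirmed by the task description.
    labsproject = pieces[1]
    wmcs_deployment = pieces[2]

    return {
        "pieces": pieces,
        "labsproject": labsproject,
        "wmcs_deployment": wmcs_deployment,
        # Equivalent of $projectgroup = "project-${labsproject}"
        "projectgroup": f"project-{labsproject}",
    }


# Hypothetical certname, for illustration only.
for key, value in cloud_variables("web01.testproject.eqiad1.wikimedia.cloud").items():
    print(f"{key} = {value}")
```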
== Possible ways forward ==
* It should be possible to completely drop the realm variable and rely instead on hiera to control the different logic paths. This will likely require a lot of refactoring, however it should reduce the number of code paths which differ, moving the majority of the problem to hiera.
* Add puppetdb to cloud services. I suspect this has been investigated many times and is likely difficult to support, for instance for projects with their own puppet masters.
* Explore the possibility of adding wmflib::expand_path to the cloud services hiera. This could be an additional level with the lowest hiera priority. This is one of the areas that causes the most day-to-day pain and feels like it could be a quick win.
* Inject role and use the role variable in the cloud services hiera structure [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/680266 | 680266 ]]
* Separate puppet repos (in some form). This is something that has been discussed a few times with many different proposals; I think it would be useful to try and resurrect some of those discussions/ideas.