This proposal is partly the result of the discussions we had at the 2016 TechOps offsite, and of my readings before and after it.
The goal of this proposal is to make an incremental shift in how we organize our puppet code in the tree, to better standardize the code structure and make it simpler to read and to debug. It draws ideas from most puppet best practices and should be immediately applicable to new code we write from now on, without disrupting the existing code.
**General principles**
# The code should be organized in roles, profiles, and modules. Let's see in detail what each of these things will mean.
# __Modules should be basic units of functionality__ (e.g. "set up, configure and run HHVM") and __must not use classes from other modules__, and mostly avoid, wherever possible, to use defines from other modules as well. __No hiera call, explicit or implicit, should happen within a module__. These rules will ensure the amount of WMF-specific code that makes it into modules is minimal, and improve debugging/refactoring/testing of modules as they don't really depend on each other. Keeping up with the HHVM module example, the base class `hhvm` is a good example of what should be in a module, while `hhvm::admin` is a good example of what should __not__ be in a module; surely not in this one: it is a class to configure apache to forward requests to HHVM, depends mostly on another module (apache, in fact) and also adds ferm rules, which of course requires the WMF-specific `network::constants` class.
# __Profiles are collections of resources in modules that represent an high-level functionality__ ("a webserver able to serve mediawiki"), should only have parameters that default to __explicit hiera calls__. This means that a typical profile will either need all of its hiera variables declared, or to be included by an ENC (like horizon) with its parameters declared. Unless specifically needed, no resource should be added to a profile using the `include class` method, but with explicit class instantiations. If a profile need another one as a precondition, it must be listed with a `require ::profile::foo` at the start of the class, but this should be mostly avoided. Most of our current "roles" should be profiles. Following our example, an apache webserver that proxies to a local HHVM installation should be configured via a 'profile::hhvm::web' class; a mediawiki installation served through such a webserver should be configured via a 'profile::mediawiki::web' class.
# __Roles should only include profiles; A role can include more than one profile, but no conditionals, hiera calls, etc are allowed__. So following our example, we should have a 'role::mediawiki::web' that just includes 'profile::mediawiki::web' and 'profile::hhvm::web'. Inheritance can be used between roles, but it is strongly discouraged: for instance, it should be remembered that inheritance will __not__ work in hiera.
# Any node in site.pp must include exactly one role, with the "role()" function. No exceptions allowed.
# Almost all hiera definitions should go in the "role" heirarchy. Only exceptions can be shared data structures that can be used by many profiles or to feed different modules with their data. Those should go in the `common/$site` global hierarchy. A good example is, in our codebase, `ganglia_clusters`, or any list of servers (the memcached hosts for mediawiki, for example).
# Per-host hiera should only be used to allow tweaking some knobs for testing or to maybe declare canaries in a cluster. It should not be used to add/subtract functionality. If you need to do that, add a new role and configure it accordingly, within its own hiera namespace.
**Expected advantages**
This will have a series of advantages, namely:
- Make it easier to write general-purpose, non-wmf-specific modules
- Ease of debug: it will be easy to see where something happens (typically, the profiles that are part of your role), hiera calls will all be explicit in the code and mostly contained into a single file (the one for the role declared in `site.pp`). It will thus be pretty easy to track down a variable lookup
- Using configurable profiles will allow easier reuse of the same code when we have more than one cluster, see e.g. the use of our current `role::elasticsearch` in different practical roles
- Labs and production roles might differ without harm; so for instance a production role might include standard and base::firewall, while a labs role might include some different classes.
On the other hand, some drawbacks have to be expected:
- a stricter code structure will sometimes present its hurdles and make it harder to solve some problems
- given the limits we just imposed, sometimes it will be hard to abstract some common code to a separate class and this might cause some more repetitions.
- The distinctions between what is a profile and what is not can be ambiguous and create some controversy.
**A practical example**
If you look at the `role::elasticsearch::*` roles, and you rename `role::elasticsearch::common` to `profile::elasticsearch::common`, you'll have a pretty good idea of what to expect. Also, the old-style `role::etcd` can be compared to the newer `role::etcd::{common,networking,kubernetes}` that were written following more or less the role/profile pattern. It's easy to see how the old-fashioned `role::etcd` can be molded easily in deriving from the "profile" (`role::etcd::common`), while accomodating all the use-cases with the original `role::etcd` would have required adding more and more parameters, and using regexes to distinguish between roles.
Also, since swift has always been considered a benchmark for the complexity of our manifests, an example change molding the swift roles into using the proposed standard has been posted to gerrit and refers to this task.