Right now the Prometheus instances we have in puppet are defined de facto by the various profile::prometheus::<instance> classes. The parent task documents the need to be able to move instances over to different hardware pairs in the same site.
To this end, I believe we should move to a central parameter/configuration data structure to define prometheus instances. This is not dissimilar to what we're doing for prometheus + k8s in kubernetes::clusters, though in this case we'd move the definition from there to this new configuration.
I took a look at the current puppet code and I think something like this would work:
prometheus::instances:
  ops:
    port: 9900
    retention_time: <string>
    retention_size: <string>
    thanos_upload: <bool>
    k8s_master_url: <url>  # if defined enables k8s support
    pki_name: <string>     # if defined will acquire a client cert from this intermediate
    hosts: <array>         # a list of hosts where this instance is deployed to
  k8s:
    port: ...
    k8s_master_url: kubemaster.svc.$::site.wmnet
    ...
  k8s-staging:
    ...
Similarly to kubernetes::clusters_defaults, there will be a defaults data structure and support functions to extract data as needed.
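To make the defaults idea concrete, here is a rough sketch of what such a support function could do. It is written in Python purely for illustration (the real thing would be a puppet function backed by hiera). The function name, the default values and the k8s port are made up for the example, not taken from the current code.

```python
from typing import Any, Dict

Config = Dict[str, Any]


def instance_config(instances: Dict[str, Config], defaults: Config, name: str) -> Config:
    """Return the settings for instance `name`, with defaults filled in.

    Keys set on the instance win over the defaults (shallow merge).
    """
    if name not in instances:
        raise KeyError(f"unknown prometheus instance: {name}")
    return {**defaults, **instances[name]}


# Hypothetical data: the values below are placeholders, not real settings.
defaults = {"retention_time": "1d", "thanos_upload": False, "hosts": []}
instances = {
    "ops": {"port": 9900, "thanos_upload": True},
    "k8s": {"port": 9901, "retention_time": "2d"},
}

ops = instance_config(instances, defaults, "ops")
k8s = instance_config(instances, defaults, "k8s")
```

A shallow merge is probably enough here since the proposed per-instance settings are flat; nested settings would need a deep merge instead.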
The hosts array is what enables instances to be allocated to a set of hosts. In other words, a given host H will get a given instance only if its hostname appears in that instance's hosts.
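The selection rule for hosts is simple enough to sketch. Again this is Python for illustration only, with a hypothetical function name and placeholder hostnames; in puppet this would likely be a filter over the instances hash.

```python
from typing import Any, Dict, List


def instances_for_host(instances: Dict[str, Dict[str, Any]], hostname: str) -> List[str]:
    """Return the names of the instances that list `hostname` in their hosts array."""
    return sorted(
        name
        for name, cfg in instances.items()
        if hostname in cfg.get("hosts", [])
    )


# Hypothetical hostnames and ports, for the example only.
instances = {
    "ops": {"port": 9900, "hosts": ["prom-a.example", "prom-b.example"]},
    "k8s": {"port": 9901, "hosts": ["prom-c.example", "prom-d.example"]},
    "k8s-staging": {"port": 9902, "hosts": ["prom-a.example", "prom-b.example"]},
}

on_a = instances_for_host(instances, "prom-a.example")
on_c = instances_for_host(instances, "prom-c.example")
```

An instance with no hosts key is simply deployed nowhere, which seems like the safer default.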
The hosts array also enables building a map of instance => hostname:port/instance, so that requests made to prometheus.svc.SITE.wmnet/instance can be reverse-proxied to the right place. The reverse-proxying will be done by apache on prometheus hosts like we do today, although today in practice the map looks like instance => localhost:port/instance. In other words, all prometheus hosts in SITE will be pooled behind prometheus.svc.SITE.wmnet, and they will route requests to the right host:port/instance as needed.
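The reverse-proxy map could be derived from the same data along these lines. This is a Python sketch under assumptions: the function name and hostnames are hypothetical, and it assumes each instance maps to a list of backends (one per host in the pair) rather than a single one. How apache picks between the backends is left open.

```python
from typing import Any, Dict, List


def proxy_map(instances: Dict[str, Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each instance name to its backends, formatted as host:port/instance."""
    return {
        name: [f"{host}:{cfg['port']}/{name}" for host in cfg.get("hosts", [])]
        for name, cfg in instances.items()
    }


# Hypothetical hostnames and ports, for the example only.
instances = {
    "ops": {"port": 9900, "hosts": ["prom-a.example", "prom-b.example"]},
    "k8s": {"port": 9901, "hosts": ["prom-c.example"]},
}

backends = proxy_map(instances)
```

Today's setup is the special case where every entry is localhost:port/instance.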
As part of this work we'll also have to complete T326657: Add prometheus-https load balancer, since oauth2-proxy will be a requirement too.