Configure Prometheus instance centrally
Closed, Resolved (Public)

Description

Right now the Prometheus instances we have in puppet are defined "de facto" by the various profile::prometheus::<instance> we have. The parent task documents the need to be able to move instances over to different hardware pairs in the same site.

To this end, I believe we should move to a central parameter/configuration data structure to define prometheus instances. This is not dissimilar to what we're doing for prometheus + k8s in kubernetes::clusters, though in this case we'd move the definition from there to this new configuration.

I took a look at the current puppet code and I think something like this would work:

prometheus::instances:
  ops:
    port: 9900
    retention_time: <string>
    retention_size: <string>
    thanos_upload: <bool>
    k8s_master_url: <url> # if defined enables k8s support
    pki_name: <string> # if defined will acquire a client cert from this intermediate
    hosts: <array> # a list of hosts where this instance is deployed to
  k8s:
    port: ...
    k8s_master_url: kubemaster.svc.$::site.wmnet
    ...
  k8s-staging:
    ...

Similarly to kubernetes::clusters_defaults, there will be a defaults data structure and support functions to extract data as needed.
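To make the defaults idea concrete, here is a minimal Python sketch of how a support function might overlay per-instance settings on a defaults structure. It is only an illustration of the lookup logic, not the Puppet implementation: the names INSTANCE_DEFAULTS and instance_config and the default values are assumptions; only the ops port comes from the proposal above.

```python
from typing import Any

# Hypothetical defaults: placeholder values, not production settings.
INSTANCE_DEFAULTS: dict[str, Any] = {
    "retention_time": "30d",
    "thanos_upload": False,
    "hosts": [],
}

# Hypothetical instance data in the shape proposed above.
INSTANCES: dict[str, dict[str, Any]] = {
    "ops": {"port": 9900, "thanos_upload": True},
}


def instance_config(
    instances: dict[str, dict[str, Any]],
    name: str,
    defaults: dict[str, Any] = INSTANCE_DEFAULTS,
) -> dict[str, Any]:
    """Return one instance's settings with defaults filled in.

    This is a shallow merge: keys set on the instance win, and keys the
    instance omits fall back to the defaults.
    """
    return {**defaults, **instances[name]}


ops = instance_config(INSTANCES, "ops")
print(ops["port"], ops["retention_time"], ops["thanos_upload"])
```

A shallow merge keeps the behaviour easy to reason about; nested settings would need a deep merge instead.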

The hosts array is what enables instances to be allocated to a set of hosts: a given host H will get a given instance only if H's hostname appears in that instance's hosts.
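The allocation rule can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the actual Puppet code: the function name, host names, and the non-ops ports are placeholders.

```python
# Hypothetical data: host names and most ports are placeholders.
INSTANCES = {
    "ops": {"port": 9900, "hosts": ["host-a", "host-b"]},
    "k8s": {"port": 9901, "hosts": ["host-b"]},
    "k8s-staging": {"port": 9902},  # no hosts key: deployed nowhere
}


def instances_for_host(instances, hostname):
    """Return the sorted names of instances whose hosts list includes hostname."""
    return sorted(
        name
        for name, cfg in instances.items()
        if hostname in cfg.get("hosts", [])
    )


print(instances_for_host(INSTANCES, "host-a"))
print(instances_for_host(INSTANCES, "host-b"))
```

Moving an instance to a different hardware pair then amounts to editing its hosts list.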

The hosts array also enables building a map of instance => hostname:port/instance, so that requests made to prometheus.svc.SITE.wmnet/instance can be reverse-proxied to the right place. Said reverse-proxy will be done by apache on prometheus hosts like we do today, although today in practice the map looks like instance => localhost:port/instance. In other words, all prometheus hosts in SITE will be pooled behind prometheus.svc.SITE.wmnet, and they will route requests to the right host:port/instance as needed.
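As a rough sketch of how that map could be derived from the same data, assuming each instance may run on more than one host (so each entry is a list of backends): the function name, host names, and non-ops port are hypothetical, and the real apache configuration is generated by Puppet.

```python
# Hypothetical data: host names and the k8s port are placeholders.
INSTANCES = {
    "ops": {"port": 9900, "hosts": ["host-a", "host-b"]},
    "k8s": {"port": 9901, "hosts": ["host-b"]},
}


def build_proxy_map(instances):
    """Map each instance name to its backends, formatted as host:port/instance."""
    return {
        name: [f"{host}:{cfg['port']}/{name}" for host in cfg.get("hosts", [])]
        for name, cfg in instances.items()
    }


for instance, backends in build_proxy_map(INSTANCES).items():
    print(instance, "=>", backends)
```

Any pooled host can use this map to forward a request for /instance to a host that actually runs that instance.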

As part of this work we'll also have to complete T326657: Add prometheus-https load balancer, since oauth2-proxy will be a requirement too.

Event Timeline

Change #1057187 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove absented resources

https://gerrit.wikimedia.org/r/1057187

Change #1057188 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: clean up legacy parameters

https://gerrit.wikimedia.org/r/1057188

Also cc @akosiaris, @JMeybohm, @Clement_Goubert for input on the above: this change will effectively move prometheus k8s configuration out of kubernetes::clusters. In the process we'll resolve the k8s / prometheus cluster nomenclature discrepancy, in the sense that from prometheus' POV a k8s cluster is reachable through its master API url and not by cluster name. Hope that makes sense, and let me know what you think!

Change #1057187 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove absented resources

https://gerrit.wikimedia.org/r/1057187

Change #1057188 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: clean up legacy parameters

https://gerrit.wikimedia.org/r/1057188

Thanks for this. Overall LGTM as a plan. I do have a clarifying question regarding the nomenclature discrepancy, though: what exactly will the change be? Do we expect labels/datasources in Grafana to break due to this change?

Yeah, it wasn't super clear. What I meant is that right now we're doing this to pick the prometheus instance name from the k8s cluster name: $k8s_cluster = pick($k8s_config['prometheus']['name'], "k8s-${cluster_name}"), i.e. we have to explicitly call out prometheus' name because sometimes prometheus and k8s names don't match.

In the proposed approach the k8s cluster is identified by prometheus by its master API url only. In other words, no label changes are expected, since at that level everything will remain the same (i.e. prometheus instance names are unchanged).
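For readers less familiar with Puppet, the current naming fallback quoted above can be paraphrased in Python roughly as follows. This is a sketch of the intent only: the pick helper imitates the behaviour of the Puppet stdlib function of the same name (first value that is neither undefined nor an empty string), and the cluster names in the usage lines are placeholders.

```python
def pick(*values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    raise ValueError("pick(): no usable value given")


def prometheus_instance_name(k8s_config, cluster_name):
    """Use the explicit prometheus name if set, else fall back to k8s-<cluster>."""
    configured = k8s_config.get("prometheus", {}).get("name")
    return pick(configured, f"k8s-{cluster_name}")


# Hypothetical clusters: one without an override, one with an explicit name.
print(prometheus_instance_name({}, "staging"))
print(prometheus_instance_name({"prometheus": {"name": "k8s"}}, "main"))
```

With instances keyed by their prometheus name and pointing at a master API url, this fallback is no longer needed.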

lmata triaged this task as Medium priority. Oct 30 2024, 6:56 PM

Change #1104630 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: default to -15d for sidecar min_time

https://gerrit.wikimedia.org/r/1104630

Change #1104631 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: refactor common functionality

https://gerrit.wikimedia.org/r/1104631

Change #1104630 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: default to -15d for sidecar min_time

https://gerrit.wikimedia.org/r/1104630

Change #1104631 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: refactor common functionality

https://gerrit.wikimedia.org/r/1104631

Change #1104980 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] WIP prometheus instances

https://gerrit.wikimedia.org/r/1104980

Change #1108746 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: migrate ops instance to prometheus::instances

https://gerrit.wikimedia.org/r/1108746

Change #1108772 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] WIP: prometheus: k8s instances migration

https://gerrit.wikimedia.org/r/1108772

Change #1104980 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: deploy instances from a single configuration

https://gerrit.wikimedia.org/r/1104980

Change #1109031 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloud: fix tools prometheus

https://gerrit.wikimedia.org/r/1109031

Change #1109031 merged by David Caro:

[operations/puppet@production] cloud: fix tools prometheus

https://gerrit.wikimedia.org/r/1109031

Change #1109680 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add initial lv size to prometheus::instances

https://gerrit.wikimedia.org/r/1109680

Change #1108746 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: migrate ops instance to prometheus::instances

https://gerrit.wikimedia.org/r/1108746

Change #1108772 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: k8s instances migration to prometheus::instances

https://gerrit.wikimedia.org/r/1108772

Change #1109680 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add initial lv size to prometheus::instances

https://gerrit.wikimedia.org/r/1109680

fgiunchedi claimed this task.

This is done: we now configure all production prometheus instances from prometheus::instances.

Change #1111625 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: notify apache2 as needed

https://gerrit.wikimedia.org/r/1111625

Change #1111625 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: notify apache2 as needed

https://gerrit.wikimedia.org/r/1111625

Change #1112014 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: reverse proxy for instances belonging to the host' site too

https://gerrit.wikimedia.org/r/1112014

Change #1112014 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: reverse proxy for instances belonging to the host' site too

https://gerrit.wikimedia.org/r/1112014

Change #1112186 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: serve apache vhost on localhost too

https://gerrit.wikimedia.org/r/1112186

Change #1112186 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: serve apache vhost on localhost too

https://gerrit.wikimedia.org/r/1112186