
Move all o11y services to discovery.wmnet
Closed, Resolved (Public)

Description

We have a few hostnames in hieradata/common/profile/trafficserver/backend.yaml that should be moved to discovery records for easier operations (e.g. reimage/flip/etc).

The ultimate goal is to simplify operations for each service relative to the current status quo.

Services:

  • Grafana
  • Logstash
  • Pyrra
  • Prometheus
  • Thanos

Since different services require different strategies, the following sections outline the trade-offs and solutions on a per-service basis.

grafana

I think this is the trickiest of all. Ideally I (Filippo) would like a single patch or command to flip the active/standby grafana host. Note that whatever grafana.w.o points to should also be reflected in profile::grafana::active_host (and profile::grafana::standby_host) so that the "singleton" units (such as syncing ldap users) follow.

A potential solution could look like T357384.
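
To illustrate the idea of deriving the active/standby role from DNS rather than from a hand-maintained setting, here is a minimal Python sketch. It is not the approach taken in T357384 or in any of the patches on this task. The record name grafana.discovery.wmnet, the host names, and the helper names are assumptions made up for the example. The resolver is injectable so that the logic can be exercised without real DNS.

```python
import socket
from typing import Callable, Set


def resolve_addresses(name: str) -> Set[str]:
    """Return the set of IP addresses a DNS name currently resolves to."""
    infos = socket.getaddrinfo(name, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def is_active_host(
    local_fqdn: str,
    service_record: str,
    resolve: Callable[[str], Set[str]] = resolve_addresses,
) -> bool:
    """True when the service record shares an address with this host."""
    return bool(resolve(local_fqdn) & resolve(service_record))


# Hypothetical names and addresses, used only to exercise the logic offline.
FAKE_DNS = {
    "grafana.discovery.wmnet": {"10.0.0.1"},
    "grafana-a.example.wmnet": {"10.0.0.1"},
    "grafana-b.example.wmnet": {"10.0.0.2"},
}


def fake_resolve(name: str) -> Set[str]:
    """Stand-in resolver backed by the FAKE_DNS table above."""
    return FAKE_DNS[name]
```

With the fake table, is_active_host("grafana-a.example.wmnet", "grafana.discovery.wmnet", fake_resolve) returns True and the same call for grafana-b returns False. A "singleton" unit could be gated on such a check, so that flipping the record is the only change needed.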

logstash [done]

This ties in with moving the read path for logs: moving to a confctl-controlled discovery.wmnet record would make flipping datacenters for logstash quicker and in line with other services. What do you think @colewhite? - SGTM!
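
As a rough illustration of why an active-passive discovery record makes a datacenter flip a single operation, the following Python sketch models the pooled state as a plain dictionary. It is a toy model only and does not reflect confctl's actual interface or data format. The site names dc-a and dc-b and both function names are hypothetical.

```python
from typing import Dict


def active_site(pooled: Dict[str, bool]) -> str:
    """Return the single pooled site; active-passive requires exactly one."""
    active = [site for site, is_pooled in pooled.items() if is_pooled]
    if len(active) != 1:
        raise ValueError(f"expected exactly one pooled site, got {active}")
    return active[0]


def flip_active(pooled: Dict[str, bool], new_active: str) -> Dict[str, bool]:
    """Return a new state in which only new_active is pooled."""
    if new_active not in pooled:
        raise KeyError(f"unknown site: {new_active}")
    # Build a fresh dict so the caller's state is left untouched.
    return {site: site == new_active for site in pooled}


# Hypothetical starting state: dc-a serves traffic, dc-b is standby.
state = {"dc-a": True, "dc-b": False}
flipped = flip_active(state, "dc-b")
print(active_site(state), "->", active_site(flipped))  # prints "dc-a -> dc-b"
```

In this model, clients that address the service only through the discovery record need no change when the active site moves.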

prometheus [done]

We need to point to individual hosts because we're using mod_auth_cas. Once we move to oauth2-proxy for SSO authentication, we can replace those with prometheus.svc.SITE.wmnet. This is basically T326657.

pyrra (includes slo/slos) [done]
  • point to thanos-web.discovery.wmnet

Event Timeline

Change #1024808 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] trafficserver: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024808

Change #1024806 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] wmnet: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024806

Change #1025445 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] trafficserver: Add discovery entries for prometheus hosts

https://gerrit.wikimedia.org/r/1025445

Change #1025447 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] wmnet: Add discovery entries for the Prometheus hosts

https://gerrit.wikimedia.org/r/1025447

andrea.denisse changed the task status from Open to In Progress. Apr 30 2024, 4:37 PM

Change #1024806 merged by Andrea Denisse:

[operations/dns@master] wmnet: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024806

Change #1024808 merged by Andrea Denisse:

[operations/puppet@production] trafficserver: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024808

andrea.denisse changed the task status from In Progress to Open. May 15 2024, 7:50 PM

Changing the status to open as I'm currently working on T359255.

andrea.denisse renamed this task from Move o11y all services to discovery.wmnet to Move all o11y services to discovery.wmnet. May 17 2024, 4:53 PM
andrea.denisse changed the task status from Open to In Progress. May 20 2024, 6:17 PM

Change #1034626 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Use DNS for active/standby host detection for failovers

https://gerrit.wikimedia.org/r/1034626

(from the task description)
pyrra (includes slo/slos)
I believe we could point these to thanos-query.discovery.wmnet right away; what do you think @herron?

Yes, this should work. Let's try it and then if we find we need more granular control over it we can always split them up.

Change #1035541 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] trafficserver: point pyrra to thanos discovery record

https://gerrit.wikimedia.org/r/1035541

Change #1035541 merged by Herron:

[operations/puppet@production] trafficserver: point pyrra to thanos discovery record

https://gerrit.wikimedia.org/r/1035541

Change #1036315 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Use DNS for active/standby host detection for failovers

https://gerrit.wikimedia.org/r/1036315

Change #1039236 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] traffic: Add discovery entries for the pyrra, slo, and slos domains

https://gerrit.wikimedia.org/r/1039236

Change #1039236 merged by Andrea Denisse:

[operations/puppet@production] traffic: Add discovery entries for the pyrra, slo, and slos domains

https://gerrit.wikimedia.org/r/1039236

Change #1039406 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] conftool: Integrate logstash with active-passive configuration

https://gerrit.wikimedia.org/r/1039406

Change #1039882 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] discovery: Add metafo entry for logstash

https://gerrit.wikimedia.org/r/1039882

Change #1039887 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] traffic: Route logstash.w.o to logstash.discovery.wmnet

https://gerrit.wikimedia.org/r/1039887

Change #1039406 merged by Andrea Denisse:

[operations/puppet@production] conftool: Integrate logstash with active-passive configuration

https://gerrit.wikimedia.org/r/1039406

Change #1039882 merged by Andrea Denisse:

[operations/dns@master] discovery: Add metafo entry for logstash

https://gerrit.wikimedia.org/r/1039882

Change #1039887 merged by Andrea Denisse:

[operations/puppet@production] traffic: Route logstash.w.o to logstash.discovery.wmnet

https://gerrit.wikimedia.org/r/1039887

Change #1042299 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] logging: add logstash.discovery.wmnet to alt names

https://gerrit.wikimedia.org/r/1042299

Change #1042299 merged by Andrea Denisse:

[operations/puppet@production] logging: add logstash.discovery.wmnet to alt names

https://gerrit.wikimedia.org/r/1042299

@andrea.denisse FYI: I'm untangling this from the "Phase out cergen for Observability services" task (so that it no longer shows up in the dependency tree for the Puppet 5 shutdown).

Change #1025445 abandoned by Andrea Denisse:

[operations/puppet@production] trafficserver: Add discovery entries for prometheus hosts

https://gerrit.wikimedia.org/r/1025445

I'll close this task as resolved as all of our services use discovery records now.
I'll continue the work on grafana failovers in T357384. I moved my WIP patches regarding Grafana failovers to that task.

Change #1025447 abandoned by Andrea Denisse:

[operations/dns@master] wmnet: Add discovery entries for the Prometheus hosts

https://gerrit.wikimedia.org/r/1025447