Move all o11y services to discovery.wmnet
Open, In Progress, Needs Triage, Public

Description

We have a few hostnames in hieradata/common/profile/trafficserver/backend.yaml that should be moved to discovery records for easier operations (e.g. reimage/flip/etc).

The ultimate goal is to simplify operations for each service compared to the status quo.

Namely:

  • target: http://grafana-rw.wikimedia.org
    replacement: https://grafana1002.eqiad.wmnet
  • target: http://grafana-next-rw.wikimedia.org
    replacement: https://grafana2001.codfw.wmnet
  • target: http://grafana.wikimedia.org
    replacement: https://grafana1002.eqiad.wmnet
  • target: http://grafana-next.wikimedia.org
    replacement: https://grafana2001.codfw.wmnet
  • target: http://logstash.wikimedia.org
    replacement: https://kibana7.svc.eqiad.wmnet
  • target: http://prometheus-eqiad.wikimedia.org
    replacement: https://prometheus1005.eqiad.wmnet
  • target: http://prometheus-codfw.wikimedia.org
    replacement: https://prometheus2005.codfw.wmnet
  • target: http://pyrra.wikimedia.org
    replacement: http://titan1001.eqiad.wmnet
  • target: http://slo.wikimedia.org
    replacement: http://titan1001.eqiad.wmnet
  • target: http://slos.wikimedia.org
    replacement: http://titan1001.eqiad.wmnet

Since different services require different strategies, the following sections outline the trade-offs and solutions on a per-service basis.

grafana

This is the trickiest of all, I think. Ideally I (Filippo) would like a single patch or command to flip the active/standby grafana host. Note that whatever grafana.w.o points to should also be reflected in profile::grafana::active_host (and profile::grafana::standby_host) so that the "singleton" units (such as syncing ldap users) follow.

A potential solution could look like this:

  • introduce grafana.discovery.wmnet being a CNAME to the active host
  • the trafficserver configuration above points to grafana.discovery.wmnet
  • change the puppet logic that detects the active host so that grafana.discovery.wmnet gets resolved: if it points to the same address as the host puppet is running on, then we're on the active host, otherwise we're on the standby host
  • we also need to make sure we can serve grafana / grafana-next from any grafana host (right now we need to change profile::grafana::domain and profile::grafana::domainrw between codfw and eqiad when we flip). This should be doable by moving to an implementation where we have a set of "base" names (grafana, grafana-next) and the apache redirect rules handle a list of such names and their redirects (basically redirect to base + "-rw" as needed)
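The active/standby check from the list above can be sketched as follows. This is only an illustration in Python, not the actual puppet implementation: the function names, the injectable resolver, and the way local addresses are passed in are all assumptions made for the example.

```python
import socket


def resolve_addresses(name, resolver=socket.getaddrinfo):
    """Return the set of IP addresses that `name` resolves to."""
    # getaddrinfo returns tuples whose fifth element is the socket address;
    # the first item of that address is the IP string.
    return {info[4][0] for info in resolver(name, None)}


def grafana_role(discovery_name, local_addresses, resolver=socket.getaddrinfo):
    """Return "active" if the discovery record points at this host, else "standby".

    `local_addresses` (hypothetical input) is the list of IPs configured on
    the host the check runs on.
    """
    active = resolve_addresses(discovery_name, resolver)
    return "active" if active & set(local_addresses) else "standby"
```

A caller would pass "grafana.discovery.wmnet" plus the host's own addresses. A failed lookup raises socket.gaierror, so a real implementation would need to decide what role to assume in that case.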

In this scenario a codfw/eqiad grafana flip translates to a single DNS patch to move grafana.discovery.wmnet and grafana-next.discovery.wmnet as needed.
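The "base names" idea from the last bullet above could be expressed roughly like this. It is a hedged Python sketch of the name mapping only; the function name and the default domain argument are assumptions, and the real change would live in the puppet/apache configuration.

```python
def rw_redirects(base_names, domain="wikimedia.org"):
    """Map each base hostname to its "-rw" counterpart (base + "-rw")."""
    return {f"{base}.{domain}": f"{base}-rw.{domain}" for base in base_names}


# Every grafana host would carry the same list, so no per-site
# domain/domainrw settings need to change on a flip.
redirects = rw_redirects(["grafana", "grafana-next"])
```

Because the mapping is derived from the list of base names alone, it is identical on the codfw and eqiad hosts.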

A DNS patch is, I (Filippo) think, good enough for now given how infrequently we move grafana around. Alternatively, we can move grafana.discovery.wmnet to be controlled by conftool, in which case we can use confctl --object-type discovery to pool/depool.
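For the conftool alternative, a flip would amount to pooling one site and depooling the other. The sketch below only builds the command lines and runs nothing. The select/set syntax is assumed from how confctl is commonly used for other discovery records and should be checked against the conftool documentation. The record name "grafana" is hypothetical, since no such conftool object is described here.

```python
import shlex


def flip_commands(record, new_site, old_site):
    """Build confctl invocations that pool `new_site` and depool `old_site`."""
    def cmd(site, pooled):
        # Assumed selector/action syntax: dnsdisc=<record>,name=<site> set/pooled=<bool>
        return [
            "confctl", "--object-type", "discovery", "select",
            f"dnsdisc={record},name={site}",
            f"set/pooled={'true' if pooled else 'false'}",
        ]
    # Pool the new site first so the record is never left without a pooled site.
    return [cmd(new_site, True), cmd(old_site, False)]


for argv in flip_commands("grafana", "codfw", "eqiad"):
    print(shlex.join(argv))
```

Pooling before depooling is a design choice of this sketch. Whether an active/passive record accepts both sites pooled at once depends on how the record is configured.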

logstash

This ties in with moving the read path for logs. Moving to a confctl-controlled discovery.wmnet record would make flipping datacenters for logstash quicker and in line with other services too. What do you think @colewhite? - SGTM!

prometheus

We need to point to individual hosts because we're using mod_auth_cas. Once we move to oauth2-proxy for SSO authentication, we can replace those with prometheus.svc.SITE.wmnet. This is basically T326657.

pyrra (includes slo/slos)
  • point to thanos-web.discovery.wmnet

Event Timeline

Change #1024808 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] trafficserver: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024808

Change #1024806 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] wmnet: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024806

Change #1025445 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] trafficserver: Add discovery entries for prometheus hosts

https://gerrit.wikimedia.org/r/1025445

Change #1025447 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] wmnet: Add discovery entries for the Prometheus hosts

https://gerrit.wikimedia.org/r/1025447

andrea.denisse changed the task status from Open to In Progress. Apr 30 2024, 4:37 PM

Change #1024806 merged by Andrea Denisse:

[operations/dns@master] wmnet: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024806

Change #1024808 merged by Andrea Denisse:

[operations/puppet@production] trafficserver: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024808

andrea.denisse changed the task status from In Progress to Open. Wed, May 15, 7:50 PM

Changing the status to open as I'm currently working on T359255.

andrea.denisse renamed this task from Move o11y all services to discovery.wmnet to Move all o11y services to discovery.wmnet. Fri, May 17, 4:53 PM
andrea.denisse changed the task status from Open to In Progress. Mon, May 20, 6:17 PM

Change #1034626 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Use DNS for active/standby host detection for failovers

https://gerrit.wikimedia.org/r/1034626

(from the task description)
pyrra (includes slo/slos)
I believe we could point these to thanos-query.discovery.wmnet right away; what do you think @herron?

Yes, this should work. Let's try it and then if we find we need more granular control over it we can always split them up.

Change #1035541 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] trafficserver: point pyrra to thanos discovery record

https://gerrit.wikimedia.org/r/1035541

Change #1035541 merged by Herron:

[operations/puppet@production] trafficserver: point pyrra to thanos discovery record

https://gerrit.wikimedia.org/r/1035541

Change #1036315 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Use DNS for active/standby host detection for failovers

https://gerrit.wikimedia.org/r/1036315