Move all o11y services to discovery.wmnet
Open, In Progress, Needs Triage, Public

Authored By

	fgiunchedi
	Feb 1 2024, 11:54 AM

Description

We have a few hostnames in hieradata/common/profile/trafficserver/backend.yaml that should be moved to discovery records for easier operations (e.g. reimage/flip/etc).

The ultimate goal is to simplify operations wrt the current status quo for each service.

Namely:

- target: http://grafana-rw.wikimedia.org
  replacement: https://grafana1002.eqiad.wmnet
- target: http://grafana-next-rw.wikimedia.org
  replacement: https://grafana2001.codfw.wmnet
- target: http://grafana.wikimedia.org
  replacement: https://grafana1002.eqiad.wmnet
- target: http://grafana-next.wikimedia.org
  replacement: https://grafana2001.codfw.wmnet
- target: http://logstash.wikimedia.org
  replacement: https://kibana7.svc.eqiad.wmnet
- target: http://prometheus-eqiad.wikimedia.org
  replacement: https://prometheus1005.eqiad.wmnet
- target: http://prometheus-codfw.wikimedia.org
  replacement: https://prometheus2005.codfw.wmnet
- target: http://pyrra.wikimedia.org
  replacement: http://titan1001.eqiad.wmnet
- target: http://slo.wikimedia.org
  replacement: http://titan1001.eqiad.wmnet
- target: http://slos.wikimedia.org
  replacement: http://titan1001.eqiad.wmnet
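
To keep track of which rules still need work, the list above can be bucketed by the kind of record each replacement uses. This is only a rough sketch: the `classify` and `pending` helpers are hypothetical names made up for illustration, and the suffix checks are an assumption about how the hostnames are laid out, not an existing tool.

```python
from urllib.parse import urlsplit

# Mapping rules copied from the list above: (target, replacement).
MAPPINGS = [
    ("http://grafana-rw.wikimedia.org", "https://grafana1002.eqiad.wmnet"),
    ("http://grafana-next-rw.wikimedia.org", "https://grafana2001.codfw.wmnet"),
    ("http://grafana.wikimedia.org", "https://grafana1002.eqiad.wmnet"),
    ("http://grafana-next.wikimedia.org", "https://grafana2001.codfw.wmnet"),
    ("http://logstash.wikimedia.org", "https://kibana7.svc.eqiad.wmnet"),
    ("http://prometheus-eqiad.wikimedia.org", "https://prometheus1005.eqiad.wmnet"),
    ("http://prometheus-codfw.wikimedia.org", "https://prometheus2005.codfw.wmnet"),
    ("http://pyrra.wikimedia.org", "http://titan1001.eqiad.wmnet"),
    ("http://slo.wikimedia.org", "http://titan1001.eqiad.wmnet"),
    ("http://slos.wikimedia.org", "http://titan1001.eqiad.wmnet"),
]


def classify(replacement: str) -> str:
    """Hypothetical helper: bucket a replacement URL by record type."""
    host = urlsplit(replacement).hostname or ""
    if host.endswith(".discovery.wmnet"):
        return "discovery"
    if ".svc." in host:
        return "svc"
    # Anything else is assumed to be an individual host.
    return "host"


def pending(mappings):
    """Return the rules whose replacement is not a discovery record yet."""
    return [(t, r) for t, r in mappings if classify(r) != "discovery"]


if __name__ == "__main__":
    for target, replacement in pending(MAPPINGS):
        print(f"{classify(replacement):9} {target} -> {replacement}")
```
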

Since different services require different strategies, the following sections outline the trade offs and solutions on a per-service basis.

grafana

This is the trickiest of all, I think. Ideally I (Filippo) would like a single patch or command to flip the active/standby grafana host. Note that whatever grafana.w.o points to should also be reflected in profile::grafana::active_host (and profile::grafana::standby_host) for the "singleton" units (such as syncing ldap users) to follow.

A potential solution could look like this:

- introduce grafana.discovery.wmnet as a CNAME to the active host
- the trafficserver configuration above points to grafana.discovery.wmnet
- change the puppet logic that detects the active host so that grafana.discovery.wmnet gets resolved: if it points to the same address as the host puppet is running on, then we're on the active host, otherwise we're on the standby host
- we also need to make sure we can serve grafana / grafana-next from any grafana host (right now we need to change profile::grafana::domain and profile::grafana::domainrw between codfw and eqiad when we flip). This should be doable by moving to an implementation where we have a set of "base" names (grafana, grafana-next) and the apache redirect rules handle a list of such names and their redirects (basically redirect to base + "-rw" as needed)
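
The "base names" idea from the last item in the list above could be sketched roughly as follows. This is not the puppet/apache implementation: `BASE_NAMES`, `DOMAIN` and both helper functions are hypothetical, and when exactly a request gets redirected to the "-rw" name is left open here, as it is in the description.

```python
# Assumed inputs: the two base names mentioned above and the public domain.
BASE_NAMES = ["grafana", "grafana-next"]
DOMAIN = "wikimedia.org"


def redirect_table(base_names, domain=DOMAIN):
    """Map each base FQDN to its read-write counterpart (base + "-rw")."""
    return {f"{base}.{domain}": f"{base}-rw.{domain}" for base in base_names}


def served_names(base_names, domain=DOMAIN):
    """All names every grafana host would have to answer for."""
    names = []
    for base in base_names:
        names.append(f"{base}.{domain}")
        names.append(f"{base}-rw.{domain}")
    return names


if __name__ == "__main__":
    for src, dst in redirect_table(BASE_NAMES).items():
        print(f"{src} -> {dst}")
```

Because the table is derived from the list of base names alone, every host can be given the same list, and nothing per-site would need to change on a flip.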

In this scenario a codfw/eqiad grafana flip translates to a single DNS patch to move grafana.discovery.wmnet and grafana-next.discovery.wmnet as needed.

A DNS patch is, I (Filippo) think, good enough for now given how infrequently we move grafana around. Alternatively, we can move grafana.discovery.wmnet to be controlled by conftool, in which case we can use confctl --object-type discovery to pool/depool.

logstash

This ties in with moving the read path for logs: moving to a confctl-controlled discovery.wmnet record would make flipping datacenters for logstash quicker and in line with other services too. What do you think @colewhite? - SGTM!

prometheus

We need to point to individual hosts because we're using mod_auth_cas. Once we move to oauth2-proxy for SSO authentication, we can replace those with prometheus.svc.SITE.wmnet. This is basically T326657.

pyrra (includes slo/slos)

point to thanos-web.discovery.wmnet

Details

Subject	Repo	Branch	Lines +/-
grafana: Use DNS for active/standby host detection for failovers	operations/puppet	production	+2 -2
trafficserver: point pyrra to thanos discovery record	operations/puppet	production	+1 -1
grafana: Use DNS for active/standby host detection for failovers	operations/puppet	production	+40 -21
trafficserver: Add discovery entries for prometheus hosts	operations/puppet	production	+6 -6
trafficserver: Add discovery entries for grafana and grafana-next	operations/puppet	production	+7 -7
wmnet: Add discovery entries for the Prometheus hosts	operations/dns	master	+66 -66
wmnet: Add discovery entries for grafana and grafana-next	operations/dns	master	+4 -0

Related Objects

Status	Assigned	Task
Open	None	T330490 Next steps for Puppet 7
Open	None	T365798 Shutdown of Puppet 5 servers
Open	MoritzMuehlenhoff	T357750 Phase out cergen
Resolved	andrea.denisse	T360414 Phase out cergen for Observability services
In Progress	andrea.denisse	T356386 Move all o11y services to discovery.wmnet

Event Timeline

fgiunchedi created this task. · Feb 1 2024, 11:54 AM

Restricted Application added a subscriber: Aklapper. · Feb 1 2024, 11:54 AM

lmata edited projects, added Observability-Metrics; removed observability. · Feb 7 2024, 3:20 PM

fgiunchedi mentioned this in T357384: Simplify Grafana failovers. · Feb 13 2024, 9:23 AM

lmata added a project: SRE Observability (FY2023/2024-Q4). · Feb 19 2024, 3:20 PM

andrea.denisse subscribed. · Mar 12 2024, 4:00 PM

colewhite subscribed. · Mar 12 2024, 4:01 PM

herron subscribed. · Mar 12 2024, 4:12 PM

andrea.denisse added a parent task: T360414: Phase out cergen for Observability services. · Apr 4 2024, 5:53 PM

fgiunchedi mentioned this in T360414: Phase out cergen for Observability services. · Apr 10 2024, 9:24 AM

andrea.denisse claimed this task. · Apr 29 2024, 6:31 PM

andrea.denisse updated the task description. · Apr 29 2024, 6:37 PM

Change #1024808 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] trafficserver: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024808

Change #1024806 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] wmnet: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024806

Change #1025445 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] trafficserver: Add discovery entries for prometheus hosts

https://gerrit.wikimedia.org/r/1025445

Change #1025447 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] wmnet: Add discovery entries for the Prometheus hosts

https://gerrit.wikimedia.org/r/1025447

andrea.denisse changed the task status from Open to In Progress. · Apr 30 2024, 4:37 PM

Change #1024806 merged by Andrea Denisse:

[operations/dns@master] wmnet: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024806

Change #1024808 merged by Andrea Denisse:

[operations/puppet@production] trafficserver: Add discovery entries for grafana and grafana-next

https://gerrit.wikimedia.org/r/1024808

andrea.denisse updated the task description. · May 2 2024, 6:50 PM

fgiunchedi updated the task description. · Mon, May 13, 10:31 AM

fgiunchedi updated the task description.

colewhite updated the task description. · Mon, May 13, 3:23 PM

Changing the status to open as I'm currently working on T359255.

andrea.denisse renamed this task from Move o11y all services to discovery.wmnet to Move all o11y services to discovery.wmnet. · Fri, May 17, 4:53 PM

andrea.denisse changed the task status from Open to In Progress. · Mon, May 20, 6:17 PM

Change #1034626 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Use DNS for active/standby host detection for failovers

https://gerrit.wikimedia.org/r/1034626

andrea.denisse updated the task description. · Wed, May 22, 12:07 AM

andrea.denisse updated the task description. · Wed, May 22, 2:52 PM

(from the task description)
pyrra (includes slo/slos)
I believe we could point these to thanos-query.discovery.wmnet right away; what do you think @herron ?

Yes, this should work. Let's try it and then if we find we need more granular control over it we can always split them up.

Change #1035541 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] trafficserver: point pyrra to thanos discovery record

https://gerrit.wikimedia.org/r/1035541

fgiunchedi updated the task description. · Fri, May 24, 9:40 AM

Change #1035541 merged by Herron: