We have a few hostnames in hieradata/common/profile/trafficserver/backend.yaml that should be moved to discovery records for easier operations (e.g. reimage/flip/etc).
The ultimate goal is to make operations simpler than the current status quo for each service.
Namely:
- target: http://grafana-rw.wikimedia.org replacement: https://grafana1002.eqiad.wmnet
- target: http://grafana-next-rw.wikimedia.org replacement: https://grafana2001.codfw.wmnet
- target: http://grafana.wikimedia.org replacement: https://grafana1002.eqiad.wmnet
- target: http://grafana-next.wikimedia.org replacement: https://grafana2001.codfw.wmnet
- target: http://logstash.wikimedia.org replacement: https://kibana7.svc.eqiad.wmnet
- target: http://prometheus-eqiad.wikimedia.org replacement: https://prometheus1005.eqiad.wmnet
- target: http://prometheus-codfw.wikimedia.org replacement: https://prometheus2005.codfw.wmnet
- target: http://pyrra.wikimedia.org replacement: http://titan1001.eqiad.wmnet
- target: http://slo.wikimedia.org replacement: http://titan1001.eqiad.wmnet
- target: http://slos.wikimedia.org replacement: http://titan1001.eqiad.wmnet
Since different services require different strategies, the following sections outline the trade-offs and solutions on a per-service basis.
grafana
This is the trickiest of all, I think. Ideally I (Filippo) would like a single patch or command to flip the active/standby grafana host. Note that whatever grafana.w.o points to should also be reflected in profile::grafana::active_host (and profile::grafana::standby_host) so that the "singleton" units (such as syncing ldap users) follow.
A potential solution could look like this:
- introduce grafana.discovery.wmnet being a CNAME to the active host
- the trafficserver configuration above points to grafana.discovery.wmnet
- change the puppet logic that detects the active host so that it resolves grafana.discovery.wmnet: if the record points to the same address as the host puppet is running on, then we're on the active host, otherwise we're on the standby host
- we also need to make sure we can serve grafana / grafana-next from any grafana host (right now we need to change profile::grafana::domain and profile::grafana::domainrw between codfw and eqiad when we flip). This should be doable by moving to an implementation where we have a set of "base" names (grafana, grafana-next) and the apache redirect rules handle a list of such names and their redirects (basically redirect to base + "-rw" as needed)
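The active-host detection described in the list above could be sketched roughly as follows. This is Python rather than Puppet, purely to illustrate the comparison; the function names and the injectable resolver are hypothetical and not part of any existing module.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_addresses(name: str) -> Set[str]:
    """Return every IP address `name` resolves to (CNAMEs are followed by the resolver)."""
    infos = socket.getaddrinfo(name, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def is_active_host(
    discovery_name: str,
    local_addresses: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_addresses,
) -> bool:
    """True when the discovery record points at one of this host's own addresses."""
    return bool(resolver(discovery_name) & set(local_addresses))
```

On a grafana host this would be called with grafana.discovery.wmnet and the host's own addresses; a True result means the "singleton" units should run there. The resolver is a parameter only so the logic can be exercised without real DNS.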
In this scenario a codfw/eqiad grafana flip translates to a single DNS patch to move grafana.discovery.wmnet and grafana-next.discovery.wmnet as needed.
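As a rough illustration of the "base names" idea, the mapping from each public name to its "-rw" counterpart can be derived from the list of base names alone, so every grafana host can carry the same configuration. The helper below is a hypothetical sketch, not existing puppet code; the domain default is taken from the hostnames listed at the top of this task.

```python
from typing import Dict, List


def rw_names(base_names: List[str], domain: str = "wikimedia.org") -> Dict[str, str]:
    """Map each public name (base.domain) to its read-write name (base-rw.domain)."""
    return {f"{base}.{domain}": f"{base}-rw.{domain}" for base in base_names}


# The same input works on every grafana host, regardless of datacenter.
redirects = rw_names(["grafana", "grafana-next"])
```

The resulting pairs are what the apache redirect rules would iterate over, instead of a per-datacenter profile::grafana::domain / profile::grafana::domainrw setting.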
I (Filippo) think a DNS patch is good enough for now given how infrequently we move grafana around. Alternatively, we can move grafana.discovery.wmnet to be controlled by conftool, in which case we can use confctl --object-type discovery to pool/depool.
logstash
This ties in with moving the read path for logs: moving to a confctl-controlled discovery.wmnet record would make flipping datacenters for logstash quicker and in line with other services too. What do you think @colewhite? - SGTM!
prometheus
We need to point to individual hosts because we're using mod_auth_cas. When we move to oauth2-proxy for SSO authentication then we can replace those with prometheus.svc.SITE.wmnet. This is basically T326657
pyrra (includes slo/slos)
- point to thanos-web.discovery.wmnet