Simplify Grafana failovers
Open, Medium, Public

Description

During the parent task we ran into an issue where Grafana failovers also require moving hiera variables around, which isn't expected, and we've already fallen into this pitfall twice.

The procedure is now documented on wikitech thanks to @andrea.denisse, which is good, and we should keep working on making switchovers even simpler.

This is the list of actions I believe we should take:

  • Move grafana to discovery records (T356386)
  • Make sure we serve all grafana vhosts from all hosts, instead of the grafana / grafana-next distinction we have now (right now we need to change profile::grafana::domain and profile::grafana::domainrw between codfw and eqiad when we flip). This should be doable by moving to an implementation with a set of "base" names (grafana, grafana-next), where the apache redirect rules handle a list of such names and their redirects (basically redirect to base + "-rw" as needed)
  • Protect the standby host(s) by refusing write actions to grafana API at the apache level
  • Make sure the sync timers follow the active/passive host, bonus points if this happens without a patch to puppet. One option: change the puppet logic to detect the active host by resolving grafana.discovery.wmnet; if it points to the same address as the host puppet is running on, we're on the active host, otherwise we're on a standby host
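
The detection logic in the last bullet, and the base + "-rw" naming from the vhost bullet, could look roughly like the sketch below. This is only an illustration in Python, not the actual puppet implementation: the function names (resolve_addresses, host_role, rw_names), the way local addresses are passed in, and the "unknown" result on a failed lookup are all assumptions made for the example.

```python
import socket


def resolve_addresses(name):
    """Return the set of IP addresses (v4 and v6) that `name` resolves to."""
    infos = socket.getaddrinfo(name, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return {info[4][0] for info in infos}


def host_role(discovery_name, local_addresses, resolver=resolve_addresses):
    """Classify this host as "active", "standby" or "unknown".

    The host is active when the discovery record points at one of its own
    addresses. A failed lookup returns "unknown" so the caller can decide
    what to do (hypothetical choice, not something the task specifies).
    """
    try:
        active_addresses = resolver(discovery_name)
    except socket.gaierror:
        return "unknown"
    if active_addresses & set(local_addresses):
        return "active"
    return "standby"


def rw_names(base_names):
    """Map each "base" vhost name to its read-write counterpart (base + "-rw")."""
    return {base: base + "-rw" for base in base_names}


if __name__ == "__main__":
    # Example addresses only; a real caller would pass the host's own IPs.
    print(host_role("grafana.discovery.wmnet", {"192.0.2.10"}))
    print(rw_names(["grafana", "grafana-next"]))
```

The resolver is injectable so the comparison can be exercised without real DNS. Whatever consumes the role (sync timers, the apache write protection) would still need to decide how to treat the "unknown" case.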

In this scenario a codfw/eqiad grafana flip translates to a single DNS patch to move grafana.discovery.wmnet and grafana-next.discovery.wmnet as needed.

I (Filippo) think a DNS patch is good enough for now, given how infrequently we move grafana around. Alternatively, we can move grafana.discovery.wmnet to be controlled by conftool, in which case we can use confctl --object-type discovery to pool/depool.

Event Timeline

lmata triaged this task as Medium priority. Oct 31 2024, 9:27 PM
lmata moved this task from Inbox to Prioritized on the Observability-Metrics board.

Change #1034626 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Use DNS for active/standby host detection for failovers

https://gerrit.wikimedia.org/r/1034626

Change #1036315 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Use DNS for active/standby host detection for failovers

https://gerrit.wikimedia.org/r/1036315