In T257412 a lot of work was done to add an initial form of redundancy for Hive, Presto, the Meta database, Oozie, etc.
Some info is collected in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Coordinator
The current status is the following:
- We have two coordinators, an-coord1001 and an-coord1002, both running the Hive daemons (server2 and metastore) and mariadb
- We deployed analytics-hive.eqiad.wmnet, an endpoint (a DNS CNAME under the hood) that lets us fail over Hive traffic transparently to users (caveat: see the db config in the wiki page mentioned above). This allows us to avoid restarting a ton of jobs when we need to migrate the traffic.
- an-coord1002's mariadb is a replica of an-coord1001's (with monitoring, alarming, etc.).
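To illustrate why the endpoint helps, here is a minimal sketch of a client building its Hive connection string from the service name instead of a coordinator hostname. The helper name `hive_jdbc_url`, the port 10000 (the HiveServer2 default) and the `default` database are assumptions for illustration, not taken from our actual config.

```python
# Hypothetical helper: clients point at the service endpoint, so a failover
# only needs a DNS change and no per-job config change.
HIVE_ENDPOINT = "analytics-hive.eqiad.wmnet"


def hive_jdbc_url(host: str = HIVE_ENDPOINT, port: int = 10000,
                  database: str = "default") -> str:
    """Build a HiveServer2 JDBC URL (port and database are assumed defaults)."""
    return f"jdbc:hive2://{host}:{port}/{database}"


if __name__ == "__main__":
    print(hive_jdbc_url())
```

A job configured this way keeps the same URL whether the CNAME points to an-coord1001 or an-coord1002.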
The goal for this initial work was to reduce the impact of hw failures on an-coord1001, from hours of pain to something that a single SRE could manage with some dns/puppet patches. Some documentation was added in Wikitech about what to do.
There are some question marks to follow up on (in my opinion):
- The Analytics-Meta mariadb instance is running multiple databases, which is not what Data Persistence suggests (better one instance per db, for isolation and better replication). We could think about moving the databases off the coordinators to a dedicated node, but we'd need to keep (in my opinion) the same structure that we have now (active dbs on one node, a replica on another one, and backups on db1108). Some sort of automatic failover could be added as well, see for example what we do for dbproxies in production. Regardless of where we keep the databases, we should have an endpoint like analytics-meta.eqiad.wmnet to avoid hardcoding an-coord1001.eqiad.wmnet in multiple places.
- Think about where future databases will run. Airflow is being discussed in T272973, and if we go for multi-tenancy/stacks it may be a mistake (in my opinion) to keep adding databases on an-coord1001 (as opposed to having other dedicated nodes replicating to db1108).
- The Presto coordinator on an-coord1001 would probably need something like analytics-presto.eqiad.wmnet for a better and more transparent failover for users.
- Oozie doesn't support running as Active/Standby, and hopefully we'll switch to Airflow, so it is fine in my opinion to just move it (via puppet) to a different host when needed (like an-coord1001 -> an-coord1002).
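As a starting point for the "avoid hardcoding" follow-up, a rough sketch of how one could scan config text for coordinator hostnames that should be replaced by a service endpoint. The function name, the regex and the sample config lines (including the ports and the database name) are made up for illustration and are not copied from any real file.

```python
import re

# Matches an-coord1001 / an-coord1002, with or without the domain suffix.
HARDCODED_COORDINATOR = re.compile(r"\ban-coord100[12](?:\.eqiad\.wmnet)?\b")


def find_hardcoded_coordinators(text):
    """Return (line_number, hostname) pairs for every hardcoded coordinator."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in HARDCODED_COORDINATOR.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Invented sample: line 1 hardcodes a host, line 2 uses the endpoint.
    sample = (
        "jdbc:mysql://an-coord1001.eqiad.wmnet:3306/hive_metastore\n"
        "hive.metastore.uris=thrift://analytics-hive.eqiad.wmnet:9083\n"
    )
    for lineno, host in find_hardcoded_coordinators(sample):
        print(f"line {lineno}: hardcoded {host}")
```

Something like this could run over the puppet and job config repos to get a list of places to update once an endpoint like analytics-meta.eqiad.wmnet exists.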