Analytics coordinator failover improvements
Closed, Resolved (Public)

Description

In T257412 a lot of work was done to add initial redundancy support for Hive, Presto, the meta-database, Oozie, etc.

Some info is collected at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Coordinator

The current status is the following:

  • We have two coordinators, an-coord1001 and an-coord1002, both running the Hive daemons (server2 and metastore) and mariadb
  • We deployed analytics-hive.eqiad.wmnet, an endpoint (a DNS CNAME under the hood) that lets us fail over Hive traffic transparently for users (caveat: see the db config in the wiki page mentioned above). This allows us to avoid restarting a ton of jobs when we need to migrate the traffic.
  • an-coord1002's mariadb is a replica of an-coord1001's (with monitoring/alarming/etc..)
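
As a rough illustration of why the alias avoids job restarts, a client can build its connection string from the endpoint rather than from a coordinator hostname. This is only a sketch: `hive_jdbc_url` is a hypothetical helper, port 10000 is assumed because it is HiveServer2's default, and any authentication parameters the cluster requires are left out.

```python
# Hypothetical helper: builds a HiveServer2 JDBC URL from the service alias.
HIVE_ALIAS = "analytics-hive.eqiad.wmnet"  # endpoint named in this task
HIVE_PORT = 10000  # assumption: HiveServer2's default port


def hive_jdbc_url(host: str = HIVE_ALIAS, port: int = HIVE_PORT,
                  database: str = "default") -> str:
    """Return a JDBC URL that targets the alias, not a specific coordinator."""
    return f"jdbc:hive2://{host}:{port}/{database}"
```

Because the URL never names an-coord1001 or an-coord1002, repointing the CNAME moves new connections without any change to job configuration.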

The goal for this initial work was to reduce the impact of hardware failures on an-coord1001, from hours of pain to something that a single SRE could manage with some DNS/puppet patches. Some documentation about what to do was added to Wikitech.

There are some question marks to follow up on (in my opinion):

  • The Analytics-Meta mariadb instance is running multiple databases, which is not what Data Persistence suggests (one instance per db is better, for isolation and better replication). We could think about moving the databases off the coordinators to a dedicated node, but we'd need to keep (in my opinion) the same structure that we have now: active dbs on one node, a replica on another one, and backups on db1108. Some sort of automatic failover could be added as well; see for example what we do for dbproxies in production. Regardless of where we keep the databases, we should have an endpoint like analytics-meta.eqiad.wmnet to avoid hardcoding an-coord1001.eqiad.wmnet in multiple places.
  • Think about where future databases will run. Airflow is being discussed in T272973, and if we go for multi-tenancy/stacks it may be a mistake (in my opinion) to keep adding databases on an-coord1001 (as opposed to having other dedicated nodes replicating to db1108).
  • The Presto coordinator on an-coord1001 would probably need something like analytics-presto.eqiad.wmnet for a better and more transparent failover for users.
  • Oozie doesn't support being Active/Standby, and hopefully we'll switch to Airflow, so it is fine in my opinion to just move it (via puppet) to different hosts when needed (like an-coord1001 -> an-coord1002)
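
To get a feel for how much hardcoding an endpoint like analytics-meta.eqiad.wmnet would replace, a small scan over configuration text could look like the sketch below. The function name and the sample config are hypothetical, and the pattern assumes coordinator hosts are always named an-coordNNNN.eqiad.wmnet as in this task.

```python
import re

# Assumption: coordinator hosts follow the an-coordNNNN.eqiad.wmnet pattern
# seen in this task.
HARDCODED_HOST = re.compile(r"\ban-coord\d{4}\.eqiad\.wmnet\b")


def find_hardcoded_hosts(text: str) -> list[tuple[int, str]]:
    """Return (line_number, hostname) pairs for each hardcoded coordinator."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in HARDCODED_HOST.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Hypothetical config snippet, for illustration only.
sample = (
    "jdbc:mysql://an-coord1001.eqiad.wmnet:3306/hive_metastore\n"
    "host = analytics-meta.eqiad.wmnet\n"
)
print(find_hardcoded_hosts(sample))
```

Only the first line of the sample is reported, since the second one already goes through the proposed endpoint.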

Event Timeline

Milimetric lowered the priority of this task from High to Medium.
BTullis claimed this task.
BTullis subscribed.

I believe that we can now close this ticket, given the work that we have undertaken since it was created.

We now have two Hadoop coordinators, an-coord1003 and an-coord1004.
They use the same puppet role.

The only services running on them are:

  • hive-server2
  • hive-metastore
  • presto-server (acting in the coordinator role)

Both hive and presto now use DNS aliases (analytics-hive.eqiad.wmnet and analytics-presto.eqiad.wmnet) to determine which of the hosts is the active one.
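
One way to check which coordinator an alias currently points at is to ask the resolver for the canonical name behind it. The sketch below uses Python's `socket.gethostbyname_ex`; `active_host` is a hypothetical helper, and the result depends on the local resolver configuration, so treat it as a quick check rather than a source of truth.

```python
import socket
from typing import Callable, List, Tuple

# Shape of socket.gethostbyname_ex: (canonical_name, aliases, ip_addresses).
Resolver = Callable[[str], Tuple[str, List[str], List[str]]]


def active_host(alias: str,
                resolver: Resolver = socket.gethostbyname_ex) -> str:
    """Return the canonical hostname that the alias currently resolves to."""
    canonical, _aliases, _addresses = resolver(alias)
    return canonical
```

The resolver is injectable so the function can be exercised without network access. From a host that can resolve eqiad.wmnet names, `active_host("analytics-presto.eqiad.wmnet")` would be expected to return whichever coordinator is currently active.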

The MariaDB databases have now been moved to their own hosts, an-mariadb100[1-2]. I think that we should address issues around MariaDB role switching/failover in a separate ticket.

Similarly, we may want to investigate running a true high-availability presto cluster using a disaggregated coordinator configuration, but I think that should be distinct from the work that was carried out under this ticket.