Analytics coordinator failover improvements
Closed, Resolved · Public

Assigned To

Authored By

	elukey
	Apr 22 2021, 7:12 AM

Description

In T257412 a lot of work was done to add initial support for redundancy for Hive, Presto, the meta-database, Oozie, etc.

Some info is collected at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Coordinator

The current status is the following:

We have two coordinators, an-coord1001 and an-coord1002, both running the Hive daemons (server2 and metastore) and mariadb.
We deployed analytics-hive.eqiad.wmnet, an endpoint (a DNS CNAME under the hood) that lets us fail over Hive traffic transparently to the users (caveat: see the db config in the wiki page mentioned above). This allows us to avoid restarting a ton of jobs when we need to migrate the traffic.
an-coord1002's mariadb is a replica of an-coord1001's (with monitoring, alarming, etc.).

The goal for this initial work was to reduce the impact of hw failures on an-coord1001, from hours of pain to something that a single SRE could manage with some dns/puppet patches. Some documentation was added in Wikitech about what to do.

There are some question marks to follow up on (in my opinion):

The Analytics-Meta mariadb instance is running multiple databases, which is not what Data Persistence suggests (one instance per db is better, for isolation and better replication). We could think about moving the databases off the coordinators to a dedicated node, but we'd need to keep (in my opinion) the same structure that we have now (active dbs on one node, a replica on another one, and backups on db1108). Some sort of automatic failover could be added as well, see for example what we do for dbproxies in production. Regardless of where we keep the databases, we should have an endpoint like analytics-meta.eqiad.wmnet to avoid hardcoding an-coord1001.eqiad.wmnet in multiple places.

Think about where future databases will run. Airflow is being discussed in T272973, and if we go for multi-tenancy/stacks it may be a mistake (in my opinion) to keep adding databases on an-coord1001 (as opposed to having other dedicated nodes replicating to db1108).

The Presto coordinator on an-coord1001 would probably need something like analytics-presto.eqiad.wmnet for a better and more transparent failover for users.

Oozie doesn't support running as Active/Standby, and hopefully we'll switch to Airflow, so it is fine in my opinion to just move it (via puppet) to a different host when needed (like an-coord1001 -> an-coord1002).
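
As a minimal sketch of the endpoint idea from the first point above (not an existing configuration): clients could build their database connection strings from a single alias instead of a host name, so a failover only needs a DNS change. The alias analytics-meta.eqiad.wmnet is the example name suggested above; the function name, the mysql:// URL shape and port 3306 are assumptions made for illustration.

```python
from urllib.parse import quote_plus

# Hypothetical alias, following the naming proposed in this task.
ANALYTICS_META_ENDPOINT = "analytics-meta.eqiad.wmnet"


def mariadb_url(database, user, password,
                host=ANALYTICS_META_ENDPOINT, port=3306):
    """Build a connection URL that points at the alias, not at a host.

    The credentials are percent-encoded so that characters such as '@'
    or ':' in a password cannot break the URL structure.
    """
    return (
        f"mysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )
```

With something like this, every consumer would reference the alias only, and an-coord1001.eqiad.wmnet would no longer need to appear in per-service configuration.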

Related Objects

Status	Assigned	Task
Resolved	BTullis	T280905 Analytics coordinator failover improvements
Resolved	BTullis	T273642 Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover
Resolved	BTullis	T284150 Bring an-mariadb100[12] into service
Duplicate	None	T279440 Data drifts between superset_production on an-coord1001 and db1108
		Unknown Object (Task)
Resolved	RobH	T289632 Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet
Resolved	BTullis	T295312 Recreate analytics-meta replica on db1108 from master on an-coord1001
Resolved	BTullis	T295551 Validate integrity of the failover replica database an-coord1002 against its primary an-coord1001
Declined	BTullis	T287967 Use corosync and pacemaker for presto coordinator active/standby configuration
Declined	BTullis	T287864 Deploy an-test-coord1002 to facilitate failover testing of analytics coordinator role
Resolved	BTullis	T289664 Site: Eqiad - 1 VM request for analytics test cluster - coordinator replica role
		Unknown Object (Task)
Resolved	BTullis	T293938 (Need By: TBD) rack/setup/install an-test-coord1002

Event Timeline

elukey created this task. Apr 22 2021, 7:12 AM

Restricted Application added a subscriber: Aklapper. Apr 22 2021, 7:12 AM

elukey added a subtask: T273642: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover. Apr 22 2021, 7:12 AM

elukey added a subscriber: JAllemandou.

fdans moved this task from Incoming to Operational Excellence on the Analytics board. May 3 2021, 3:38 PM

Milimetric triaged this task as High priority. May 10 2021, 3:49 PM

Milimetric lowered the priority of this task from High to Medium.

Ottomata added a subtask: T279440: Data drifts between superset_production on an-coord1001 and db1108. Jun 2 2021, 2:00 PM

Ottomata mentioned this in T284150: Bring an-mariadb100[12] into service. Jun 2 2021, 2:09 PM

BTullis mentioned this in T273642: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover. Aug 6 2021, 1:48 PM

BTullis added a subtask: T287864: Deploy an-test-coord1002 to facilitate failover testing of analytics coordinator role.

BTullis closed subtask T273642: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover as Resolved. Aug 23 2021, 2:42 PM

odimitrijevic added a project: Data-Engineering. Jan 6 2022, 1:57 AM

odimitrijevic moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.

odimitrijevic removed a project: Analytics. Jan 12 2022, 12:30 AM

BTullis closed subtask T287967: Use corosync and pacemaker for presto coordinator active/standby configuration as Declined. Apr 13 2022, 3:33 PM

BTullis mentioned this in T336062: Decommission an-test-coord1002. May 5 2023, 3:05 PM

BTullis closed subtask T287864: Deploy an-test-coord1002 to facilitate failover testing of analytics coordinator role as Declined.

JArguello-WMF moved this task from Ops Week to Event Platform Backlog on the Data-Engineering board. Jun 29 2023, 11:15 PM

BTullis edited projects, added Data-Platform-SRE; removed Data-Engineering. Jul 14 2023, 11:49 PM

BTullis closed subtask T284150: Bring an-mariadb100[12] into service as Resolved. Nov 17 2023, 2:31 PM

Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board. Dec 7 2023, 1:55 PM

I believe that we can now close this ticket, given the work that we have undertaken since it was created.

We now have two Hadoop coordinators, an-coord1003 and an-coord1004.
They use the same puppet role.

The only services running on them are:

hive-server2
hive-metastore
presto-server (acting in the coordinator role)

Both hive and presto now use DNS aliases (analytics-hive.eqiad.wmnet and analytics-presto.eqiad.wmnet) to determine which of the hosts is the active one.
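
For illustration only, here is a small sketch of how one might check which coordinator an alias currently points to. It assumes the alias is a plain CNAME and that the local resolver reports the CNAME target as the canonical name. The function names and the injectable resolver parameter are hypothetical and are not part of any existing tooling.

```python
import socket


def active_host(alias, resolver=socket.gethostbyname_ex):
    """Return the canonical hostname that `alias` currently resolves to.

    socket.gethostbyname_ex returns (canonical_name, aliases, addresses);
    for a CNAME the canonical name is normally the CNAME target, although
    local resolver configuration (e.g. /etc/hosts) can change that.
    """
    canonical, _aliases, _addresses = resolver(alias)
    return canonical.rstrip(".")


def is_active(host, alias, resolver=socket.gethostbyname_ex):
    """Return True if `alias` currently points at `host` (case-insensitive)."""
    return active_host(alias, resolver).lower() == host.lower()
```

Calling active_host("analytics-hive.eqiad.wmnet") from a host inside the network would then name whichever of an-coord1003 or an-coord1004 the alias targets at that moment. The resolver argument exists only so that the lookup can be replaced in tests.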

The MariaDB databases have now been moved to their own hosts: an-mariadb100[1-2]. I think that we should address issues around MariaDB role switching/failover in a separate ticket.

Similarly, we may want to investigate running a true high-availability presto cluster using a disaggregated coordinator configuration, but I think that should be distinct from the work that was carried out under this ticket.

I have created T360769: Investigate high-availability and managed failover mechanisms for the analytics_meta MariaDB instances to track the remaining MariaDB work.

I have created T360771: Consider running presto with disaggregated coordinators to facilitate routine maintenance to track the work on disaggregated presto coordinators.
