Consider running presto with disaggregated coordinators to facilitate routine maintenance
Open, High, Public

Assigned To

None

Authored By

	BTullis
	Mar 22 2024, 12:47 PM

Description

We currently run presto in a mode in which the coordinator role is a single point of failure.
Although we run two instances of the presto coordinator process, each of them is unaware of the other and believes that it alone knows the true state of the presto cluster.

All worker nodes register with a single coordinator (or discovery server), which we set to analytics-presto.eqiad.wmnet.
This is in fact a DNS CNAME that points to either an-coord1003 or an-coord1004.
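
As a minimal sketch of how to check which coordinator the alias currently points to, the snippet below resolves the name and returns the canonical host name. The function name `active_coordinator` and the injectable `resolve` parameter are illustrative, not part of any existing tooling, and it assumes the local resolver reports the CNAME target as the canonical name.

```python
import socket

# The alias that all workers are configured to register with.
ALIAS = "analytics-presto.eqiad.wmnet"


def active_coordinator(alias=ALIAS, resolve=socket.gethostbyname_ex):
    """Return the canonical host name that the alias currently resolves to.

    socket.gethostbyname_ex returns (canonical_name, alias_list, ip_list);
    for a CNAME record the canonical name is normally the CNAME target.
    The resolve parameter exists only so the lookup can be stubbed out.
    """
    canonical_name, _aliases, _addresses = resolve(alias)
    return canonical_name


if __name__ == "__main__":
    print(active_coordinator())
```

Running this before and after a DNS change would show whether the alias has moved, although the workers would still need to re-register with the new coordinator.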

When we wish to take down the active coordinator for maintenance, we have to change the DNS alias and then issue a full cluster restart to force the workers to re-register with the replacement coordinator. This causes downtime for the cluster.

A more sophisticated configuration is to use disaggregated coordinators, which share a common view of the cluster, so that any of them can be used.

However, deploying this configuration requires the use of a new presto component called the resource manager.
We have not yet decided how and where these resource manager instances should run.
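
To give a rough idea of what the three roles might look like, here is a sketch that renders a `config.properties` fragment for each role. The property names are as I recall them from the upstream Presto deployment documentation for disaggregated coordinators, so they should be verified against the version we run. The discovery URI is a hypothetical placeholder, since we have not decided where the resource managers would live.

```python
# Hypothetical placeholder: the resource manager host (or VIP) is undecided.
DISCOVERY_URI = "http://resource-manager.example.net:8080"

# Settings assumed to be shared by every node in a disaggregated cluster.
COMMON = {
    "resource-manager-enabled": "true",
    "discovery.uri": DISCOVERY_URI,
    "http-server.http.port": "8080",
}

# Per-role settings (assumed from the upstream docs; verify before use).
ROLE_OVERRIDES = {
    "resource-manager": {
        "resource-manager": "true",
        "coordinator": "false",
        "node-scheduler.include-coordinator": "false",
        "discovery-server.enabled": "true",
        "thrift.server.port": "8081",
    },
    "coordinator": {
        "coordinator": "true",
        "node-scheduler.include-coordinator": "false",
    },
    "worker": {
        "coordinator": "false",
    },
}


def render_config(role):
    """Return the config.properties text for the given role."""
    props = {**COMMON, **ROLE_OVERRIDES[role]}
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items())) + "\n"


if __name__ == "__main__":
    for role in ROLE_OVERRIDES:
        print(f"# {role}")
        print(render_config(role))
```

The point of the sketch is that workers and coordinators would all point `discovery.uri` at the resource manager rather than at a single coordinator, which is what would remove the need for a full cluster restart when one coordinator is taken down.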

Acceptance criteria

Evaluate whether or not the disaggregated coordinator setup is likely to be valuable for us

Related Objects

Mentioned In: T280905: Analytics coordinator failover improvements

Event Timeline

BTullis created this task. Mar 22 2024, 12:47 PM

Restricted Application added a subscriber: Aklapper. Mar 22 2024, 12:47 PM

BTullis mentioned this in T280905: Analytics coordinator failover improvements. Mar 22 2024, 12:48 PM

Aklapper added a project: Data-Platform-SRE. Mar 25 2024, 9:45 AM

Gehel triaged this task as High priority. Mar 27 2024, 3:24 PM

Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.