Establish active/active multi-dc support for Toolhub
Open, Low, Public

Description

In the current (2021) Wikimedia production environment, it is trivial to deploy the Toolhub Django application to Kubernetes clusters in both the eqiad and codfw datacenters. MariaDB and Elasticsearch services are also available in both datacenters for Toolhub to use. These data persistence services, however, are not currently active/active for data writes, which in turn prevents the Django application from running in an active/active configuration.

The MariaDB service that is available for "misc" service use (m5 in the case of Toolhub) provides primary/replica replication, and the direction of replication must be manually switched from eqiad->codfw to codfw->eqiad (some discussion at T271480#7229236). Writes to the replica database will not propagate to the primary and will very likely break replication via primary key collisions. This is the primary blocker to Toolhub serving live traffic from both eqiad and codfw. It could in theory be solved by creating a multi-master database section and moving Toolhub's data to it.
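
To illustrate the collision risk, here is a minimal Python sketch that simulates two servers each assigning auto-increment ids on their own. It is a toy model, not Toolhub or MariaDB code: `next_ids` and the starting id of 100 are hypothetical. The second half mirrors the idea behind MariaDB's `auto_increment_increment` and `auto_increment_offset` settings, which multi-master setups commonly use to keep id ranges disjoint. Whether a Wikimedia multi-master section would be configured this way is an assumption.

```python
def next_ids(last_id, count, increment=1, offset=1):
    """Return the next `count` ids a server would assign after `last_id`.

    Assigned ids always have the form offset + N * increment, which is a
    simplified model of how an auto-increment column behaves.
    """
    ids = []
    candidate = last_id + 1
    while len(ids) < count:
        if (candidate - offset) % increment == 0:
            ids.append(candidate)
        candidate += 1
    return ids


# Both datacenters last saw id 100 and accept writes independently.
eqiad = next_ids(100, 3)
codfw = next_ids(100, 3)
collisions = set(eqiad) & set(codfw)  # every new id is assigned twice

# Interleaved ranges: each server only assigns ids from its own residue class.
eqiad_safe = next_ids(100, 3, increment=2, offset=1)
codfw_safe = next_ids(100, 3, increment=2, offset=2)
safe_collisions = set(eqiad_safe) & set(codfw_safe)  # empty
```

Disjoint id ranges only remove the primary key collisions. Conflicting updates to the same row from both datacenters would still need separate handling.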

The Elasticsearch services in eqiad and codfw are managed as entirely separate clusters with no knowledge of each other. CirrusSearch deals with this by performing writes in both datacenters to keep the two clusters eventually consistent. Toolhub could take a similar approach if it had a cross-datacenter queue that allowed each datacenter to notify the other of new database content in need of indexing. Given the currently expected size of the Toolhub search index (very, very small), it would also be possible to simply reindex everything periodically, or to run a background job that compares the source of truth in the database with the local Elasticsearch index, to provide eventual consistency for search.
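
The background comparison job could be sketched roughly as follows. This is a minimal, self-contained outline rather than Toolhub's implementation: `Action`, `plan_reconciliation`, and the id-to-version mappings are hypothetical, and a real job would read the versions from the database and the local Elasticsearch index before applying the resulting actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    op: str  # "index" or "delete"
    doc_id: int


def plan_reconciliation(db_versions, index_versions):
    """Compare the database (source of truth) with the local search index.

    Both arguments map a document id to a version marker, for example a
    last-modified timestamp. Returns the actions needed to make the index
    match the database.
    """
    actions = []
    # Missing or stale documents must be (re)indexed.
    for doc_id, version in db_versions.items():
        if index_versions.get(doc_id) != version:
            actions.append(Action("index", doc_id))
    # Documents no longer in the database must be removed from the index.
    for doc_id in index_versions.keys() - db_versions.keys():
        actions.append(Action("delete", doc_id))
    return sorted(actions, key=lambda a: (a.op, a.doc_id))


db = {1: "2021-08-01", 2: "2021-08-05", 3: "2021-08-06"}
index = {1: "2021-08-01", 2: "2021-07-30", 4: "2021-06-01"}
plan = plan_reconciliation(db, index)
```

Because the index is expected to be very small, comparing every record on each run should be affordable. Each datacenter would run the job only against its own local cluster.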

Event Timeline

bd808 created this task.

Change 711763 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/cookbooks@master] switchdc: Exclude toolhub, lacking active/active db

https://gerrit.wikimedia.org/r/711763

Change 711763 merged by jenkins-bot:

[operations/cookbooks@master] switchdc: Exclude toolhub, lacking active/active db

https://gerrit.wikimedia.org/r/711763

Change 724462 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: disable crawler cron in codfw

https://gerrit.wikimedia.org/r/724462

Change 724462 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: disable crawler cron in codfw

https://gerrit.wikimedia.org/r/724462