
What should happen to Toolhub during the 2023 DC switch?
Closed, Resolved · Public · Bug Report

Description

Toolhub does not have a working Kubernetes deployment outside of eqiad (T288685: Establish active/active multi-dc support for Toolhub). Who should I work with to try and prevent this from causing problems for either Toolhub or SREs?

Services used:

  • Kubernetes cluster with HTTP ingress

Will remain functional apart from a maintenance window for the Kubernetes upgrade (T307943: Update Kubernetes clusters to v1.23)

  • m5 MariaDB cluster

m5 is not in scope for the database switchover; it will have some short RO periods (about 1 minute for a master switchover) for maintenance

  • search-chi-eqiad Elasticsearch cluster

No planned maintenance during the switchover

Event Timeline

Some IRC discussion from 2023-01-30 in the #wikimedia-serviceops channel:

[16:36:45] <bd808> I think I should start asking sooner rather than later how the DC switch (T327920) will effect Toolhub which currently only exists in eqiad largely because of a lack of an active-active master database for it (T288685).
[16:54:02] <jayme> bd808: we just talked about that it probably needs to move to tha aux cluster finally
[16:54:18] <jayme> cc claime
[16:55:12] <claime> yep, but we may want to upgrade the cluster to 1.23 before any service is on it though
[16:55:41] <bd808> cool. I didn't even know there was an aux cluster :)
[16:56:05] <_joe_> bd808: nobody expects the aux cluster!
[16:56:48] <cdanis> the aux cluster is also currently eqiad-only but that will very likely change at some point in the future
[16:57:20] <_joe_> I just want to go on record saying it's possible to connect to eqiad databases from codfw
[16:57:31] <_joe_> I don't think the traffic toolhub does makes it impossible to do
[16:57:55] <_joe_> but the latency might be killer
[16:58:39] <claime> bd808: It's new-ish https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#aux
[16:59:39] <bd808> The app actually doesn't talk to the db much outside of when the crawler job runs. Most of the data is fetched at runtime from the Elasticsearch cluster. But that would also need attention to become functional active-active.
[17:02:43] <bd808> _j.oe_ is correct though that it could be made to work with the db connection backhauled to eqiad if that actually has value.

@JMeybohm and @Clement_Goubert Do you expect the aux cluster to be ready for the Toolhub workload in time to make this part of the fix for avoiding disruption during the DC switch? Or should I invest in testing the feasibility of connecting to the m5 and search-chi-eqiad services from the codfw k8s cluster?
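If the feasibility-testing route ends up being the answer, a first pass could be a simple connect/latency probe run from a pod in the codfw cluster, along the lines of the sketch below. The host names and ports are placeholders, not the real m5 or search-chi-eqiad endpoints.

# Illustration only: rough TCP connect/latency probe from a codfw pod towards
# the eqiad backing services. Host names and ports are placeholders.
import socket
import time

TARGETS = [
    ("m5-master.eqiad.placeholder", 3306),    # hypothetical m5 MariaDB endpoint
    ("search-chi.eqiad.placeholder", 9200),   # hypothetical Elasticsearch endpoint
]

for host, port in TARGETS:
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable in {(time.monotonic() - start) * 1000:.1f} ms")
    except OSError as exc:
        print(f"{host}:{port} unreachable: {exc}")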

@Gehel is there any planned maintenance for the search-chi-eqiad cluster, during the time that codfw is the active DC, that active use by Toolhub would block?

@Marostegui is there any planned maintenance for the m5 cluster, during the time that codfw is the active DC, that active use by Toolhub would block?

There might be some maintenance due to the network switch maintenance in eqiad and/or due to kernel reboots.
But either way, it would be a master switchover, which implies around 1 minute of RO (like the usual maintenance).

bd808 triaged this task as High priority. Feb 9 2023, 8:31 PM
bd808 changed the subtype of this task from "Task" to "Bug Report".

@JMeybohm and @Clement_Goubert Do you expect the aux cluster to be ready for the Toolhub workload in time to make this part of the fix for avoiding disruption during the DC switch?

I don't think we should tie this to aux-k8s being ready, or to actually managing to migrate the application. We are less than 20 days away from the switchover; let's not constrain ourselves this way.

Or should I invest in testing the feasibility of connecting to the m5 and search-chi-eqiad services from the codfw k8s cluster?

We can also just skip moving the application to codfw, leave it in eqiad, and accept downtime during the various maintenance windows. The biggest one would be when the wikikube eqiad cluster is re-initialized, which should last a few hours (availability-wise, that would still put the service in the 99.95%-per-year bucket, which is pretty impressive). Assuming proper prior notification to the technical communities, of course.
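(For scale: 99.95% availability over a year allows roughly 0.0005 × 365 × 24 ≈ 4.4 hours of downtime, so a re-initialization window of a few hours fits inside that bucket.)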

I am also thinking about this from a database point of view. If we want to use the m5 databases in codfw _just_ for Toolhub, that is going to be very confusing. Right now m5 is RO in codfw and RW in eqiad.

MariaDB (or MySQL) doesn't have the ability to set RW for just one database; the read-only state is a global, server-wide flag (SET GLOBAL read_only applies to the whole instance; there is no per-schema equivalent). So if we want to enable RW on m5 in codfw, we should either move all services using eqiad m5 to codfw for the duration of the DC switchover, or we are going to end up with multi-master on m5, with both masters possibly being written to at the same time.

We've never thought of a situation where we'd have just one service using mX being moved, it would be either all of them or none of them.

Right now this is what lives in m5:

+---------------------+
| Database            |
+---------------------+
| cxserverdb          |
| heartbeat           |
| idm                 |
| idm_staging         |
| information_schema  |
| labsdbaccounts      |
| mailman3            |
| mailman3web         |
| mysql               |
| performance_schema  |
| striker             |
| sys                 |
| test_labsdbaccounts |
| toolhub             |
+---------------------+
14 rows in set (0.001 sec)

I am also thinking about this from a database point of view. If we want to use the m5 databases in codfw _just_ for Toolhub, that is going to be very confusing. Right now m5 is RO in codfw and RW in eqiad.

MariaDB (or MySQL) doesn't have the ability to set RW for just one database; the read-only state is a global, server-wide flag (SET GLOBAL read_only applies to the whole instance; there is no per-schema equivalent). So if we want to enable RW on m5 in codfw, we should either move all services using eqiad m5 to codfw for the duration of the DC switchover, or we are going to end up with multi-master on m5, with both masters possibly being written to at the same time.

Yes, that is why this was never the plan. Messing with RO/RW state of mX clusters has always been explicitly out of scope exactly because of all of this extra complexity.

@Gehel is there any planned maintenance for the search-chi-eqiad cluster, during the time that codfw is the active DC, that active use by Toolhub would block?

No planned maintenance on Elasticsearch, everything should work as expected.

Is there any blocker to making Toolhub in eqiad use the codfw db instead of the eqiad db during the switchover? It would make it slower, but not by much, especially given this:

<bd808> The app actually doesn't talk to the db much outside of when the crawler job runs.

Is there any blocker to making Toolhub in eqiad use the codfw db instead of the eqiad db during the switchover? It would make it slower, but not by much, especially given this:

<bd808> The app actually doesn't talk to the db much outside of when the crawler job runs.

Let's not do that; see my comment above. It could be a mess.

I think there is a misunderstanding; I'm not suggesting making eqiad RW in m5 (there won't be anything touching the db in eqiad). During the switchover, all of the eqiad db will be RO. The appserver will stay in eqiad but make cross-DC connections to codfw for database reads and writes, similar to what is planned for wikitech. Am I missing something here? (What I mean is during the switchover period where the main DC is codfw.)
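To make that concrete in Django terms (Toolhub is a Django application), the plan would amount to something like the sketch below. The host name is a placeholder rather than the real codfw m5 endpoint, and this is only meant to illustrate the proposal, not to recommend it.

# Sketch only: "appserver stays in eqiad, database lives in codfw" expressed as a
# Django DATABASES setting. The HOST value is a placeholder, not a real endpoint.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "toolhub",
        "HOST": "m5-master.codfw.placeholder",  # hypothetical cross-DC endpoint
        "PORT": "3306",
        "CONN_MAX_AGE": 300,  # persistent connections to amortize cross-DC latency
    }
}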

Right now this is what lives in m5:

+---------------------+
| Database            |
+---------------------+
| cxserverdb          |
| heartbeat           |
| idm                 |
| idm_staging         |
| information_schema  |
| labsdbaccounts      |
| mailman3            |
| mailman3web         |
| mysql               |
| performance_schema  |
| striker             |
| sys                 |
| test_labsdbaccounts |
| toolhub             |
+---------------------+
14 rows in set (0.001 sec)

(there won't be anything touching the db in eqiad)

I don't think that is correct based on the current m5 databases. I would expect that mailman3 and striker are both staying in eqiad while MediaWiki and its direct support services wander over to codfw. The labsdbaccounts db is for Toolforge's maintain-dbusers service which is certainly staying in eqiad along with all of the rest of Cloud VPS.

bd808 renamed this task from "Toolhub does not have a working Kubernetes deployment outside of eqiad" to "What should happen to Toolhub during the 2023 DC switch?". Feb 14 2023, 11:38 PM

I think there is a misunderstanding; I'm not suggesting making eqiad RW in m5 (there won't be anything touching the db in eqiad). During the switchover, all of the eqiad db will be RO. The appserver will stay in eqiad but make cross-DC connections to codfw for database reads and writes, similar to what is planned for wikitech. Am I missing something here? (What I mean is during the switchover period where the main DC is codfw.)

m5 is not covered by the CORE_SECTIONS constant in spicerack.mysql_legacy, so in the current state, it won't be set read-only.
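Purely as an illustration of why m5 is skipped (this is not the actual spicerack code, and the section list below is hypothetical): the switchover only acts on sections named in a CORE_SECTIONS-style constant, so any section outside that list is never touched.

# Illustration only -- not spicerack.mysql_legacy. A CORE_SECTIONS-style allow-list
# means sections that are not listed, such as "m5", are simply never visited.
CORE_SECTIONS = ("s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "x1")  # hypothetical contents

def sections_to_set_read_only(all_sections):
    """Return only the sections the switchover would act on."""
    return [section for section in all_sections if section in CORE_SECTIONS]

print(sections_to_set_read_only(["s1", "m5", "x1"]))  # -> ['s1', 'x1']; m5 is ignored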

I think there is a misunderstanding; I'm not suggesting making eqiad RW in m5 (there won't be anything touching the db in eqiad). During the switchover, all of the eqiad db will be RO. The appserver will stay in eqiad but make cross-DC connections to codfw for database reads and writes, similar to what is planned for wikitech. Am I missing something here? (What I mean is during the switchover period where the main DC is codfw.)

The mX sections aren't switched over to codfw; they have never been switched to codfw, and they won't be in this switchover either. So they will remain untouched and ignored during the switch.

The mX sections aren't switched over to codfw; they have never been switched to codfw, and they won't be in this switchover either. So they will remain untouched and ignored during the switch.

That was the context I was missing. My apologies for side-tracking the ticket.

Based on the various responses to my questions about specific dependencies (thanks to everyone who participated!), I think the broad answer is that Toolhub can remain in the eqiad wikikube cluster and continue to use both m5 and search-chi-eqiad as support services. This will have some known caveats. The biggest one seems to be that there will be multiple hours of downtime when the wikikube eqiad cluster is rebuilt with a newer Kubernetes version. There may also be other planned maintenance events for switches and machine reboots which reduce the uptime of the service.

Does that seem like a fair summary? Does anyone have an unaddressed concern?

That seems good to me, as long as you're OK with the downtime and maintenance windows.

That seems good to me, as long as you're OK with the downtime and maintenance windows.

It is not a perfect-world solution, but I think that, given the current constraints of time and effort, the downtime will be manageable.

Clement_Goubert claimed this task.
Clement_Goubert updated the task description. (Show Details)

I will consider this task resolved then; feel free to re-open if there are any more questions.