
What should happen to Toolhub during the 2023 DC switch?
Closed, Resolved · Public · Bug Report

Description

Toolhub does not have a working Kubernetes deployment outside of eqiad (T288685: Establish active/active multi-dc support for Toolhub). Who should I work with to try and prevent this from causing problems for either Toolhub or SREs?

Services used:

  • Kubernetes cluster with HTTP ingress

Will remain functional apart from a maintenance window for the Kubernetes upgrade (T307943: Update Kubernetes clusters to v1.23)

  • m5 MariaDB cluster

m5 is not in scope for the database switchover; it will have some short RO periods (about 1 minute for a master switchover) for maintenance

  • search-chi-eqiad Elasticsearch cluster

No planned maintenance during the switchover

Event Timeline

Some IRC discussion from 2023-01-30 in the #wikimedia-serviceops channel:

[16:36:45] <bd808> I think I should start asking sooner rather than later how the DC switch (T327920) will effect Toolhub which currently only exists in eqiad largely because of a lack of an active-active master database for it (T288685).
[16:54:02] <jayme> bd808: we just talked about that it probably needs to move to tha aux cluster finally
[16:54:18] <jayme> cc claime
[16:55:12] <claime> yep, but we may want to upgrade the cluster to 1.23 before any service is on it though
[16:55:41] <bd808> cool. I didn't even know there was an aux cluster :)
[16:56:05] <_joe_> bd808: nobody expects the aux cluster!
[16:56:48] <cdanis> the aux cluster is also currently eqiad-only but that will very likely change at some point in the future
[16:57:20] <_joe_> I just want to go on record saying it's possible to connect to eqiad databases from codfw
[16:57:31] <_joe_> I don't think the traffic toolhub does makes it impossible to do
[16:57:55] <_joe_> but the latency might be killer
[16:58:39] <claime> bd808: It's new-ish https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#aux
[16:59:39] <bd808> The app actually doesn't talk to the db much outside of when the crawler job runs. Most of the data is fetched at runtime from the Elasticsearch cluster. But that would also need attention to become functional active-active.
[17:02:43] <bd808> _j.oe_ is correct though that it could be made to work with the db connection backhauled to eqiad if that actually has value.

@JMeybohm and @Clement_Goubert Do you expect the aux cluster to be ready for the Toolhub workload in time to make this part of the fix for avoiding disruption during the DC switch? Or should I invest in testing the feasibility of connecting to the m5 and search-chi-eqiad services from the codfw k8s cluster?
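If the feasibility-testing route ends up being the answer, a first pass could be a simple connect/latency probe run from a pod in the codfw cluster, along the lines of the sketch below. The host names and ports are placeholders, not the real m5 or search-chi-eqiad endpoints.

# Illustration only: rough TCP connect/latency probe from a codfw pod towards
# the eqiad backing services. Host names and ports are placeholders.
import socket
import time

TARGETS = [
    ("m5-master.eqiad.placeholder", 3306),    # hypothetical m5 MariaDB endpoint
    ("search-chi.eqiad.placeholder", 9200),   # hypothetical Elasticsearch endpoint
]

for host, port in TARGETS:
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable in {(time.monotonic() - start) * 1000:.1f} ms")
    except OSError as exc:
        print(f"{host}:{port} unreachable: {exc}")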

@Gehel is there any planned maintenance for the search-chi-eqiad cluster, during the time that codfw is the active DC, that active use by Toolhub would block?

@Marostegui is there any planned maintenance for the m5 cluster, during the time that codfw is the active DC, that active use by Toolhub would block?

There might be some maintenance due to the network switch maintenance in eqiad and/or due to kernel reboots.
But either way, it would be a master switchover, which implies around 1 minute of RO (like the usual maintenance).

bd808 triaged this task as High priority. Feb 9 2023, 8:31 PM
bd808 changed the subtype of this task from "Task" to "Bug Report".

@JMeybohm and @Clement_Goubert Do you expect the aux cluster to be ready for the Toolhub workload in time to make this part of the fix for avoiding disruption during the DC switch?

I don't think we should tie this to aux-k8s being ready, or to actually managing to migrate the application. We are less than 20 days away from the switchover; let's not constrain ourselves this way.

Or should I invest in testing the feasibility of connecting to the m5 and search-chi-eqiad services from the codfw k8s cluster?

We can also just skip moving the application to codfw, leave it in eqiad, and accept downtime during the various maintenance windows. The biggest one would be when the wikikube eqiad cluster is re-initialized, which should last a few hours (availability-wise, that would still put the service in the 99.95%-per-year bucket, which is pretty impressive). Assuming proper prior notification to the technical communities, of course.
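(For scale: 99.95% availability over a year allows roughly 0.0005 × 365 × 24 ≈ 4.4 hours of downtime, so a re-initialization window of a few hours fits inside that bucket.)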

I am also thinking about this from a database point of view. If we want to use the m5 databases in codfw _just_ for Toolhub, that is going to be very confusing. Right now m5 is RO in codfw and RW in eqiad.

MariaDB (or MySQL) doesn't have the ability to set RW for just one database; the read-only state is a global, server-wide flag (SET GLOBAL read_only applies to the whole instance; there is no per-schema equivalent). So if we want to enable RW on m5 in codfw, we should either move all services using eqiad m5 to codfw for the duration of the DC switchover, or we are going to end up with multi-master on m5, with both masters possibly being written to at the same time.

We've never thought of a situation where we'd have just one service using mX being moved, it would be either all of them or none of them.

Right now this is what lives in m5:

+---------------------+
| Database            |
+---------------------+
| cxserverdb          |
| heartbeat           |
| idm                 |
| idm_staging         |
| information_schema  |
| labsdbaccounts      |
| mailman3            |
| mailman3web         |
| mysql               |
| performance_schema  |
| striker             |
| sys                 |
| test_labsdbaccounts |
| toolhub             |
+---------------------+
14 rows in set (0.001 sec)

I am also thinking about this from a database point of view. If we want to use the m5 databases in codfw _just_ for Toolhub, that is going to be very confusing. Right now m5 is RO in codfw and RW in eqiad.

MariaDB (or MySQL) doesn't have the ability to set RW for just one database; the read-only state is a global, server-wide flag (SET GLOBAL read_only applies to the whole instance; there is no per-schema equivalent). So if we want to enable RW on m5 in codfw, we should either move all services using eqiad m5 to codfw for the duration of the DC switchover, or we are going to end up with multi-master on m5, with both masters possibly being written to at the same time.

Yes, that is why this was never the plan. Messing with RO/RW state of mX clusters has always been explicitly out of scope exactly because of all of this extra complexity.

@Gehel is there any planned maintenance for the search-chi-eqiad cluster, during the time that codfw is the active DC, that active use by Toolhub would block?

No planned maintenance on Elasticsearch, everything should work as expected.

Is there any blocker to making Toolhub in eqiad use the codfw db instead of the eqiad db during the switchover? It would make it slower, but not by much, especially given this:

<bd808> The app actually doesn't talk to the db much outside of when the crawler job runs.

Is there any blocker to making Toolhub in eqiad use the codfw db instead of the eqiad db during the switchover? It would make it slower, but not by much, especially given this:

<bd808> The app actually doesn't talk to the db much outside of when the crawler job runs.

Let's not do that; see my comment above. It could be a mess.

I think there is a misunderstanding; I'm not suggesting making eqiad RW in m5 (there won't be anything touching the db in eqiad). During the switchover, all of the eqiad db will be RO. The appserver will stay in eqiad but make cross-DC connections to codfw for database reads and writes, similar to what is planned for wikitech. Am I missing something here? (What I mean is during the switchover period where the main DC is codfw.)
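To make that concrete in Django terms (Toolhub is a Django application), the plan would amount to something like the sketch below. The host name is a placeholder rather than the real codfw m5 endpoint, and this is only meant to illustrate the proposal, not to recommend it.

# Sketch only: "appserver stays in eqiad, database lives in codfw" expressed as a
# Django DATABASES setting. The HOST value is a placeholder, not a real endpoint.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "toolhub",
        "HOST": "m5-master.codfw.placeholder",  # hypothetical cross-DC endpoint
        "PORT": "3306",
        "CONN_MAX_AGE": 300,  # persistent connections to amortize cross-DC latency
    }
}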

Right now this is what lives in m5:

+---------------------+
| Database            |
+---------------------+
| cxserverdb          |
| heartbeat           |
| idm                 |
| idm_staging         |
| information_schema  |
| labsdbaccounts      |
| mailman3            |
| mailman3web         |
| mysql               |
| performance_schema  |
| striker             |
| sys                 |
| test_labsdbaccounts |
| toolhub             |
+---------------------+
14 rows in set (0.001 sec)

(there won't be anything touching the db in eqiad)

I don't think that is correct based on the current m5 databases. I would expect that mailman3 and striker are both staying in eqiad while MediaWiki and its direct support services wander over to codfw. The labsdbaccounts db is for Toolforge's maintain-dbusers service which is certainly staying in eqiad along with all of the rest of Cloud VPS.

bd808 renamed this task from "Toolhub does not have a working Kubernetes deployment outside of eqiad" to "What should happen to Toolhub during the 2023 DC switch?". Feb 14 2023, 11:38 PM

I think there is a misunderstanding; I'm not suggesting making eqiad RW in m5 (there won't be anything touching the db in eqiad). During the switchover, all of the eqiad db will be RO. The appserver will stay in eqiad but make cross-DC connections to codfw for database reads and writes, similar to what is planned for wikitech. Am I missing something here? (What I mean is during the switchover period where the main DC is codfw.)

m5 is not covered by the CORE_SECTIONS constant in spicerack.mysql_legacy, so in the current state, it won't be set read-only.
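Purely as an illustration of why m5 is skipped (this is not the actual spicerack code, and the section list below is hypothetical): the switchover only acts on sections named in a CORE_SECTIONS-style constant, so any section outside that list is never touched.

# Illustration only -- not spicerack.mysql_legacy. A CORE_SECTIONS-style allow-list
# means sections that are not listed, such as "m5", are simply never visited.
CORE_SECTIONS = ("s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "x1")  # hypothetical contents

def sections_to_set_read_only(all_sections):
    """Return only the sections the switchover would act on."""
    return [section for section in all_sections if section in CORE_SECTIONS]

print(sections_to_set_read_only(["s1", "m5", "x1"]))  # -> ['s1', 'x1']; m5 is ignored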

I think there is a misunderstanding; I'm not suggesting making eqiad RW in m5 (there won't be anything touching the db in eqiad). During the switchover, all of the eqiad db will be RO. The appserver will stay in eqiad but make cross-DC connections to codfw for database reads and writes, similar to what is planned for wikitech. Am I missing something here? (What I mean is during the switchover period where the main DC is codfw.)

The mX sections aren't switched over to codfw; they have never been switched to codfw, and they won't be in this switchover either. So they will remain untouched and ignored during the switch.

The mX sections aren't switched over to codfw; they have never been switched to codfw, and they won't be in this switchover either. So they will remain untouched and ignored during the switch.

That was the context I was missing. My apologies for side-tracking the ticket.

Based on the various responses to my questions about specific dependencies (thanks to everyone who participated!), I think the broad answer is that Toolhub can remain in the eqiad wikikube cluster and continue to use both m5 and search-chi-eqiad as support services. This will have some known caveats. The biggest one seems to be that there will be multiple hours of downtime when the wikikube eqiad cluster is rebuilt with a newer Kubernetes version. There may also be other planned maintenance events for switches and machine reboots which reduce the uptime of the service.

Does that seem like a fair summary? Does anyone have an unaddressed concern?

That seems good to me, as long as you're OK with the downtime and maintenance windows.

That seems good to me, as long as you're OK with the downtime and maintenance windows.

It is not a perfect-world solution, but I think that, given the current constraints of time and effort, the downtime will be manageable.

Clement_Goubert claimed this task.
Clement_Goubert updated the task description. (Show Details)

I will consider this task resolved then; feel free to re-open if there are any more questions.