Maniphest T291918

Re-think how we separate traffic to mediawiki in clusters.
Closed, Resolved · Public

Authored By

	jijiki
	Sep 28 2021, 11:23 AM

Description

We currently have 4 mediawiki clusters:

api
app
jobrunners/videoscalers
parsoid

Now that we are moving to k8s, there have been thoughts that maybe we could use a different configuration to service our needs.

Specifically:

One cluster to serve live users. This means any wiki page (so /wiki/... or /w/index.php) or any API call (/w/{rest,api}.php with a session token) NOT coming from a public cloud. This will be split into -rw and -ro subcategories, as it will be served by all datacenters
One cluster to serve external API requests and any other request from the public clouds. This will also have both rw and ro endpoints
One cluster to serve calls from Toolforge. We might want to join this with the external API cluster at least at first, but I think it's good to keep the two groups separated.
One cluster to serve internal requests, for example when a service needs to make an API call to prepare a response to a live client. This might potentially be the same cluster as the one serving live users.
One cluster to serve asynchronous processing, which would include MediaWiki jobs but also calls from other services that need to update their cached content, like restbase-async, the WDQS Updater, or the upcoming Search Update pipeline
One cluster (probably on baremetal at least at the start!) for running videoscaling.
One cluster (of 2/4 pods) for mwdebug/testing
One cluster (of 2/4 pods) for wikitech
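
As a rough illustration of the request-facing part of this split, here is a minimal sketch of the routing decision. The pool labels, the `Request` fields, and the `origin` values are all hypothetical (the description does not name the clusters or say how origins are detected), and the fallback for paths the description does not mention is an assumption. The async, videoscaling, mwdebug and wikitech clusters are left out because they are not selected per request in this way.

```python
from dataclasses import dataclass

# Hypothetical pool labels; the task description does not name the clusters.
LIVE_USERS = "live-users"
EXTERNAL_API = "external-api"
TOOLFORGE = "toolforge"
INTERNAL = "internal"


@dataclass(frozen=True)
class Request:
    path: str
    has_session: bool = False  # request carries a session token
    # Assumed origin values: "public", "public-cloud", "toolforge", "internal"
    origin: str = "public"


def is_wiki_page(path: str) -> bool:
    return path.startswith("/wiki/") or path.startswith("/w/index.php")


def is_api_call(path: str) -> bool:
    return path.startswith("/w/api.php") or path.startswith("/w/rest.php")


def pick_pool(req: Request) -> str:
    """Pick the pool for a request, following the split in the description."""
    if req.origin == "internal":
        return INTERNAL
    if req.origin == "toolforge":
        return TOOLFORGE
    if req.origin == "public-cloud":
        # Any request from a public cloud goes to the external pool.
        return EXTERNAL_API
    if is_wiki_page(req.path) or (is_api_call(req.path) and req.has_session):
        return LIVE_USERS
    # Assumption: anything else (e.g. sessionless API calls) is external API traffic.
    return EXTERNAL_API
```

The origin checks come first so that a wiki page fetched from a public cloud never lands in the live-users pool, which matches the "NOT coming from a public cloud" condition above.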

The reason for the proposal is separation of concerns: we want to be able to prioritize live users over anything else when under duress. Ideally we have a simple script that allows us to scale down everything else and give our full power to our live users. There is a special provision here for Toolforge because we know a lot of important tools run there, which are fundamental to the good functioning of the wikis.
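
The "simple script" idea could look roughly like the sketch below. It only computes a new replica plan and does not talk to any cluster. The pool names, the `floor` and `protected` parameters, and the assumption that replicas are interchangeable between pools are all illustrative, not something this task specifies.

```python
def shed_plan(
    replicas: dict[str, int],
    live_pool: str,
    protected: dict[str, int],
    floor: int = 1,
) -> dict[str, int]:
    """Scale every pool except live_pool down and hand the freed replicas to live_pool.

    protected maps a pool name to the minimum it should keep (e.g. Toolforge);
    pools not listed there are scaled down to floor.
    """
    plan: dict[str, int] = {}
    freed = 0
    for pool, count in replicas.items():
        if pool == live_pool:
            continue
        # Never scale a pool up here, only down to its minimum.
        keep = min(count, max(floor, protected.get(pool, floor)))
        plan[pool] = keep
        freed += count - keep
    plan[live_pool] = replicas[live_pool] + freed
    return plan


# Hypothetical replica counts, for illustration only.
current = {"live-users": 10, "external-api": 8, "toolforge": 4, "async": 6}
plan = shed_plan(current, "live-users", protected={"toolforge": 3})
```

With these made-up numbers, the plan keeps 3 replicas for Toolforge, 1 each for the external API and async pools, and gives the remaining 23 to live users. The total stays at 28.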

Related Objects

Mentioned In: T321786: Deploy mediawiki kubernetes services
T321201: Deploy new mw-debug service
T290536: Serve production traffic via Kubernetes
Mentioned Here: T290536: Serve production traffic via Kubernetes

Event Timeline

jijiki created this task. Sep 28 2021, 11:23 AM

Restricted Application added a subscriber: Aklapper. Sep 28 2021, 11:23 AM

• taavi subscribed. Sep 28 2021, 11:24 AM

RhinosF1 subscribed. Sep 28 2021, 11:40 AM

jijiki mentioned this in T290536: Serve production traffic via Kubernetes. Sep 28 2021, 11:43 AM

I think the title is misleading, I spent 10 minutes trying to figure out what partitioning schemes had to do with moving to kubernetes :D Amending it.

Joe renamed this task from "Repartition mediawiki servers" to "Re-think how we separate traffic to mediawiki in clusters." Sep 29 2021, 6:43 AM

The first scenario I proposed in T290536 goes as follows:

One cluster for first deploy/debug purposes (kube-mwdebug)
One cluster to serve internal requests to the API (and possibly to wiki pages) (kube-api-internal)
One cluster to serve public API traffic (kube-api-external)
One cluster to serve the website, both mobile and desktop (kube-wikis)
One cluster for jobrunning (kube-jobrunner)
One cluster for videoscaling (if we can't move it to shellbox)

We could then split these clusters further between group0/group1/group2 wikis, or across database sections, but that would probably be done with a logical split at the kubernetes level (with, say, an ingress, or by announcing service IPs) and not at the LVS level. It represents a second layer of complication that we shouldn't get into right now.

In T291918#7386927, @Joe wrote:

The first scenario I proposed in T290536 goes as follows:

One cluster for first deploy/debug purposes (kube-mwdebug)
One cluster to serve internal requests to the API (and possibly to wiki pages) (kube-api-internal)
One cluster to serve public API traffic (kube-api-external)
One cluster to serve the website, both mobile and desktop (kube-wikis)
One cluster for jobrunning (kube-jobrunner)
One cluster for videoscaling (if we can't move it to shellbox)

We could then split these clusters further between group0/group1/group2 wikis, or across database sections, but that would probably be done with a logical split at the kubernetes level (with, say, an ingress, or by announcing service IPs) and not at the LVS level. It represents a second layer of complication that we shouldn't get into right now.

I think this is a sensible separation, we have the flexibility to make adjustments as things progress.

Naming things is hard, though. I do not agree with the kube prefix: in the future, after baremetal mediawiki servers are gone, it will be a prefix that does not mean much. Moreover, we probably want the discovery URLs to derive from the names of the services. I would propose:

mw-debug
mw-api-int
mw-api-ext
mw-app (we have been calling them app for years, we know it mostly means our real user traffic)
mw-jobrunner
mw-videoscaling
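
To illustrate what "discovery URLs derived from the service name" could mean, here is a minimal sketch. The `discovery.wmnet` suffix and the `-ro`/`-rw` suffix convention are assumptions made for the example; this comment does not specify either.

```python
from typing import Optional

# Assumption: discovery records live under this domain.
DISCOVERY_DOMAIN = "discovery.wmnet"

# Service names as proposed in this comment.
SERVICES = [
    "mw-debug",
    "mw-api-int",
    "mw-api-ext",
    "mw-app",
    "mw-jobrunner",
    "mw-videoscaling",
]


def discovery_name(service: str, mode: Optional[str] = None) -> str:
    """Derive a discovery hostname from a service name.

    mode is None, "ro" or "rw"; the suffix scheme is hypothetical.
    """
    if mode not in (None, "ro", "rw"):
        raise ValueError(f"unknown mode: {mode!r}")
    suffix = f"-{mode}" if mode else ""
    return f"{service}{suffix}.{DISCOVERY_DOMAIN}"


names = {service: discovery_name(service) for service in SERVICES}
```

The point of the sketch is only that a stable service name gives a predictable hostname, so nothing has to be renamed later when the baremetal servers go away.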

In T291918#7387656, @jijiki wrote:

Naming things is hard, though. I do not agree with the kube prefix: in the future, after baremetal mediawiki servers are gone, it will be a prefix that does not mean much. Moreover, we probably want the discovery URLs to derive from the names of the services. I would propose:

The kube- prefix is only needed now; it will not be needed once we've moved everything to kubernetes, as we will be able to remove the conditionals from mediawiki-config that depend on the servergroup.
Of course, the servergroup is just a label that has a specific use; we can mostly ignore it.

I forgot to add: we probably also want to migrate wikitech early in the process. It will require us to add php-ldap to our debug image, but it should allow us to dogfood the new installation early *and* to normalize wikitech as much as possible in the process.

In T291918#7387775, @Joe wrote:

In T291918#7387656, @jijiki wrote:

Naming things is hard, though. I do not agree with the kube prefix: in the future, after baremetal mediawiki servers are gone, it will be a prefix that does not mean much. Moreover, we probably want the discovery URLs to derive from the names of the services. I would propose:

The kube- prefix is only needed now; it will not be needed once we've moved everything to kubernetes, as we will be able to remove the conditionals from mediawiki-config that depend on the servergroup.
Of course, the servergroup is just a label that has a specific use; we can mostly ignore it.

Understood. I believe, though, that using the 'finalised' naming scheme from the get-go will save us some cleanup in the future.

In T291918#7387778, @Joe wrote:

I forgot to add: we probably also want to migrate wikitech early in the process. It will require us to add php-ldap to our debug image, but it should allow us to dogfood the new installation early *and* to normalize wikitech as much as possible in the process.

We can have a go soon enough even using the mwdebug service we have now, or create a new one for it.

jijiki moved this task from Incoming 🐫 to 🙈🙉🙊Backlog on the serviceops board. Sep 28 2022, 2:20 PM

Joe updated the task description. Sep 30 2022, 7:51 AM

In T291918#7387775, @Joe wrote:

In T291918#7387656, @jijiki wrote:

Naming things is hard, though. I do not agree with the kube prefix: in the future, after baremetal mediawiki servers are gone, it will be a prefix that does not mean much. Moreover, we probably want the discovery URLs to derive from the names of the services. I would propose:

The kube- prefix is only needed now; it will not be needed once we've moved everything to kubernetes, as we will be able to remove the conditionals from mediawiki-config that depend on the servergroup.
Of course, the servergroup is just a label that has a specific use; we can mostly ignore it.

Just to clarify:

the kube- prefix is only needed for the SERVERGROUP env variable we pass to mediawiki. We can use kube- there, and mw- for the namespaces.
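
A minimal sketch of that split, assuming the SERVERGROUP value can be derived mechanically by swapping the namespace's mw- prefix for kube-. That derivation is an assumption for illustration: the thread does not define the actual mapping, and the kube- names listed earlier (for example kube-api-internal) do not line up one-to-one with the mw- names.

```python
def pod_env(namespace: str) -> dict[str, str]:
    """Build the env passed to MediaWiki for a given mw-* namespace.

    Assumption: SERVERGROUP is the namespace with "mw-" replaced by "kube-".
    """
    prefix = "mw-"
    if not namespace.startswith(prefix):
        raise ValueError(f"not a MediaWiki namespace: {namespace!r}")
    return {"SERVERGROUP": "kube-" + namespace[len(prefix):]}


env = pod_env("mw-api-int")
```

Keeping the kube- label confined to one environment variable means the namespaces and service names can use the final mw- scheme from the start.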

In T291918#7387656, @jijiki wrote:

Naming things is hard, though. I do not agree with the kube prefix: in the future, after baremetal mediawiki servers are gone, it will be a prefix that does not mean much. Moreover, we probably want the discovery URLs to derive from the names of the services. I would propose:

mw-debug
mw-api-int
mw-api-ext
mw-app (we have been calling them app for years, we know it mostly means our real user traffic)
mw-jobrunner
mw-videoscaling

I agree with most of these with regard to naming, except mw-app. People who have been here long may know it's the real user traffic, but it's not clear if you do not already have that background info. A migration like this is an occasion to revisit "tradition" and fix meaning drift.
A few suggestions:

mw-main <= Frontrunner for me
mw-live
mw-web

mw-main is probably the least misleading one, yes. I would like mw-web more, but it's going to mislead a lot of people into thinking it's just requests to wiki pages, while I'd like to also add api requests coming from browsers (to a certain approximation).

This may be a stupid question but why would the api requests coming from browsers not go to the endpoint mapped to mw-api-ext?

In T291918#8328103, @Clement_Goubert wrote:

In T291918#7387656, @jijiki wrote:

Naming things is hard, though. I do not agree with the kube prefix: in the future, after baremetal mediawiki servers are gone, it will be a prefix that does not mean much. Moreover, we probably want the discovery URLs to derive from the names of the services. I would propose:

mw-debug
mw-api-int
mw-api-ext
mw-app (we have been calling them app for years, we know it mostly means our real user traffic)
mw-jobrunner
mw-videoscaling

I agree with most of these with regard to naming, except mw-app. People who have been here long may know it's the real user traffic, but it's not clear if you do not already have that background info. A migration like this is an occasion to revisit "tradition" and fix meaning drift.
A few suggestions:

mw-main <= Frontrunner for me
mw-live
mw-web

Given that main is kind of generic, I think I would vote for mw-web.

In T291918#8328319, @Clement_Goubert wrote:

This may be a stupid question but why would the api requests coming from browsers not go to the endpoint mapped to mw-api-ext?

Because the API traffic from actual users (which, for the MediaWiki API and traffic coming from outside of production, means browsers and the mobile apps) is what we want to preserve. In moments of high pressure, we could decide to dedicate more resources to that pool rather than to the API pool for bots, which can safely retry later or wait in a queue for 10 seconds if needed.

Given both of your answers, I think mw-web is actually the better choice, barring calling it mw-real-users which is kind of weird.
The API calls from browsers still fall under the "web-browsing" umbrella from a functional and grouping standpoint so the moniker still makes sense, and it is more descriptive of what the cluster actually does than the generic mw-main.
Thoughts?

Since there seems to be consensus on everything but mw-{app,main,web}, I'll consider these other service names as valid going forward unless told otherwise:

mw-debug
mw-api-int
mw-api-ext
mw-jobrunner
mw-videoscaling

NB: This will require switching the current mwdebug service to the new mw-debug service name, with all the configuration changes that entails in the traffic layer, the service discovery and the custom deployment scripts.

Clement_Goubert moved this task from Backlog to In Progress on the MW-on-K8s board. Oct 19 2022, 1:30 PM

Clement_Goubert mentioned this in T321201: Deploy new mw-debug service. Oct 19 2022, 2:28 PM

Settling on mw-web as there's been no contrary opinion in a week.

Clement_Goubert mentioned this in T321786: Deploy mediawiki kubernetes services. Oct 27 2022, 10:43 AM

Krinkle updated the task description. Oct 27 2022, 4:15 PM

• dancy subscribed. Nov 10 2022, 3:39 PM