Re-think how we separate traffic to mediawiki in clusters.
Open, High, Public

Description

We currently have 4 mediawiki clusters:

  • api
  • app
  • jobrunners/videoscalers
  • parsoid

Now that we are moving to k8s, there have been thoughts that maybe we could use a different configuration to service our needs.

Specifically:

  • One cluster to serve live users. This means any wiki page (so /wiki/... or /w/index.php) or any API calls (/w/{rest,api}.php with a session token) NOT coming from a public cloud. This will be split into -rw and -ro subcategories, as it will be served by all datacenters
  • One cluster to serve external API requests and any other request from the public clouds. This will also have both rw and ro endpoints
  • One cluster to serve calls from Toolforge. We might want to join this with the external API cluster, at least at first, but I think it's good to keep the two groups separated.
  • One cluster to serve internal requests, for example when a service needs to make an API call to prepare a response to a live client. This might potentially be the same cluster as the one serving live users.
  • One cluster to serve async processing, which would include MediaWiki jobs but also calls from other services that need to update their cached content, like restbase-async, the WDQS Updater, or the upcoming Search Update pipeline
  • One cluster (probably on baremetal at least at the start!) for running videoscaling.
  • One cluster (of 2/4 pods) for mwdebug/testing
  • One cluster (of 2/4 pods) for wikitech
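
To make the proposed split concrete, here is a minimal sketch of how a request could be classified into one of the traffic clusters listed above. It is an illustration only: the `Request` fields, the cluster labels and the order of the checks are assumptions made for this example, not an existing routing layer, and the -rw/-ro distinction is left out.

```python
from dataclasses import dataclass

# Hypothetical request descriptor; none of these field names come from an
# existing system, they only mirror the criteria named in the description.
@dataclass(frozen=True)
class Request:
    path: str
    has_session: bool = False        # carries a session token
    from_public_cloud: bool = False  # source IP belongs to a public cloud
    from_toolforge: bool = False
    internal: bool = False           # issued by another production service
    asynchronous: bool = False       # jobs, cache updaters and similar

API_PATHS = ("/w/api.php", "/w/rest.php")

def is_page_view(path: str) -> bool:
    return path.startswith("/wiki/") or path.startswith("/w/index.php")

def is_api_call(path: str) -> bool:
    return path.startswith(API_PATHS)

def pick_cluster(req: Request) -> str:
    """Return an illustrative cluster label for a request."""
    if req.asynchronous:
        return "async"
    if req.internal:
        return "internal"
    if req.from_toolforge:
        return "toolforge"
    if req.from_public_cloud:
        # Anything from a public cloud goes to the external cluster,
        # wiki pages included.
        return "external"
    if is_page_view(req.path):
        return "live-users"
    if is_api_call(req.path) and req.has_session:
        return "live-users"
    # Remaining traffic (e.g. sessionless API calls) is treated as external.
    return "external"
```

For instance, `pick_cluster(Request("/w/api.php", has_session=True))` lands on the live-users cluster, while the same call without a session token falls through to the external one.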

The reason for the proposal is separation of concerns: we want to be able to prioritize, under duress, the live users over anything else. Ideally we would have a simple script that allows us to scale down everything else and give our full power to our live users. There is a special provision here for Toolforge because we know a lot of important tools run there, and they are fundamental to the good functioning of the wikis.
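
As a rough sketch of what such a script might compute, the function below takes the current replica count per cluster and returns a plan that shrinks everything except the protected clusters, handing the freed capacity to the live-user cluster. The cluster labels, the `floor` parameter and the assumption that replicas are interchangeable across clusters are all hypothetical; actually applying the plan to kubernetes is out of scope here.

```python
def duress_plan(replicas, favoured="live-users",
                protected=frozenset({"live-users", "toolforge"}), floor=1):
    """Compute replica counts that prioritize the favoured cluster.

    replicas  -- mapping of cluster label to current replica count
    favoured  -- cluster that receives all freed capacity
    protected -- clusters that are never scaled down
    floor     -- replicas left on every scaled-down cluster
    """
    plan = {}
    for name, count in replicas.items():
        if name in protected:
            plan[name] = count
        else:
            # Never scale a cluster *up* to the floor, only down to it.
            plan[name] = min(count, floor)
    freed = sum(replicas.values()) - sum(plan.values())
    plan[favoured] = plan.get(favoured, 0) + freed
    return plan
```

The total number of replicas stays constant, so the plan only redistributes capacity that already exists.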

Event Timeline

Joe triaged this task as High priority. Sep 29 2021, 6:42 AM
Joe subscribed.

I think the title is misleading, I spent 10 minutes trying to figure out what partitioning schemes had to do with moving to kubernetes :D Amending it.

Joe renamed this task from Repartition mediawiki servers to Re-think how we separate traffic to mediawiki in clusters. Sep 29 2021, 6:43 AM

The first scenario I proposed in T290536 goes as follows:

  • One cluster for first deploy/debug purposes (kube-mwdebug)
  • One cluster to serve internal requests to the API (and possibly to wiki pages) (kube-api-internal)
  • One cluster to serve public API traffic (kube-api-external)
  • One cluster to serve the website, both mobile and desktop (kube-wikis)
  • One cluster for jobrunning (kube-jobrunner)
  • One cluster for videoscaling (if we can't move it to shellbox)

We could then split these clusters further between group0/group1/group2 wikis, or across database sections. That would probably be done with a logical split at the kubernetes level (with, say, an ingress, or by announcing service IPs) rather than at the LVS level, and it represents a second layer of complication that we shouldn't get into right now.

I think this is a sensible separation, we have the flexibility to make adjustments as things progress.

Naming things is hard, though. I do not agree with the kube prefix: in the future, after baremetal mediawiki servers are gone, it will be a prefix that does not mean much. Moreover, we probably want the discovery URLs to derive from the names of the services. I would propose:

  • mw-debug
  • mw-api-int
  • mw-api-ext
  • mw-app (we have been calling them app for years, we know it mostly means our real user traffic)
  • mw-jobrunner
  • mw-videoscaling

The kube- prefix is only needed now; it will not be needed once we've moved everything to kubernetes, as we will be able to remove the conditionals from mediawiki-config that depend on the servergroup.
Of course, the servergroup is just a label that has a specific use, we can mostly ignore it.

I forgot to add: we probably also want to migrate wikitech early in the process. It will need us to add php-ldap to our debug image, but it should allow us to dogfood the new installation early *and* to normalize as much as possible wikitech in the process.

Understood. I believe, though, that using the 'finalised' naming scheme from the get-go will save us some cleaning up in the future.

I forgot to add: we probably also want to migrate wikitech early in the process. It will need us to add php-ldap to our debug image, but it should allow us to dogfood the new installation early *and* to normalize as much as possible wikitech in the process.

We can have a go soon enough even using the mwdebug service we have now, or create a new one for it.

Just to clarify:

the kube- prefix is only needed for the SERVERGROUP env variable we pass to mediawiki. We can use kube- there, and mw- for the namespaces.
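
A small sketch of that split, assuming the namespace is the public-facing name and the servergroup is only a label injected into the pod environment. The pairing of each mw- namespace with a kube- servergroup below is a guess that reuses names from this thread; the real values may differ, and `pod_env` is a hypothetical helper.

```python
# Assumed pairing of namespace -> SERVERGROUP label (not confirmed anywhere).
SERVERGROUP_BY_NAMESPACE = {
    "mw-debug": "kube-mwdebug",
    "mw-api-int": "kube-api-internal",
    "mw-api-ext": "kube-api-external",
    "mw-jobrunner": "kube-jobrunner",
}

def pod_env(namespace: str) -> dict:
    """Return the environment passed to mediawiki for a given namespace."""
    try:
        servergroup = SERVERGROUP_BY_NAMESPACE[namespace]
    except KeyError:
        raise ValueError(f"no servergroup defined for namespace {namespace!r}")
    return {"SERVERGROUP": servergroup}
```

Keeping the mapping in one place means the kube- labels can be dropped later without renaming any namespace.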

Naming things is hard, though. I do not agree with the kube prefix: in the future, after baremetal mediawiki servers are gone, it will be a prefix that does not mean much. Moreover, we probably want the discovery URLs to derive from the names of the services. I would propose:

  • mw-debug
  • mw-api-int
  • mw-api-ext
  • mw-app (we have been calling them app for years, we know it mostly means our real user traffic)
  • mw-jobrunner
  • mw-videoscaling

I agree with most of these with regard to naming, except mw-app. People who have been here long may know it's the real user traffic, but it's not clear if you do not already have that background info. A migration like this is an occasion to revisit "tradition" and fix meaning drift.
A few suggestions:

  • mw-main <= Frontrunner for me
  • mw-live
  • mw-web

mw-main is probably the least misleading one, yes. I would like mw-web more, but it's going to mislead a lot of people into thinking it's just requests to wiki pages, while I'd like to also add api requests coming from browsers (to a certain approximation).

This may be a stupid question but why would the api requests coming from browsers not go to the endpoint mapped to mw-api-ext?

Given that main is kind of generic, I think I would vote for mw-web.

This may be a stupid question but why would the api requests coming from browsers not go to the endpoint mapped to mw-api-ext?

Because the api traffic from actual users (which means, if we refer to the mediawiki api and coming from outside of production, browsers and the mobile apps) is what we want to preserve. In moments of high pressure, we could decide to dedicate more resources to that pool rather than to the api pool for bots, who can safely retry later or wait in queue for 10 seconds if needed.

Given both of your answers, I think mw-web is actually the better choice, barring calling it mw-real-users which is kind of weird.
The API calls from browsers still fall under the "web-browsing" umbrella from a functional and grouping standpoint so the moniker still makes sense, and it is more descriptive of what the cluster actually does than the generic mw-main.
Thoughts?

Since there seems to be consensus on everything but mw-{app,main,web}, I'll consider these other service names as valid going forward unless told otherwise:

  • mw-debug
  • mw-api-int
  • mw-api-ext
  • mw-jobrunner
  • mw-videoscaling

NB: This will require switching the current mwdebug service to the new mw-debug service name, with all the configuration changes that entails in the traffic layer, the service discovery and the custom deployment scripts.

Settling on mw-web as there's been no contrary opinion in a week.