
Serve production traffic via Kubernetes
Open, Medium, Public

Description

As we are getting closer and closer to a fully functional MW-on-K8s image, we can start discussing our testing-in-production and rollout options.
(Task description will be updated as we are figuring out our next steps)

Background

History
When we migrated to PHP7, users were served by PHP7 based on the existence of a cookie, set by MediaWiki, and the X-Seven header, set at the traffic layer. On the application layer, Apache would route a request to the relevant backend, HHVM or php-fpm.

Having a cookie allowed us to first let beta users in, and in turn progressively increase the amount of anonymous user traffic served via PHP7. We then continued by converting API servers to php7_only servers, and finally converted all jobrunners. Additionally, we split the outer caching layer into PHP7-rendered pages and HHVM-rendered pages (vary-slotting). Note, though, that we did not do this for parser cache. At the time we were only using Varnish, so all of this logic was written in VCL, with some additional Apache config.

Now
This time, the migration is slightly different:

  • The caching layer consists of Varnish and ATS (VCL and Lua)
  • The decision of where to route an incoming request will be made at the caching layer
  • We have 4 MediaWiki clusters: api, app, jobrunners, and parsoid
  • We are older

Proposed Plans

After a brief discussion with Traffic and Performance-Team, we have:

Proposal #1: URL routing

Given that app and api servers share the same configuration, and assuming that initially we will have the same discovery URL, e.g. mw-k8s-rw.discovery.wmnet,
we can start by routing some low-traffic URLs to Kubernetes, for example https://en.wikipedia.org/wiki/Barack_Obama. When we are more comfortable, we can start migrating some small wikis, and eventually migrate them all.
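
As a rough illustration of the decision (a sketch only; the real logic would live in VCL/Lua at the traffic layer, and the bare-metal backend name below is an assumption):

```python
# Sketch of URL-based routing; not the actual VCL/Lua implementation.
K8S_BACKEND = "mw-k8s-rw.discovery.wmnet"
BAREMETAL_BACKEND = "appservers-rw.discovery.wmnet"  # assumed name for the current bare-metal backend

# Start with a handful of low-traffic URLs, later grow to whole wikis.
K8S_URLS = {("en.wikipedia.org", "/wiki/Barack_Obama")}
K8S_WIKIS: set[str] = set()  # e.g. {"test.wikipedia.org"} in a later phase

def pick_backend(host: str, path: str) -> str:
    """Return the backend that should serve this request."""
    if host in K8S_WIKIS or (host, path) in K8S_URLS:
        return K8S_BACKEND
    return BAREMETAL_BACKEND

assert pick_backend("en.wikipedia.org", "/wiki/Barack_Obama") == K8S_BACKEND
assert pick_backend("en.wikipedia.org", "/wiki/Main_Page") == BAREMETAL_BACKEND
```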

Pros

  • No complex and dangerous VCL and Lua changes
  • The edge cache will not be polluted, since we will always have the k8s-rendered article
  • Easy edge cache invalidation (single pages or entire wikis)

Cons

  • Less control over the amount of traffic served
  • We won't be able to offer a beta feature
  • Longer rollout strategy
  • Slightly more complex rollbacks (traffic-layer change + edge cache invalidation)

Beta users

In parser cache we have the ability to specify a key prefix and TTL for specific objects. Additionally, logged-in users bypass the caches in our traffic layer. That being said, we could possibly have beta users:

  • A user has a special cookie indicating they are part of the k8s beta
  • When a server stores a page in parser cache (and, in turn, in memcached), it uses a key prefix and a shorter TTL (cache slotting/Vary); see the sketch after this list
  • Beta users can always compare a page by simply opening it as an anonymous user
  • Beta users are more likely to report problems
  • We can run this for as long as we want
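
A minimal sketch of the parser cache slotting idea, assuming a hypothetical key prefix and made-up TTL values (the real logic would live in MediaWiki's parser cache layer):

```python
# Sketch of parser cache slotting for beta users; prefix and TTLs are assumptions.
DEFAULT_TTL = 30 * 24 * 3600   # normal parser cache TTL (made-up value)
BETA_TTL = 24 * 3600           # shorter TTL for beta/k8s-rendered entries (made-up value)
BETA_PREFIX = "mw-k8s-beta"    # hypothetical prefix

def parsercache_key(page_key: str, is_beta_user: bool) -> tuple[str, int]:
    """Return the (key, ttl) pair used to store a rendered page.

    Beta users get a prefixed key and a shorter TTL, so their k8s-rendered
    output is slotted away from regular parser cache entries.
    """
    if is_beta_user:
        return f"{BETA_PREFIX}:{page_key}", BETA_TTL
    return page_key, DEFAULT_TTL

print(parsercache_key("enwiki:pcache:idhash:12345", True))   # prefixed key, short TTL
print(parsercache_key("enwiki:pcache:idhash:12345", False))  # regular key, default TTL
```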

Pros

  • No edge or parser cache pollution (pages rendered by k8s mixed with pages rendered by bare-metal servers)
  • User reports

Cons

  • Browsing will be considerably slower for beta users; we could consider adjusting the TTL a bit

Rollout Example

  1. X-Wikimedia-Debug
  2. Beta users/parsercache slotting
  3. Low-traffic URLs
  4. Low-traffic wikis from group0
  5. Some group1 wikis
  6. Parsoid (?)
  7. All wikis except enwiki
  8. enwiki (Fin)

Note: Running jobs, timers, and standalone scripts are going to be approached differently

Proposal #2: Use a k8s cookie

Users with the cookie will be routed to k8s, and will have their own cache in the traffic layer (Varnish + ATS). This is similar to how we rolled out PHP7; the difference is that previously, routing and cookie setting took place within the application layer, while now we have to do this in the traffic layer.
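
A minimal sketch of the routing decision, assuming a hypothetical cookie name and opt-in percentage, and reusing the mw-k8s-rw.discovery.wmnet name from proposal #1; in production this would be VCL/Lua at the edge, not Python:

```python
# Sketch of cookie-based routing; cookie name and percentage are assumptions.
import random

K8S_COOKIE = "mw-k8s"       # hypothetical cookie name
K8S_TRAFFIC_PERCENT = 1.0   # fraction of anonymous users to opt in (made-up value)

def route_request(cookies: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Return (backend, cookies_to_set) for an incoming anonymous request."""
    if cookies.get(K8S_COOKIE) == "1":
        return "mw-k8s-rw.discovery.wmnet", {}
    # Probabilistically opt users in, mirroring the PHP7 rollout approach.
    if random.random() * 100 < K8S_TRAFFIC_PERCENT:
        return "mw-k8s-rw.discovery.wmnet", {K8S_COOKIE: "1"}
    return "appservers-rw.discovery.wmnet", {}  # assumed name for the bare-metal backend
```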

Pros

  • We have previous experience with this kind of rollout
  • Beta users
  • Better control over the amount of traffic served
  • Easier to roll back (?)

Cons

  • Complex VCL and Lua changes for edge cache slotting (not enough test coverage there)
  • Edge cache invalidation issue (i.e. how do we invalidate only the k8s-rendered cache at the edge?)
  • Where will we decide whether an anonymous user should get the k8s cookie or not?
  • Traffic would like to avoid this solution

Proposal #3: Per cluster rollout

We can create Kubernetes services to serve some (initially internal) traffic, and then do a per-cluster migration. For instance, we could create an api-internal-r{w,o}.discovery.wmnet service, and then start moving services over to it.
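
A minimal sketch of what the per-service migration amounts to, with hypothetical service names; in production the switch would be an envoy upstream change rather than application code:

```python
# Sketch of per-service migration to the internal API cluster; entries are illustrative.
MW_API_UPSTREAM = {
    "service-a": "api-rw.discovery.wmnet",           # still pointing at bare metal
    "service-b": "api-internal-rw.discovery.wmnet",  # migrated to the k8s internal cluster
}

def mw_api_host(service: str) -> str:
    """Return the MediaWiki API host a given service should talk to."""
    return MW_API_UPSTREAM[service]

# Migrating (or rolling back) a service is a one-entry change per service,
# which is also why rolling everything back at once takes a redeploy per service.
```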

Roll out phase 1: Migrate low traffic wikis to Kubernetes

After discussions, serviceops has decided to mix and match ideas from the above proposals.

  • Wikitech is ideal for dogfooding mw-on-k8s; there are challenges, though, that we need to overcome: T292707
  • Serve testwiki from mwdebug or a new service

Event Timeline

akosiaris triaged this task as Medium priority. Sep 9 2021, 12:56 PM
jijiki updated the task description.
jijiki updated the task description.
jijiki added a subscriber: Krinkle.
dancy updated the task description.

I have some alternative ideas. Specifically, right now we have a limited number of different clusters, due to the complexity of correctly sizing such clusters on bare metal and the complications coming from the fact that switching a server between clusters basically meant a reimage.

Kubernetes removes most of these limitations, and I think we should move away from the current appserver/api split to a more structured approach. This might also help with the migration.

First of all, I'd like to separate the traffic coming from internal requests to the MediaWiki APIs from the external API traffic. This should allow us to more easily sacrifice external API traffic when we're in an overload situation, without sacrificing the internal traffic as well.

We could thus start with migrating the internal traffic first, starting with parsoid and the internal API traffic. It will be enough to change the pointer in envoy to api-internal-r{w,o}.discovery.wmnet to move each service to the new internal API cluster.

Likewise, we can progressively move external API traffic to api-external-rw.discovery.wmnet, and a fraction of the production traffic for the wikis to wiki-rw.discovery.wmnet, both clusters that we'll build on Kubernetes.

Once we've done this transition, we'll have acquired enough experience with the operations and difficulties of migrating traffic to kubernetes to allow us to move the rest of the traffic.

I am a bit doubtful that we really need to go through the usual multi-stage migration with beta users, splitting frontend caches, etc. in this case as we're not going to introduce new functionality, but I'm open to opposing opinions. I don't think routing a single URL (scenario 1) would really benefit us much if we go down this path.

I forgot to add: offering the beta feature would be nice, and given it only regards logged-in users, it would not need a split of caches on the edge (but please correct me if I'm wrong). That would probably need to use a cookie-based routing approach, but is much easier to implement than a full cache split.

> I have some alternative ideas. Specifically, right now we have a limited number of different clusters, due to the complexity of correctly sizing such clusters on bare metal and the complications coming from the fact that switching a server between clusters basically meant a reimage.
>
> Kubernetes removes most of these limitations, and I think we should move away from the current appserver/api split to a more structured approach. This might also help with the migration.
>
> First of all, I'd like to separate the traffic coming from internal requests to the MediaWiki APIs from the external API traffic. This should allow us to more easily sacrifice external API traffic when we're in an overload situation, without sacrificing the internal traffic as well.
>
> We could thus start with migrating the internal traffic first, starting with parsoid and the internal API traffic. It will be enough to change the pointer in envoy to api-internal-r{w,o}.discovery.wmnet to move each service to the new internal API cluster.

I like the idea of dogfooding; api-internal-ro.discovery.wmnet is definitely a good start. My concern is that if we have migrated services one by one and then, for any emergency reason, we want to temporarily switch them all back to api-r{w,o}, it will take a considerable amount of time (redeploying every service that uses api-internal-ro.discovery.wmnet). Please correct me if I am missing something.

> Likewise, we can progressively move external API traffic to api-external-rw.discovery.wmnet, and a fraction of the production traffic for the wikis to wiki-rw.discovery.wmnet, both clusters that we'll build on Kubernetes.

What are our options for sending a fraction of the prod traffic?

> Once we've done this transition, we'll have acquired enough experience with the operations and difficulties of migrating traffic to kubernetes to allow us to move the rest of the traffic.
>
> I am a bit doubtful that we really need to go through the usual multi-stage migration with beta users, splitting frontend caches, etc. in this case as we're not going to introduce new functionality, but I'm open to opposing opinions. I don't think routing a single URL (scenario 1) would really benefit us much if we go down this path.

Maybe single URLs is a bit too much, but I think there is value in sending some small wikis, and later one of the larger ones, before directing all traffic. Users are good at finding broken stuff, and the blast radius is very much under control.

> I forgot to add: offering the beta feature would be nice, and given it only regards logged-in users, it would not need a split of caches on the edge (but please correct me if I'm wrong). That would probably need to use a cookie-based routing approach, but is much easier to implement than a full cache split.

I agree that beta users are something we should do, and yes, I do not see any reason to slot our external caches. Which leads to the other question: should we slot parser cache or not?

Generally, I think we should initially start with beta users + switching services to the internal API, and then revisit our next steps.

PS: I will update the task description as we work out the details of this rollout plan.

> I have some alternative ideas. Specifically, right now we have a limited number of different clusters, due to the complexity of correctly sizing such clusters on bare metal and the complications coming from the fact that switching a server between clusters basically meant a reimage.
>
> Kubernetes removes most of these limitations, and I think we should move away from the current appserver/api split to a more structured approach. This might also help with the migration.
>
> First of all, I'd like to separate the traffic coming from internal requests to the MediaWiki APIs from the external API traffic. This should allow us to more easily sacrifice external API traffic when we're in an overload situation, without sacrificing the internal traffic as well.
>
> We could thus start with migrating the internal traffic first, starting with parsoid and the internal API traffic. It will be enough to change the pointer in envoy to api-internal-r{w,o}.discovery.wmnet to move each service to the new internal API cluster.
>
> I like the idea of dogfooding; api-internal-ro.discovery.wmnet is definitely a good start. My concern is that if we have migrated services one by one and then, for any emergency reason, we want to temporarily switch them all back to api-r{w,o}, it will take a considerable amount of time (redeploying every service that uses api-internal-ro.discovery.wmnet). Please correct me if I am missing something.
>
> Likewise, we can progressively move external API traffic to api-external-rw.discovery.wmnet, and a fraction of the production traffic for the wikis to wiki-rw.discovery.wmnet, both clusters that we'll build on Kubernetes.

Similarly, we can split jobrunners vs videoscalers functionally again. We are also no longer limited to 1 instance of mwdebug (although exposing them might be a bit more involved), but can have as many as we want.

> What are our options for sending a fraction of the prod traffic?

FWIW, the cookie approach would be my preferred one. I see more pros and I am not sure all of the cons necessarily apply. For example, why do we need to split the caches? Also, I am not sure I have fully understood "how do we invalidate k8s rendered cache?". Which cache does it refer to?

> Once we've done this transition, we'll have acquired enough experience with the operations and difficulties of migrating traffic to kubernetes to allow us to move the rest of the traffic.
>
> I am a bit doubtful that we really need to go through the usual multi-stage migration with beta users, splitting frontend caches, etc. in this case as we're not going to introduce new functionality, but I'm open to opposing opinions. I don't think routing a single URL (scenario 1) would really benefit us much if we go down this path.
>
> Maybe single URLs is a bit too much, but I think there is value in sending some small wikis, and later one of the larger ones, before directing all traffic. Users are good at finding broken stuff, and the blast radius is very much under control.

If there is value in it, it's rather small IMHO. If anything, I fear it's more psychological. I find more value in sending a small part of the traffic of a large wiki than all the traffic of a small wiki, where it might take days for a report to surface to us. Language-based criteria (on top of size, e.g. which small Wikipedia is OK to move? Is it easy for editors of that Wikipedia to report to us, or is language a barrier?) would especially make this extra difficult.

> I forgot to add: offering the beta feature would be nice, and given it only regards logged-in users, it would not need a split of caches on the edge (but please correct me if I'm wrong). That would probably need to use a cookie-based routing approach, but is much easier to implement than a full cache split.
>
> I agree that beta users are something we should do, and yes, I do not see any reason to slot our external caches. Which leads to the other question: should we slot parser cache or not?

Do we expect the output of bare metal vs k8s to be different? We use the same versions for everything, don't we? Is there some exception?

> Generally, I think we should initially start with beta users + switching services to the internal API, and then revisit our next steps.

+1

> PS: I will update the task description as we work out the details of this rollout plan.

> We could thus start with migrating the internal traffic first, starting with parsoid and the internal API traffic. It will be enough to change the pointer in envoy to api-internal-r{w,o}.discovery.wmnet to move each service to the new internal API cluster.
>
> I like the idea of dogfooding; api-internal-ro.discovery.wmnet is definitely a good start. My concern is that if we have migrated services one by one and then, for any emergency reason, we want to temporarily switch them all back to api-r{w,o}, it will take a considerable amount of time (redeploying every service that uses api-internal-ro.discovery.wmnet). Please correct me if I am missing something.
>
> Likewise, we can progressively move external API traffic to api-external-rw.discovery.wmnet, and a fraction of the production traffic for the wikis to wiki-rw.discovery.wmnet, both clusters that we'll build on Kubernetes.
>
> Similarly, we can split jobrunners vs videoscalers functionally again. We are also no longer limited to 1 instance of mwdebug (although exposing them might be a bit more involved), but can have as many as we want.

That is a good idea; I started a different task to discuss our options for partitioning our MediaWiki servers: T291918

> What are our options for sending a fraction of the prod traffic?
>
> FWIW, the cookie approach would be my preferred one. I see more pros and I am not sure all of the cons necessarily apply. For example, why do we need to split the caches? Also, I am not sure I have fully understood "how do we invalidate k8s rendered cache?". Which cache does it refer to?

My question on the cookie approach actually was: in order to control the amount of traffic towards k8s, are we going to use the same mechanism we used before (HHVM->PHP7) to decide whether a user will get the k8s cookie or not (thus giving us control over how big "a fraction" of the traffic is)? Because the alternative would be what's in proposal 1: redirect only some wikis to k8s.

Regarding "invalidate k8s rendered cache": it refers to edge caches; sorry about this, I updated the description to avoid confusion.

> Once we've done this transition, we'll have acquired enough experience with the operations and difficulties of migrating traffic to kubernetes to allow us to move the rest of the traffic.
>
> I am a bit doubtful that we really need to go through the usual multi-stage migration with beta users, splitting frontend caches, etc. in this case as we're not going to introduce new functionality, but I'm open to opposing opinions. I don't think routing a single URL (scenario 1) would really benefit us much if we go down this path.
>
> Maybe single URLs is a bit too much, but I think there is value in sending some small wikis, and later one of the larger ones, before directing all traffic. Users are good at finding broken stuff, and the blast radius is very much under control.
>
> If there is value in it, it's rather small IMHO. If anything, I fear it's more psychological. I find more value in sending a small part of the traffic of a large wiki than all the traffic of a small wiki, where it might take days for a report to surface to us. Language-based criteria (on top of size, e.g. which small Wikipedia is OK to move? Is it easy for editors of that Wikipedia to report to us, or is language a barrier?) would especially make this extra difficult.
>
> I forgot to add: offering the beta feature would be nice, and given it only regards logged-in users, it would not need a split of caches on the edge (but please correct me if I'm wrong). That would probably need to use a cookie-based routing approach, but is much easier to implement than a full cache split.
>
> I agree that beta users are something we should do, and yes, I do not see any reason to slot our external caches. Which leads to the other question: should we slot parser cache or not?
>
> Do we expect the output of bare metal vs k8s to be different? We use the same versions for everything, don't we? Is there some exception?

I admit that maybe it would be too much to do cache slotting, since we do not expect many differences. I was just worrying whether there are corner cases we have not thought about at all.

> Generally, I think we should initially start with beta users + switching services to the internal API, and then revisit our next steps.
>
> +1

🎉

> That is a good idea; I started a different task to discuss our options for partitioning our MediaWiki servers: T291918

Thanks

> What are our options for sending a fraction of the prod traffic?
>
> FWIW, the cookie approach would be my preferred one. I see more pros and I am not sure all of the cons necessarily apply. For example, why do we need to split the caches? Also, I am not sure I have fully understood "how do we invalidate k8s rendered cache?". Which cache does it refer to?
>
> My question on the cookie approach actually was: in order to control the amount of traffic towards k8s, are we going to use the same mechanism we used before (HHVM->PHP7) to decide whether a user will get the k8s cookie or not (thus giving us control over how big "a fraction" of the traffic is)? Because the alternative would be what's in proposal 1: redirect only some wikis to k8s.

That's currently my preferred way because it's deterministic (got the cookie? go to k8s!).

We could also do the LVS dance, of course: add a number of Kubernetes nodes to the clusters and start shifting weights. We've done that in the past and it kinda works. The problem is that a) it's not deterministic, and b) its "kinda" nature, due to envoy persistent connections messing up the scheme.

If we want to go the random, non-persistent way, there is also the option of having ATS do the balancing, which would avoid the persistent-connections issue, but as far as I know we don't currently have support for it in puppet. I doubt Traffic would be thrilled to add it, plus it duplicates what we do with LVS.
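
For illustration, a minimal sketch of the weight-shifting idea (weights and the bare-metal pool name are made up; in practice envoy's persistent connections make the actual split only approximate, as noted above):

```python
# Sketch of weighted backend selection, the kind of balancing LVS (or ATS) would do.
import random

BACKEND_WEIGHTS = {
    "appservers-rw.discovery.wmnet": 95,  # assumed name for the bare-metal pool
    "mw-k8s-rw.discovery.wmnet": 5,       # k8s pool, gradually weighted up
}

def pick_weighted_backend() -> str:
    """Pick a backend with probability proportional to its weight."""
    hosts, weights = zip(*BACKEND_WEIGHTS.items())
    return random.choices(hosts, weights=weights, k=1)[0]
```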

Regarding "invalidate k8s rendered cache" refers to edge caches, sorry about this, I updated the description to avoid confusion.

Thanks. I think the answer is the same way we invalidate edge caches currently.

> Do we expect the output of bare metal vs k8s to be different? We use the same versions for everything, don't we? Is there some exception?
>
> I admit that maybe it would be too much to do cache slotting, since we do not expect many differences.

It would be interesting to see if there will be ANY differences in fact once we have cleared k8s functionally for production.

> I was just worrying whether there are corner cases we have not thought about at all.

If we do find something, we can always purge those URLs at the edge caches though.

> Generally, I think we should initially start with beta users + switching services to the internal API, and then revisit our next steps.
>
> +1

🎉

> That's currently my preferred way because it's deterministic (got the cookie? go to k8s!).
>
> We could also do the LVS dance, of course: add a number of Kubernetes nodes to the clusters and start shifting weights. We've done that in the past and it kinda works. The problem is that a) it's not deterministic, and b) its "kinda" nature, due to envoy persistent connections messing up the scheme.
>
> If we want to go the random, non-persistent way, there is also the option of having ATS do the balancing, which would avoid the persistent-connections issue, but as far as I know we don't currently have support for it in puppet. I doubt Traffic would be thrilled to add it, plus it duplicates what we do with LVS.
>
> Regarding "invalidate k8s rendered cache": it refers to edge caches; sorry about this, I updated the description to avoid confusion.
>
> Thanks. I think the answer is the same way we invalidate edge caches currently.

That is my reservation about giving anonymous users a cookie: given we don't want to slot our edge caches, we won't be able to know which URLs to invalidate (apart from what is reported to us, of course). If we go down this path, we will want to have a damage-control plan. Serving certain URLs and wikis only via k8s sets some very specific boundaries on the blast radius.

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.
This comment was removed by jijiki.