As we get closer to a fully functional #mw-on-k8s image, we can start discussing our options for testing in production and rolling out.
**(Task description will be updated as we are figuring out our next steps)**
=Background=
**History**
When we migrated to PHP7, users were served by PHP7 based on the existence of a cookie, set by MediaWiki, and the X-Seven header, set at the traffic layer. At the application layer, Apache would route a request to the relevant backend, HHVM or php-fpm.
Having a cookie allowed us to first let beta users in, and then progressively increase the amount of anonymous user traffic served via PHP7. We continued by converting API servers to `php7_only` servers, and finally converted all jobrunners. Additionally, we split the outer caching layer into PHP7-rendered pages and HHVM-rendered pages (vary-slotting). Note, though, that **we did not do this for parsercache**. At the time we were only using Varnish, so all this logic was written in VCL, with some additional Apache config.
**Now**
This time, the migration is slightly different:
* The caching layer consists of Varnish and ATS (VCL and Lua)
* The decision of where to route an incoming request will be made at the caching layer
* We have 4 mediawiki clusters: api, app, jobrunners, and parsoid
* we are older
= Proposed Plans =
After a brief discussion with #traffic and #performance-team, we have the following proposals:
==Proposal #1: URL routing==
Given that app and api servers share the same configuration, and assuming that initially we will have a single discovery URL, e.g. mw-k8s-rw.discovery.wmnet, we can start by routing some low-traffic URLs to Kubernetes, for example https://en.wikipedia.org/wiki/Barack_Obama. Once we are more comfortable, we can start migrating some small wikis, and eventually migrate them all.
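As a rough sketch, the traffic-layer side of this could be a small VCL match-and-route rule. Note that the backend name `mw_k8s_rw` and the exact matching logic are illustrative assumptions, not our actual configuration:

```vcl
sub vcl_recv {
    # Hypothetical: route a hand-picked low-traffic article to the
    # mw-on-k8s backend; everything else keeps the default backend.
    if (req.http.Host == "en.wikipedia.org" &&
        req.url == "/wiki/Barack_Obama") {
        # Assumed backend definition pointing at mw-k8s-rw.discovery.wmnet
        set req.backend_hint = mw_k8s_rw;
    }
}
```

Because routing is keyed purely on Host + URL, invalidating or rolling back a single page (or, with a prefix match, a whole wiki) only requires removing the rule and purging the affected URLs.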
**Pros**
* No complex and dangerous VCL and Lua changes
* Edge cache will not be polluted, since a routed article will always be the k8s-rendered one
* Easy edge cache invalidation (single pages or entire wikis)
**Cons**
* Less control over traffic served
* ~~Won't be able to create a beta feature~~
* Longer rollout
* Slightly more complex rollbacks (traffic-layer change + edge cache invalidation)
=== Beta users ===
In parsercache we have the ability to specify a key prefix and TTL for specific objects. Additionally, logged-in users bypass the caches in our traffic layer. Given that, we could possibly support beta users:
* A user has a special cookie indicating they are part of the k8s beta
* When a server is storing in parsercache (and in turn, in memcached), it uses a key prefix and a shorter TTL (cache slotting/Vary)
* Beta users can always compare a page by simply opening it as an anonymous user
* Beta users are more likely to report problems.
* We can run this for as long as we want
**Pros**
* No edge or parser cache pollution (pages rendered by k8s mixed with pages rendered by baremetal servers)
* User reports
**Cons**
* Browsing will be considerably slower for beta users; we could consider adjusting the TTL a bit
**Rollout Example**
# ~~X-Wikimedia-Debug~~
# Beta users/parsercache slotting
# Low-traffic URLs
# Low-traffic wikis from group0
# Some group1 wikis
# Parsoid (?)
# All wikis except enwiki
# enwiki (Fin)
**Note:** Jobs, timers, and standalone scripts will be approached differently
==Proposal #2: Use a k8s cookie==
Users with the cookie will be routed to k8s, and will have their own cache in the traffic layer (Varnish + ATS). This is similar to how we rolled out PHP7; the difference is that previously routing and cookie setting took place in the application layer, whereas now we would have to do this in the traffic layer.
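A minimal sketch of the traffic-layer logic this would require, assuming a hypothetical `mw-k8s-beta` cookie name and an assumed `mw_k8s_rw` backend (neither is decided):

```vcl
sub vcl_recv {
    # Hypothetical cookie, set for beta users somewhere upstream.
    if (req.http.Cookie ~ "(^|;\s*)mw-k8s-beta=1") {
        set req.backend_hint = mw_k8s_rw;
        # Normalized internal marker header used for cache slotting below.
        set req.http.X-MW-K8s = "1";
    } else {
        set req.http.X-MW-K8s = "0";
    }
}

sub vcl_backend_response {
    # Keep k8s-rendered and baremetal-rendered objects in separate
    # cache slots by varying on the internal marker header.
    if (beresp.http.Vary) {
        set beresp.http.Vary = beresp.http.Vary + ", X-MW-K8s";
    } else {
        set beresp.http.Vary = "X-MW-K8s";
    }
}
```

The sketch illustrates why invalidation gets awkward: cached objects for both origins share URLs and differ only in the varied header, so purging "only the k8s-rendered cache" is not a simple URL purge.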
**Pros**
* We have previous experience in rolling out
* Beta users
* Better control over the amount of traffic served
* Easier to roll back (?)
**Cons**
* Complex VCL and Lua changes for edge cache slotting (not enough test coverage there)
* Edge cache invalidation issues (i.e. how do we invalidate only the k8s-rendered cache at the edge?)
* Where will we calculate if an anonymous user should get the k8s cookie or not?
* #Traffic would like to avoid this solution
==Proposal #3: Per cluster rollout==
We can create Kubernetes services to serve some (initially internal) traffic, and then do a per-cluster migration. For instance, we could create an `api-internal-r{w,o}.discovery.wmnet` service, and then move services over to it one by one.
= Roll out phase 1: Migrate low traffic wikis to Kubernetes=
After discussions, #serviceops has decided to mix and match ideas from the above proposals.
[x] Serve test2.wikipedia.org from k8s
[ ] T292707 **Wikitech** is ideal for dogfooding mw-on-k8s, though there are challenges we need to overcome