As we get closer to a fully functional #mw-on-k8s image, we can start discussing our options for testing in production and rolling out.
==Background==
**History**
When we migrated to PHP7, users were served by PHP7 based on the existence of a cookie, set by MediaWiki, and the X-Seven header, set at the traffic layer. On the application layer, Apache would route a request to the relevant backend, HHVM or php-fpm.
Having a cookie allowed us to first let beta users in, and then progressively increase the share of anonymous user traffic served via PHP7. We then continued by converting API servers to `php7_only` servers, and finally converted all jobrunners. Additionally, we split the outer caching layer into PHP7-rendered pages and HHVM-rendered pages (vary-slotting). Note that **we did not do this for parsercache**. At the time we were only using Varnish, so all of this logic was written in VCL, with some additional Apache configuration.
**Now**
This time the migration is slightly different:
* The caching layer consists of Varnish and ATS (VCL and Lua)
* The decision of where to route an incoming request will be made at the caching layer
* We have 4 mediawiki clusters: api, app, jobrunners, and parsoid
* We are older
After a brief discussion with #traffic and #performance-team, we have:
==Proposal #1: URL routing==
Given that app and api servers share the same configuration, and assuming that initially we will have a single discovery URL, e.g. mw-k8s-rw.discovery.wmnet,
we can start by routing some low-traffic URLs to Kubernetes, for example https://en.wikipedia.org/wiki/Barack_Obama. When we are more comfortable, we can start migrating some small wikis, and eventually migrate them all.
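As a rough sketch of what the traffic-layer change could look like (the backend name and the hard-coded URL are hypothetical; the real change would live in our puppetized VCL):

```vcl
sub vcl_recv {
    # Hypothetical: route a hand-picked set of low-traffic URLs to the
    # mw-on-k8s discovery endpoint; everything else stays on bare metal.
    if (req.http.Host == "en.wikipedia.org" &&
        req.url == "/wiki/Barack_Obama") {
        set req.backend_hint = mw_k8s_rw;  # mw-k8s-rw.discovery.wmnet
    }
}
```

Because routing is keyed purely on Host/URL, a given page is always rendered by the same backend, so no cache slotting (Vary) is needed and invalidation is an ordinary PURGE of the affected URLs.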
**Pros**
* No complex and dangerous VCL and Lua changes
* The cache will not be polluted, since a given URL will always be served the k8s-rendered article
* Easy cache invalidation (single pages or entire wikis)
**Cons**
* Less control over traffic served
* ~~Won't be able to create a beta feature~~
* Longer rollout
* Slightly more complex rollbacks (traffic-layer change + cache invalidation)
=== Beta users ===
Parsercache gives us the ability to specify a key prefix and TTL for specific objects. Additionally, logged-in users bypass the caches in our traffic layer. Given that, we could support beta users as follows:
* A user has a special cookie indicating they are part of the k8s beta
* When a server stores an object in parsercache (and, in turn, in memcached), it uses a key prefix and a shorter TTL (cache slotting/Vary)
* Beta users can always compare a page by simply opening it as an anonymous user
* Beta users are more likely to report problems.
* We can run this for as long as we want
**Pros**
* No cache pollution (k8s-rendered pages are not mixed with pages rendered by bare-metal servers)
* User reports
**Cons**
* Browsing will be considerably slower for beta users; we could consider tuning the TTL to mitigate this
=== Rollout ===
# ~~X-Wikimedia-Debug~~
# Beta users/parsercache slotting
# Low-traffic URLs
# Low-traffic wikis from group0
# Some group1 wikis
# Parsoid (?)
# All wikis except enwiki
# enwiki (Fin)
**Note:** Jobs, timers, and standalone scripts will be approached differently
==Proposal #2: Use a k8s cookie==
Users with the cookie will be routed to k8s, and will have their own cache in the traffic layer (varnish+ATS). This is similar to how we rolled out PHP7, the difference is that previously, routing and cookie setting took place within the application layer, while now we have to do this in the traffic layer.
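A minimal sketch of what the cookie-based variant might look like on the Varnish side (the cookie name, the internal header, and the backend name are all hypothetical; a matching change would also be needed in the ATS Lua layer):

```vcl
sub vcl_recv {
    # Hypothetical cookie marking a user as part of the k8s beta.
    if (req.http.Cookie ~ "mw-k8s-beta=1") {
        set req.http.X-MW-K8s = "1";       # internal routing header
        set req.backend_hint = mw_k8s_rw;  # mw-k8s-rw.discovery.wmnet
    } else {
        set req.http.X-MW-K8s = "0";
    }
}

sub vcl_backend_response {
    # Slot the cache on the routing header so that k8s-rendered and
    # bare-metal-rendered objects never mix (analogous to the
    # vary-slotting we used for the PHP7 migration).
    if (beresp.http.Vary) {
        set beresp.http.Vary = beresp.http.Vary + ", X-MW-K8s";
    } else {
        set beresp.http.Vary = "X-MW-K8s";
    }
}
```

The Vary slotting is what makes the cache invalidation question below non-trivial: every slotted URL now has two cached variants to purge.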
**Pros**
* We have previous experience in rolling out
* Beta users
* Better control over amount of traffic served
* Easier to roll back (?)
**Cons**
* Complex VCL and Lua changes for cache slotting (not enough test coverage there)
* Cache invalidation issues (i.e. how do we invalidate the k8s-rendered cache?)
* Where will we calculate if an anonymous user should get the k8s cookie or not?
* #Traffic would like to avoid this solution