
Serve production traffic via Kubernetes
Open, MediumPublic

Assigned To
None
Authored By
jijiki
Sep 8 2021, 1:48 AM

Description

As we get closer to a fully functional MW-on-K8s image, we can start discussing our options for testing in production and rolling out.
(The task description will be updated as we figure out our next steps.)

Background

History
When we migrated to PHP7, users were served by PHP7 based on the existence of a cookie, set by MediaWiki, and the X-Seven header, set at the traffic layer. At the application layer, Apache would route a request to the relevant backend, HHVM or php-fpm.

Having a cookie allowed us to first let beta users in, and then progressively increase the amount of anonymous user traffic served via PHP7. We continued by converting API servers to php7_only servers, and finally converted all jobrunners. Additionally, we split the outer caching layer into PHP7-rendered pages and HHVM-rendered pages (vary-slotting); note, though, that we did not do this for parsercache. At the time we were only using Varnish, so all this logic was written in VCL, with some additional Apache config.
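The cookie-and-header routing and vary-slotting described above can be sketched as follows. This is an illustrative reconstruction, not the real VCL/Apache logic; the cookie name PHP_ENGINE and the cache-key scheme are assumptions, while the X-Seven header comes from the text:

```python
def choose_backend(cookies: dict, headers: dict) -> str:
    """Route to php-fpm if the opt-in cookie or the X-Seven header is present."""
    # PHP_ENGINE cookie name is an assumption for illustration
    if cookies.get("PHP_ENGINE") == "php7" or headers.get("X-Seven") == "true":
        return "php-fpm"
    return "hhvm"

def edge_cache_key(url: str, backend: str) -> str:
    """Vary-slotting: keep PHP7 and HHVM renders in separate cache slots."""
    return f"{backend}:{url}"
```

The key point is that the cache key varies on the backend, so a PHP7-rendered page can never be served from cache to an HHVM user or vice versa.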

Now
This time, the migration is slightly different:

  • Caching layer consists of Varnish and ATS (VCL and LUA)
  • Decision of where to route an incoming request will be taken at the caching layer
  • We have 4 mediawiki clusters: api, app, jobrunners, and parsoid
  • we are older

Proposed Plans

After a brief discussion with Traffic and Performance-Team, we have:

Proposal #1: URL routing

Given that app and api servers share the same configuration, and assuming that initially we will have the same discovery URL, e.g. mw-k8s-rw.discovery.wmnet,
we can start by routing some low-traffic URLs to Kubernetes, for example https://en.wikipedia.org/wiki/Barack_Obama. When we are more comfortable, we can start migrating some small wikis, and eventually migrate them all.

Pros

  • No complex and dangerous VCL and LUA changes
  • Edge cache will not be polluted since we will always have the k8s rendered article
  • Easy edge cache invalidation (single pages or entire wikis)

Cons

  • Less control over traffic served
  • Won't be able to create a beta feature
  • Longer rolling out strategy
  • Slightly complex rollbacks (traffic layer change + edge cache invalidation)
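The URL routing in Proposal #1 amounts to an allowlist lookup at the edge. A minimal sketch, where the k8s endpoint name follows the example in the text and the baremetal endpoint name and allowlist contents are hypothetical:

```python
K8S_BACKEND = "mw-k8s-rw.discovery.wmnet"
BAREMETAL_BACKEND = "appservers-rw.discovery.wmnet"  # assumed name

# Start with individual low-traffic URLs, later graduate whole wikis
K8S_URLS = {("en.wikipedia.org", "/wiki/Barack_Obama")}
K8S_WIKIS = {"test.wikipedia.org"}

def route(host: str, path: str) -> str:
    """Pick a backend for a request based on the current allowlists."""
    if host in K8S_WIKIS or (host, path) in K8S_URLS:
        return K8S_BACKEND
    return BAREMETAL_BACKEND
```

Because the routing is deterministic per URL, the edge cache holds exactly one render of each page, which is why this proposal avoids cache pollution and makes invalidation per page or per wiki straightforward.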

Beta users

In parsercache we have the ability to specify a key prefix and TTL for specific objects. Additionally, logged-in users bypass the caches in our traffic layer. That being said, we could possibly have beta users:

  • A user has a special cookie indicating they are part of the k8s beta
  • When a server stores in parsercache (and, in turn, in memcached), it uses a key prefix and a shorter TTL (cache slotting/Vary)
  • Beta users can always compare a page by simply opening it as an anonymous user
  • Beta users are more likely to report problems.
  • We can run this for as long as we want
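The parsercache slotting above can be sketched as follows; the "k8s:" prefix and the TTL values are hypothetical, chosen only to illustrate the shorter-lived beta slot:

```python
DEFAULT_TTL = 30 * 24 * 3600   # assumed normal parsercache TTL (30 days)
BETA_TTL = 24 * 3600           # assumed shorter TTL for beta-slotted entries

def parsercache_entry(key: str, is_beta: bool) -> tuple:
    """Return the (key, ttl) to use when storing a parsercache object.

    Beta (k8s-rendered) entries get a distinct prefix so they can never
    collide with baremetal-rendered entries, and a shorter TTL so they
    expire quickly once the beta ends.
    """
    if is_beta:
        return ("k8s:" + key, BETA_TTL)
    return (key, DEFAULT_TTL)
```

The shorter TTL is also the source of the con listed below: beta users re-render pages more often, so browsing is slower for them.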

Pros

  • No edge or parser cache pollution (k8s-rendered pages mixed with pages rendered by baremetal servers)
  • User reports

Cons

  • Browsing will be considerably slower for beta users; we could consider fiddling with the TTL a bit

Rollout Example

  1. X-Wikimedia-Debug
  2. Beta users/parsercache slotting
  3. Low traffic urls
  4. Low traffic wikis from group0
  5. Some group1 wikis
  6. Parsoid (?)
  7. All wikis except enwiki
  8. enwiki (Fin)

Note: jobs, timers, and standalone scripts are going to be approached differently

Proposal #2: Use a k8s cookie

Users with the cookie will be routed to k8s, and will have their own cache in the traffic layer (Varnish+ATS). This is similar to how we rolled out PHP7; the difference is that previously, routing and cookie setting took place in the application layer, whereas now we have to do this in the traffic layer.

Pros

  • We have previous experience in rolling out
  • Beta users
  • Better control over amount of traffic served
  • Easier to roll back (?)

Cons

  • Complex VCL and LUA changes for edge cache slotting (not enough test coverage there)
  • Edge cache invalidation issue (ie how do we invalidate only k8s rendered cache from the edge?)
  • Where will we calculate if an anonymous user should get the k8s cookie or not?
  • Traffic would like to avoid this solution
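For illustration, the decision Proposal #2 would push into the traffic layer can be sketched as follows (the mw-k8s cookie name and both endpoint names are assumptions; the real change would be VCL and LUA, not Python):

```python
def edge_decision(cookies: dict, url: str) -> tuple:
    """Return (backend, cache_key) for a request at the edge.

    Cookie holders go to k8s and get their own cache slot, mirroring
    the PHP7-era vary-slotting but implemented at the traffic layer.
    """
    if cookies.get("mw-k8s") == "1":  # hypothetical opt-in cookie
        return ("mw-k8s-rw.discovery.wmnet", "k8s:" + url)
    return ("appservers-rw.discovery.wmnet", "baremetal:" + url)
```

The invalidation con above follows directly from this sketch: purging only the k8s-rendered copies means purging only one key prefix, which our existing purge path does not know how to do.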

Proposal #3: Per cluster rollout (winner)

We can create Kubernetes services to serve some (initially internal) traffic, and then do a per-cluster migration. For instance, we could create an api-internal-r{w,o}.discovery.wmnet service, and then start moving services to use it.
This approach was used at the beginning for all migrations, and will continue to be used for T333120: Migrate internal traffic to k8s

Proposal #4: Percentage-based global traffic redirect (followup to winner)

See T336038: Add traffic sampling support to mw-on-k8s.lua ATS script
A LUA script was added to ATS. It supports:

  • Sending any percentage of traffic for a domain to mw-on-k8s (including 0 and 100%)
  • Sending any percentage of global traffic to mw-on-k8s

This approach will be used going forward, with the current thresholds described in Roll out phase 2
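Percentage-based sampling of this kind can be sketched as follows. This is not the actual mw-on-k8s.lua logic (which is linked above); hashing by client IP is an illustrative choice that keeps the example deterministic, and the parameter names are hypothetical:

```python
import hashlib

def sample_to_k8s(client_ip: str, domain: str,
                  per_domain: dict, global_pct: float) -> bool:
    """Decide whether to send a request to mw-on-k8s.

    A per-domain percentage, if configured, overrides the global one;
    both may be 0 or 100. The client IP is hashed into one of 10000
    buckets (0.01% resolution) and compared against the threshold.
    """
    pct = per_domain.get(domain, global_pct)
    bucket = int(hashlib.sha256(client_ip.encode()).hexdigest(), 16) % 10000
    return bucket < pct * 100
```

Raising the global percentage in increments, as in roll-out phase 2, then only requires changing one number at the edge rather than touching per-URL or per-wiki routing rules.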

After discussions, serviceops has decided to mix and match ideas from the above proposals.

Roll out

Roll out phase 1: Start serving a small portion of content from specific wikis

Roll out phase 2: Migrate global traffic by increments


Event Timeline

There are a very large number of changes, so older changes are hidden.

Is there any particular reason that the "[ ] Wikitech is ideal to dogfood mw-on-k8s, there are challenges though that we need to overcome T292707" step was skipped in the rollout plan? Is there anything that I can do to help get action on T292707: Migrate Wikitech to Kubernetes?

Change 957241 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw-on-k8s: Fix test-commons redirect

https://gerrit.wikimedia.org/r/957241

Change 957241 merged by Clément Goubert:

[operations/puppet@production] mw-on-k8s: Fix test-commons redirect

https://gerrit.wikimedia.org/r/957241

Mentioned in SAL (#wikimedia-operations) [2023-09-13T08:46:30Z] <claime> Running puppet on cp-text P:trafficserver::backend - T290536

Change 961351 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw-on-k8s: Remove wikidata exception

https://gerrit.wikimedia.org/r/961351

Change 961351 merged by Clément Goubert:

[operations/puppet@production] mw-on-k8s: Remove wikidata exception

https://gerrit.wikimedia.org/r/961351