
Route testwiki traffic to the Pretrain MVP environment
Open, Medium, Public

Description

Our plan of record is to use k8s Ingress to route testwiki traffic to the associated Pretrain deployments - i.e., direct traffic bound for test.wikipedia.org to the appropriate services by path-dependent use case (generic web or API, jobrunner, etc.).
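As a rough illustration of what "path-dependent" routing means here, the following Python sketch models the decision an Ingress would make for testwiki requests. It is only a model of the logic, not Ingress configuration. The backend names (`pretrain-web`, `pretrain-api`, `pretrain-jobrunner`) and the path prefixes are hypothetical placeholders, and the segment-wise "longest prefix wins" matching is a simplified assumption about how the rules would be evaluated.

```python
from typing import Optional

# Hypothetical values: none of these backend names or path prefixes come
# from the task; they only illustrate the shape of the routing rules.
PRETRAIN_HOST = "test.wikipedia.org"
RULES = [
    ("/rpc", "pretrain-jobrunner"),
    ("/w/api.php", "pretrain-api"),
    ("/", "pretrain-web"),
]


def _segments(path: str) -> list[str]:
    """Split a URL path into its non-empty segments."""
    return [s for s in path.split("/") if s]


def pick_backend(host: str, path: str) -> Optional[str]:
    """Return the backend for a request, or None if the host is not handled.

    `path` is expected to be the URL path only (no query string).
    """
    if host.lower() != PRETRAIN_HOST:
        return None
    req = _segments(path)
    best, best_len = None, -1
    for prefix, backend in RULES:
        pre = _segments(prefix)
        # A rule matches when its segments are a prefix of the request's
        # segments; the rule with the most segments wins.
        if req[:len(pre)] == pre and len(pre) > best_len:
            best, best_len = backend, len(pre)
    return best
```

With these placeholder rules, `pick_backend("test.wikipedia.org", "/w/api.php")` selects the API backend, any other testwiki path falls through to the generic web backend, and requests for other hosts return `None` (i.e., they are not the Ingress's concern).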

While Ingress is ideal for this kind of application, and converges toward ideas explored to enable Single Version MediaWiki, we're not yet at a point where we want to make Ingress load-bearing for the existing set of critical MW-on-k8s services (e.g., mw-web).

What that means is that we must externally shunt testwiki-bound traffic to Ingress, which in practice means that we will need to:

  1. Divert external testwiki traffic to Ingress at ATS.
    • Note that this is limited to cases where ATS is the last-hop toward MW-on-k8s.
    • In practical terms, that means generic web traffic and X-Wikimedia-Debug traffic (we will have a Pretrain-debug), and not traffic diverted to REST Gateway.
  2. Divert external testwiki traffic to Ingress at REST Gateway.
    • In practical terms, this means writing envoy configuration to divert to a different upstream cluster based on Host.
  3. Divert internal testwiki traffic to Ingress.
    • In practical terms, this is mainly focused on internal API traffic and jobrunner traffic, which will again rely on Host-header diversion via the envoy service mesh (i.e., for all services where the relevant listener(s) are enabled).
    • One area where that gets tricky is changeprop. Although the jobqueue deployment thereof now also applies the appropriate Host header to outbound requests (T395451), we don't actually use the service mesh on either deployment. Unless we were to implement diversion directly in changeprop (strongly not preferred), the most straightforward approach is to imbue them with service mesh.
    • Another area that needs further investigation is internal API traffic from Knative workloads running in the ml-serve k8s clusters, which appear to use a completely different mechanism to proxy egress HTTP to internal services (see e.g. net_istio.mesh.service_entries). If this turns out to be too complex to easily accommodate, then that may influence our decisions around Ingress adoption (see Why not Ingress-all-the-things? below).
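
The Host-header diversion described in items 2 and 3 boils down to a small decision at each hop. The Python sketch below models that decision only; the real implementation would be envoy route configuration, not Python. The cluster name `pretrain-ingress` and the helper names are hypothetical, and the normalization (case-folding, dropping an optional port and trailing dot) is an assumption about what a robust match would need.

```python
# Hypothetical names: the real mechanism would be envoy route configuration.
DIVERTED_HOSTS = {"test.wikipedia.org"}
INGRESS_CLUSTER = "pretrain-ingress"


def normalize_host(host_header: str) -> str:
    """Lower-case a Host header value and drop any :port and trailing dot."""
    host = host_header.strip().lower()
    # Strip an optional :port suffix (IPv6 literals are not handled here).
    if ":" in host:
        host = host.rsplit(":", 1)[0]
    return host.rstrip(".")


def select_cluster(headers: dict[str, str], default_cluster: str) -> str:
    """Pick the upstream cluster for a request based on its Host header."""
    host = ""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "host":
            host = normalize_host(value)
            break
    return INGRESS_CLUSTER if host in DIVERTED_HOSTS else default_cluster
```

For example, `select_cluster({"Host": "test.wikipedia.org"}, "mw-api-int")` returns the Ingress cluster, while any other Host value (or a missing Host header) leaves the request on the default cluster passed in by the caller. This also makes the changeprop caveat concrete: the diversion only works if outbound requests both carry the right Host header and pass through a proxy that evaluates it.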

While some additional research and preparation can happen sooner, we'll need to complete the basic service turnup before any of this can be enabled.


Additional discussion

Why not Ingress-all-the-things? - When developing the proposal for the Pretrain MVP environment, we prioritized avoiding significant intrusive architectural changes for the existing MW-on-k8s services. However, if we were to migrate both mw-jobrunner and mw-api-int to Ingress, that would supersede all of the work in #3 (and in many ways result in a simpler-to-reason-about end state, since diversion would be consolidated in one place rather than distributed across mesh configuration). One might further argue that if Single-Version MediaWiki is likely to become the future and follow the path set out last year (Ingress as PoR), then maybe this would be valuable regardless. There are many more pros, cons, and practicalities than I'm detailing here, but I remain open to a change in approach.