
Toolforge ingress: decide on final layout of north-south proxy setup
Open, High, Public

Description

In T228500: Toolforge: evaluate ingress mechanism, we discussed several setups for the north-south traffic and proxy setup. By north-south I mean traffic between end users (the internet) and the pod containing the tool webservice.

Related things to decide:

  • Do we want to introduce $tool.$domain.org or not? My feeling is yes. Also, if we introduce this pattern, we should do it only for toolforge.org.
  • Do we want to introduce toolforge.org or not? My feeling is yes.
  • Will the legacy k8s be aware of the two things above? I.e., would we introduce either $tool.$domain.org or toolforge.org/$tool in the old k8s deployment? My feeling is that we don't want this, as it would be a lot of work that would only be valid for the compat/migration period between k8s deployments.
  • Will the web grid be aware of the things above? I.e., would we introduce either $tool.$domain.org or toolforge.org/$tool in the web grid? My feeling is that this can be done later, after the new k8s is already in place.
  • SSL termination

I will try to summarize the different options here:

Diagram 0: the current setup. Dynamicproxy redirects tools.wmflabs.org/$tool to the right backend (be it the web grid or the legacy k8s).
Diagram 1: we introduce a new proxy in front of both the current setup and the new k8s. This proxy knows how to redirect *.toolforge.org to the new k8s and tools.wmflabs.org/$tool to dynamicproxy.
Diagram 2: the new k8s acts as proxy for the current setup, by means of the ingress. We can create an ingress rule to redirect all tools.wmflabs.org/$tool traffic to dynamicproxy.
Diagram 3: proposed by @bd808. We update dynamicproxy to be in front of both the legacy setup and the new k8s.
Diagram 4: split setup. The current setup and the new k8s are totally separated. This is perhaps the simplest setup.
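
To make the Diagram 1 idea concrete, here is a minimal sketch of the decision the new front proxy would have to make: dispatch on the Host header alone. The function name and the returned labels are hypothetical and only illustrate the rule described above. The sketch does not cover the bare toolforge.org domain, since the description only mentions *.toolforge.org.

```python
def pick_upstream(host: str) -> str:
    """Return a label for the upstream that should receive a request.

    Hypothetical helper modelling the Diagram 1 front proxy:
    *.toolforge.org goes to the new k8s, tools.wmflabs.org goes to
    dynamicproxy, which then handles the /$tool path itself.
    """
    # Normalise: Host headers are case-insensitive and may carry a port.
    name = host.lower().split(":")[0]
    if name.endswith(".toolforge.org"):
        return "new-k8s"
    if name == "tools.wmflabs.org":
        return "dynamicproxy"
    # Anything else is outside the scope of this sketch.
    return "unknown"
```

For example, pick_upstream("hello.toolforge.org") selects the new k8s, while pick_upstream("tools.wmflabs.org") selects dynamicproxy.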

Event Timeline

aborrero triaged this task as High priority. Fri, Sep 27, 12:47 PM
aborrero moved this task from Inbox to Important on the cloud-services-team (Kanban) board.
aborrero updated the task description. Fri, Sep 27, 5:21 PM
aborrero added a comment. Edited Mon, Sep 30, 3:58 PM

For the record, @bd808 mentioned another option: having dynamicproxy understand how to forward to the new k8s cluster.

I've been playing with option 2, and here are my tests:

Create a service and ingress object like the following:

root@toolsbeta-test-k8s-master-1:~# cat toolforge-legacy.yaml 
apiVersion: v1
kind: Service
metadata:
  name: toolforge-legacy
spec:
  type: ExternalName
  externalName: tools.wmflabs.org
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: toolforge-legacy
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: tools.wmflabs.org
    http:
      paths:
      - path: /
        backend:
          serviceName: toolforge-legacy
          servicePort: 80

Load it and try!

aborrero@toolsbeta-test-k8s-lb-01:~ $ curl localhost/openstack-browser -H "Host:tools.wmflabs.org" -L
<!DOCTYPE HTML>
[..]
          <a class="navbar-brand" href="/openstack-browser/">OpenStack browser</a>
        </div>
[..]

The nginx-ingress pod seems very happy processing this:

192.168.44.192 - [192.168.44.192] - - [04/Oct/2019:10:47:22 +0000] "GET /openstack-browser HTTP/1.1" 301 185 "-" "curl/7.52.1" 97 0.004 [default-toolforge-legacy-80] [] 172.16.6.39:80 185 0.004 301 3e16cd459c76f2542fb4d4409b4b0203
192.168.44.192 - [192.168.44.192] - - [04/Oct/2019:10:48:12 +0000] "GET /openstack-browser/project HTTP/1.1" 301 185 "-" "curl/7.52.1" 105 0.000 [default-toolforge-legacy-80] [] 172.16.6.39:80 185 0.000 301 f4d4fd5b98934828e74fd9ded97b6c10

This option seems pretty straightforward. The client receives a 301 redirect to HTTPS from dynamicproxy:

< HTTP/1.1 301 Moved Permanently
< Server: openresty/1.15.8.1
< Location: https://tools.wmflabs.org/openstack-browser

So in this setup we may not even care about handling SSL for the legacy toolforge (web grid + legacy k8s).

Conclusions: this option seems simple and straightforward, unless I'm overlooking something else.

aborrero updated the task description. Fri, Oct 4, 11:06 AM
Bstorm added a comment. Fri, Oct 4, 4:37 PM

My first inclination is that #3 is the most straightforward and supportable, but I know I am biased a bit because I am most accustomed to supporting that sort of setup in Toolforge. It also can simply be a matter of teaching software that is already in our laps how to route to "new" k8s vs old and doesn't require a new domain name scheme that may be unpopular and damaging to many tools (we don't know yet, but I remember the trouble I caused when I changed schemas on the wiki replicas--I know for many the change will be very exciting and good).

If we want technology that other people develop for and support (like Kubernetes) to be our future infrastructure foundation, option 2 makes more sense because then our custom stuff can be more easily deprecated since it is behind that. It also would allow us to start thinking of Kubernetes technologies as more of the "Toolforge platform", using things like CRDs and operators as customizations (even ones other people develop like Open Policy Agent), etc.--or at least our glue hacks can run inside k8s, which keeps them live for us, lol.

If we did option 2 and introduced the new domain, but we also allowed path-based routing for those who needed it (I know that's trickier) it might be a good balance. I think the noisiest voices on the topic want to switch to subdomains, but I'm thinking of things like CORS rules and wiki restrictions that may bite tool authors and be easier to handle in paths than subdomains. A lot of that is broken by changing domain ANYWAY, so maybe that doesn't matter. I'm just trying to get my thoughts down somewhere.

Mentioned in SAL (#wikimedia-cloud) [2019-10-08T12:27:56Z] <arturo> created VM toolsbeta-test-proxy-01 for testing stuff related to T234037

Mentioned in SAL (#wikimedia-cloud) [2019-10-08T14:14:54Z] <arturo> created puppet prefix toolsbeta-test-proxy for testing stuff related to T234037

We had a meeting yesterday 2019-10-10 and we decided to try option 3 first, with fallback to option 2.

The general front proxy will be dynamicproxy, which will keep more or less the same setup but include a fall-through route to the new k8s deployment.
Also, we will try introducing the toolforge.org domain (T234617) if we manage to address T235252: Toolforge: SSL support for new domain toolforge.org in time.

Mentioned in SAL (#wikimedia-cloud) [2019-10-14T12:26:04Z] <arturo> created security group arturo-test-dynamicproxy-backend to tests stuff related to T234037

Ok, I've been playing with the dynamicproxy nginx+lua components and I have a working setup. I disabled SSL/https in my tests until we handle T235252: Toolforge: SSL support for new domain toolforge.org.

This is more or less the diagram of the setup:

Right now, the Lua code has a fall-through mechanism that directs requests to the admin tool by default, which gracefully handles the "Tool not found" situation.
In the setup we agreed on to accommodate the new cluster, this mechanism should be different, because now the fall-through proxy is for the new k8s cluster. This is probably something to handle in T234032: Toolforge ingress: create a default landing page for unknown/default URLs.

Anyway, the changes in the Lua code are mostly to prevent it from generating the fall-through:

--- 1.lua	2019-10-14 17:59:16.429212877 +0200
+++ 2.lua	2019-10-14 17:59:29.221265917 +0200
@@ -40,37 +40,13 @@
 end
 
 if not route then
-    -- No routes defined for this uri, try the default (admin) prefix instead
-    rest = ngx.var.uri
-    routes_arr = red:hgetall('prefix:admin')
-    if routes_arr then
-        local routes = red:array_to_hash(routes_arr)
-        for pattern, backend in pairs(routes) do
-            if ngx.re.match(rest, pattern) then
-                route = backend
-                break
-            end
-        end
-    end
+    -- No routes defined for this uri, hope nginx can handle this! (new k8s cluster?)
+    ngx.exit(ngx.OK)
 end
 
 -- Use a connection pool of 256 connections with a 32s idle timeout
 -- This also closes the current redis connection.
 red:set_keepalive(1000 * 32, 256)
 
-if route then
-    ngx.var.backend = route
-    ngx.exit(ngx.OK)
-else
-    -- Oh noes!  Even the admin prefix is dead!
-    -- Fall back to the static site
-    if rest then
-        -- the URI had a slash, so the user clearly expected /something/
-        -- there.  Fail because there is no registered webservice.
-        ngx.exit(503)
-    else
-        ngx.var.backend = ''
-        ngx.exit(ngx.OK)
-    end
-end
-
+ngx.var.backend = route
+ngx.exit(ngx.OK)

The change on the nginx side is very small. We simply set a default backend if Lua couldn't find one. This backend is the haproxy of the new k8s cluster.

[..]
        set $backend '';

        access_by_lua_file /etc/nginx/lua/urlproxy.lua;

        if ($backend = '') {
            # no backend was found in redis, send this to the new k8s cluster
            set $backend 'http://toolsbeta-k8s-master.toolsbeta.wmflabs.org:80';
        }

        proxy_pass $backend;
[..]
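
The combined behaviour of the Lua lookup and the nginx default can be modelled outside nginx as a rough sketch. It assumes, based only on the visible part of the diff, that redis holds one hash per tool prefix mapping a URI pattern to a backend. The dictionary layout, the function names and the 10.0.0.1 backend address are hypothetical. The fallback URL is the one from the nginx snippet above.

```python
import re
from typing import Dict, Optional

# Fallback taken from the nginx snippet: the haproxy of the new k8s cluster.
NEW_K8S_BACKEND = "http://toolsbeta-k8s-master.toolsbeta.wmflabs.org:80"

# Assumed shape of the redis data: prefix -> {uri pattern -> backend}.
Routes = Dict[str, Dict[str, str]]


def lookup_route(routes_by_prefix: Routes, uri: str) -> Optional[str]:
    """Model of the Lua step: find a backend for the URI, or None."""
    parts = uri.lstrip("/").split("/", 1)
    prefix = parts[0]
    rest = "/" + parts[1] if len(parts) > 1 else ""
    for pattern, backend in routes_by_prefix.get(prefix, {}).items():
        # Unanchored regex search on the rest of the URI.
        if re.search(pattern, rest):
            return backend
    return None


def choose_backend(routes_by_prefix: Routes, uri: str) -> str:
    """Model of the nginx step: use the new k8s cluster when Lua found nothing."""
    return lookup_route(routes_by_prefix, uri) or NEW_K8S_BACKEND


# Hypothetical example data; the backend address is made up.
EXAMPLE_ROUTES: Routes = {"openstack-browser": {".*": "http://10.0.0.1:8000"}}
```

With EXAMPLE_ROUTES, a request for /openstack-browser/project resolves to the registered backend, while a request for / or for an unregistered tool falls through to the new k8s cluster.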

I decided to target haproxy instead of a worker node directly for a couple of reasons:

  • the list of backend servers used by haproxy is maintained in hiera.
  • we need some way to know which worker nodes we have, and to check that they are alive. I think haproxy works fine for this.
  • we are using haproxy for the new k8s apiservers anyway, so this reuses a piece of infra we already have.
  • I considered storing the info about the worker nodes in redis (or in nginx somehow), but I don't think that would be very elegant.
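
To illustrate what checking worker nodes for liveness amounts to, here is a minimal sketch of a TCP connect check. It is not what haproxy runs internally, only a stand-in for the idea. The function names are hypothetical, and a real setup would rely on haproxy's own health checks driven by the hiera-maintained server list.

```python
import socket
from typing import Iterable, List, Tuple

Backend = Tuple[str, int]  # (host, port)


def is_alive(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out or unresolvable: treat as down.
        return False


def live_backends(backends: Iterable[Backend], timeout: float = 1.0) -> List[Backend]:
    """Filter a list of (host, port) pairs down to those that accept connections."""
    return [b for b in backends if is_alive(b[0], b[1], timeout)]
```

A proxy would only send traffic to the entries returned by live_backends and would re-run the check periodically.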

Results: the same nginx handles both domains and URI schemes:

aborrero@tools-test-proxy-01:~$ curl -L localhost:80 -H "Host:hello.toolforge.org" 2>/dev/null ; echo | head
Hello World!
aborrero@tools-test-proxy-01:~$ curl -L localhost:80/openstack-browser -H "Host:tools.wmflabs.org" 2>/dev/null | head
<!DOCTYPE HTML>
<html lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta http-equiv="Content-Language" content="en-us">
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="initial-scale=1.0, user-scalable=yes, width=device-width">
    <meta http-equiv="imagetoolbar" content="no">
    <meta name="robots" content="noindex">

It is worth noting that all my tests were conducted on a tools-proxy server running Debian Buster (T235059).

TL;DR: this works just fine. I will prepare patches, documentation and a follow-up plan, since this seems to be reaching a reasonable shape.