
Sane ingress methods for PAWS
Open, High, Public

Description

Investigate current ingress architecture for PAWS and move to a more manageable and understandable one.

PAWS has three main ingress endpoints: the JupyterHub proxy (https://paws.wmflabs.org), the deploy-hook (https://paws-deploy-hook.wmflabs.org), and the paws-public interface (https://paws-public.wmflabs.org).

Before I broke the ingress controller for the PAWS cluster on May 19, each was using a different configuration. Let's use this task to attempt to bring them all into sane, documented methods of ingress.

The original description was:
PAWS still uses a proxy in the PAWS VPS project for public ingress, but uses a mix of kube-lego and the nginx ingress controller to provide ingress into the Kubernetes pods.

While attempting to upgrade from kube-lego to cert-manager for certificate renewal (an expired certificate had kept the auto-deploy feature from working), I apparently broke ingress for PAWS.
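
(For context, cert-manager drives ACME issuance from a manifest along these lines. This is a hedged sketch using the current cert-manager API, not the manifest that was actually applied; the 2018-era API group and fields were different, and the names and email are placeholders.)

apiVersion: cert-manager.io/v1       # current API; the version attempted here predates it
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod             # hypothetical issuer name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: paws-admins@example.org   # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-account-key  # secret holding the ACME account key
    solvers:
      - http01:
          ingress:
            class: nginx             # answer HTTP-01 challenges through the nginx ingress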

After trying to fix it, I gave up and used the same workaround implemented in PAWS-beta for ingress (changing the Kubernetes service type to NodePort and pointing the proxy at that port on one of the cluster nodes).

That is not a great design, however, as it fails if that single node goes down.
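
For illustration, the workaround boils down to a service change like the following (a minimal sketch; the names, selector and ports are placeholders, not the actual PAWS manifests):

apiVersion: v1
kind: Service
metadata:
  name: proxy-public          # placeholder: the service the external proxy targets
spec:
  type: NodePort              # changed from ClusterIP; kube-proxy opens the port on every node
  selector:
    app: jupyterhub-proxy     # placeholder selector
  ports:
    - port: 80
      targetPort: 8000        # placeholder container port
      nodePort: 32611         # example fixed port from the 30000-32767 NodePort range

The external proxy then points at one node's IP on that port; any node would accept the traffic, but whichever node is in DNS remains a single point of failure.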

We should investigate and deploy Kubernetes ingress for PAWS-beta and then push it to PAWS as well. Ideally, we incorporate this into the Helm chart and keep deployment fully automated.

Event Timeline

Chicocvenancio triaged this task as Normal priority.May 22 2018, 4:53 PM

As I try to further understand Kubernetes ingress, it seems it is not really needed or beneficial for PAWS. In the end, we would still have to use a NodePort to expose the ingress controller, and be subject to the single point of failure of the node chosen for the DNS record. As it is, I'm inclined to define the current workaround as the final architecture, porting it to the deploy-hook entrypoint and documenting the whole thing.

It seems the ingress controller would only provide the general benefits of a reverse proxy inside the Kubernetes cluster; since we have a few reverse proxy options outside the cluster, I don't see this as an inherently beneficial move.

I'd love some input on this from someone more experienced with Kubernetes.

Chicocvenancio renamed this task from Use a sane ingress controllers for PAWS to Sane ingress methods for PAWS.May 22 2018, 9:24 PM
Chicocvenancio updated the task description. (Show Details)

https://github.com/yuvipanda/paws/pull/27 should take care of all the necessary configuration within Kubernetes itself; there is still the matter of sending traffic to the Kubernetes cluster (any node would do, though the master makes the most sense to me).

Currently this uses the paws-proxy-01 instance, which runs nginx to route traffic to the appropriate backend. I think we should ditch this and use the web-proxy available in Horizon to set up the ingress, as is currently done in PAWS-beta. This also requires a security group change; it can be a copy of security group 932 (@toolsbeta, i.e. ports 32611-32612 accessible from 10.0.0.0/8).

There is still the matter of paws-public to settle before we can completely ditch the paws-proxy-01 instance (and the whole VPS project). I'll look into it tomorrow.

The move from DNS pointing at paws-proxy-01 to web-proxies through Horizon might involve downtime if something goes wrong (and needs to be done in the tools VPS project). How should we handle this?

Chicocvenancio added a subscriber: Bstorm.EditedMay 25 2018, 1:15 AM

@Bstorm discovered that paws-public is basically a Kubernetes tool running in Toolforge.

I propose the following tasks:

  • Edit the security groups in the tools vps project to allow access to the ports 32611 and 32612 from the tools-proxy-01 (can be 10.0.0.0/8)
  • Remove the web-proxy for paws.wmflabs.org in the paws vps project
  • Create a web-proxy in the tools vps project pointing to tools-paws-master-01 at port 32611 for paws.wmflabs.org
  • Remove the web-proxy for paws-deploy-hook.wmflabs.org in the paws vps project
  • Create a web-proxy in the tools vps project pointing to tools-paws-master-01 at port 32612 for paws-deploy-hook.wmflabs.org
  • Edit the paws redirection in toolforge to also redirect to paws-public if the host is paws-public.wmflabs.org (could this live in the redirects vps project?)
  • Create a web-proxy in the tools vps project pointing to tools-proxy-01 for paws-public.wmflabs.org (will this work? the idea is to have requests for paws-public.wmflabs.org be dealt with by the general toolforge rules)

With that, the paws VPS project can be deleted entirely and we will have ingress for each component that is easier to reason about, i.e. web-proxies pointing at NodePorts on the Kubernetes cluster and a redirect to a Toolforge tool.

As noted before, this may cause (hopefully brief) downtime for paws.wmflabs.org.
The tools VPS project tasks will need (Toolforge) root time to be carried out.

Adding the team project here to get this on their radar.

Chicocvenancio removed Chicocvenancio as the assignee of this task.May 29 2018, 11:23 PM
bd808 added a subscriber: bd808.May 29 2018, 11:36 PM
Chicocvenancio added a comment.EditedJun 16 2018, 7:47 PM

Noting that JupyterHub 0.9 has been released. This task is now the main blocker for deploying changes to PAWS.

There is still some work to be done, but I should be able to go through it next week.

Main tasks left:

Mentioned in SAL (#wikimedia-cloud) [2018-06-20T22:03:53Z] <chicocvenancio> updated nginx-ingress-controler to sidestep T195217

Vvjjkkii renamed this task from Sane ingress methods for PAWS to mkcaaaaaaa.Jul 1 2018, 1:08 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description. (Show Details)
TerraCodes renamed this task from mkcaaaaaaa to Sane ingress methods for PAWS.Jul 1 2018, 2:31 AM
TerraCodes lowered the priority of this task from High to Normal.
TerraCodes updated the task description. (Show Details)
bd808 added a subscriber: GTirloni.Sep 12 2018, 5:19 PM

@GTirloni this might be an interesting project for you to work with @Chicocvenancio on. It's been sitting in my "things to help do" list for far too long.

@bd808, looks interesting! I'll get up to speed and sync with Chico.

GTirloni added a comment.EditedJan 24 2019, 6:21 PM

I tested an architecture with TCP load balancing (nginx stream module) in front of the k8s workers and then just used plain Ingress resources.

This seemed simpler and more in line with k8s best practices than rolling our own via web proxies in Horizon.

This should make the PAWS cluster really standalone, not depending on magical Horizon webproxies (which can't handle multiple IPs and don't do health checking).

Routing all traffic through a single ingress controller running on the master feels like a k8s anti-pattern to me. The master shouldn't be schedulable at all (apart from the underlying API server/scheduler/controller-manager processes).

Having a few LBs in front of the k8s workers is a common architecture for deploying k8s on bare metal. Once/if we have proper automation for this, it can be deployed anywhere.

If PAWS traffic grows larger than the master can handle, we'll have to re-architect everything anyway, but maybe we can cross that bridge when we get there.

There's a time/energy constraint on my part, so I don't want to be a blocker here, but I wanted to share my proposal. I'm happy to work with @Chicocvenancio on the steps that require admin access to the tools project to carry out his webproxy proposal.

Sample nginx config for load balancers (missing proper health checks, etc).

/etc/nginx/conf.d/paws.conf:

stream {
  upstream paws_workers_http {
    server paws-k8s-worker-01.paws.eqiad.wmflabs:32080 max_fails=3 fail_timeout=10s;
    server paws-k8s-worker-02.paws.eqiad.wmflabs:32080 max_fails=3 fail_timeout=10s;
    server paws-k8s-worker-03.paws.eqiad.wmflabs:32080 max_fails=3 fail_timeout=10s;
    server paws-k8s-worker-04.paws.eqiad.wmflabs:32080 max_fails=3 fail_timeout=10s;
  }

  upstream paws_workers_https {
    server paws-k8s-worker-01.paws.eqiad.wmflabs:32443 max_fails=3 fail_timeout=10s;
    server paws-k8s-worker-02.paws.eqiad.wmflabs:32443 max_fails=3 fail_timeout=10s;
    server paws-k8s-worker-03.paws.eqiad.wmflabs:32443 max_fails=3 fail_timeout=10s;
    server paws-k8s-worker-04.paws.eqiad.wmflabs:32443 max_fails=3 fail_timeout=10s;
  }

  server {
    listen 80;
    proxy_pass paws_workers_http;
  }

  server {
    listen 443;
    proxy_pass paws_workers_https;
  }
}
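
To complete the picture, the "plain Ingress resources" part would look roughly like this (the hostname aside, the names are placeholders, and the assumption is an nginx ingress controller service exposed on NodePorts 32080/32443 as in the upstreams above; the API version shown is the current one):

apiVersion: networking.k8s.io/v1      # current API; extensions/v1beta1 at the time of writing
kind: Ingress
metadata:
  name: paws-hub                      # placeholder name
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: paws.wmflabs.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: proxy-public    # placeholder: the JupyterHub proxy service
                port:
                  number: 80

The load balancers only forward TCP, so host- and path-based routing (and TLS, if terminated in-cluster) are handled by the ingress controller on the workers.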

I am not opposed to this approach, but I do think it doesn't bring a lot of extra reliability to PAWS unless we add some other changes to make the PAWS k8s cluster HA and maybe add nodes to increase capacity.

As of now, the master is a SPOF and, as I see it, it will continue to be one with only this change.

If we can add at least a second master and do some capacity planning for PAWS peak usage (mainly during hackathons), I wholeheartedly support a more robust ingress.

Otherwise, I'd prefer the simpler ingress with k8s NodePorts pointing at the master, as there will still be some work needed for this.

Chicocvenancio raised the priority of this task from Normal to High.Mar 13 2019, 7:53 PM

In order to simplify the proxy setup and free the existing A record at paws.wmflabs.org to be used in an HA load balancing scenario, I've deleted the paws.wmflabs.org and paws-public.wmflabs.org webproxies.

I've recreated those addresses as A records and pointed them to paws-proxy-02 where I configured TLS termination using Let's Encrypt (with automatic renewal).

This eliminated one proxy layer from this setup, which previously had 3: webproxy -> paws-proxy-02 -> k8s nginx ingress -> app

GTirloni removed a subscriber: GTirloni.Mar 23 2019, 8:46 PM