
Decide TLS auth proxy method for the new toolforge jobs framework
Closed, ResolvedPublic

Description

My initial idea was to deploy this API in toolforge k8s itself, something like https://jobs.toolforge.org/api/. The new Toolforge Jobs Framework requires client TLS in order to call the k8s API on behalf of the original user.
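
Just for illustration, a tool-side call could eventually look something like this once client TLS reaches the jobs API; the endpoint path and cert locations here are assumptions, not the real API:

# hypothetical example; endpoint path and cert locations are assumptions
import requests

response = requests.get(
    "https://jobs.toolforge.org/api/v1/jobs/",
    cert=("/data/project/mytool/.toolskube/client.crt",   # tool's k8s client cert
          "/data/project/mytool/.toolskube/client.key"),  # and matching key
)
print(response.status_code, response.text)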

This mix presents a certain challenge for the TLS flow:

  • for toolforge webservices, server-side TLS is terminated in the nginx front proxy, and the original client TLS doesn't reach the final nginx or the pod running in k8s.
  • the nginx front proxy calls the k8s ingress without TLS, and specifically without accounting for the original client TLS certs.
  • the acme-chief certificate for *.toolforge.org is currently not available to arbitrary pods running in k8s.

Some options to handle this:

option 1

introduce a special case for 'jobs.toolforge.org' in the front proxy that uses the nginx stream method to proxy at the TCP level instead of HTTP. We leave client cert TLS termination for jobs.toolforge.org in the actual jobs.toolforge.org pod or the k8s ingress controller.

Docs: https://docs.nginx.com/nginx/admin-guide/load-balancer/tcp-udp-load-balancer/
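
As a rough illustration only (not the exact config), the TCP-level special case could look like this, assuming ngx_stream_ssl_preread is available and the existing HTTPS vhost moves to an internal port so the stream block can own :443; all names and addresses below are placeholders:

stream {
    # route by SNI without terminating TLS, so client certs pass through untouched
    map $ssl_preread_server_name $selected_upstream {
        jobs.toolforge.org  k8s_ingress;   # passthrough to k8s
        default             local_https;   # everything else: existing front proxy vhost
    }

    upstream k8s_ingress {
        server 192.0.2.10:30001;           # placeholder ingress node:port
    }

    upstream local_https {
        server 127.0.0.1:8443;             # placeholder internal HTTPS listener
    }

    server {
        listen 443;
        ssl_preread on;
        proxy_pass $selected_upstream;
    }
}

The trade-off is that the front proxy no longer sees HTTP for jobs.toolforge.org traffic, so access logging and redirects for that name would have to live inside k8s.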

option 2

introduce general config options in nginx front proxy to account for client TLS certs, and forward them to the k8s ingress backend. We would need to make nginx front proxy aware of the k8s CA in order to validate the client TLS cert. This might be the most 'elegant' solution, barring the inconvenience of having to extract the k8s CA and make it available to nginx front proxy.

The nginx front proxy config would end up looking like this:

server {
  [..]
  # the k8s CA, so nginx can validate client certs issued by the cluster
  ssl_client_certificate  /etc/kubernetes-ca.crt;
  # 'optional' keeps certless requests (regular webservice traffic) working
  ssl_verify_client optional;
  [...]
  # forward the verified client subject DN to the k8s ingress backend;
  # the backend must only trust this header when it is set by the front proxy
  proxy_set_header X-Client-Dn $ssl_client_s_dn;
}

option 3

Move general TLS termination out of the front proxy and into the k8s ingress nginx server. We may still have to support webservices running on the grid, so this option is unlikely to be feasible.

option 4

Introduce a dedicated frontend for the jobs API in parallel to the common toolforge front proxy. This will cost a VM and 1 floating IP.

Event Timeline

aborrero renamed this task from Decide TLS auth method for the new toolforge jobs framework to Decide TLS auth proxy method for the new toolforge jobs framework. Feb 8 2021, 5:52 PM
aborrero updated the task description.

For discussion: option 3 is possible if we changed the grid to create service objects inside k8s that proxy out to the grid. That sounds like a fair bit of work since they currently use a very simple process and some tools don't have k8s-compatible names... but it would be a way to start winding down the front proxy.

What about a new ingress controller for the jobs api?

What about a new ingress controller for the jobs api?

What would that look like in terms of URLs/endpoints?

I was trying to make this API public (in the sense of public URL/IPv4 address) but it is true that for the first iteration a simple internal name like jobs.k8s.svc.tools.eqiad1.wikimedia.cloud or similar, with its own ingress controller, should be enough.

What about a new ingress controller for the jobs api?

What would that look like in terms of URLs/endpoints?

I was trying to make this API public (in the sense of public URL/IPv4 address) but it is true that for the first iteration a simple internal name like jobs.k8s.svc.tools.eqiad1.wikimedia.cloud or similar, with its own ingress controller, should be enough.

A combo of this with option 1 feels like it would be a great potential solution. We can probably afford the special case in the front proxy if the ultimate goal is to deprecate the grid and remove the frontproxy altogether.

I was trying to make this API public (in the sense of public URL/IPv4 address) but it is true that for the first iteration a simple internal name like jobs.k8s.svc.tools.eqiad1.wikimedia.cloud or similar, with its own ingress controller, should be enough.

This is one of those situations where cert-manager shines (and why it is integrated into most k8s ingresses). A dedicated frontend would be very flexible in terms of backend implementation, but that could also just be a different termination that option 1 points to....which isn't a bad thought. Option 2 seems elegant, but I don't like the idea of duplicating the CA to it very much.

A combo of this with option 1 feels like it would be a great potential solution. We can probably afford the special case in the front proxy if the ultimate goal is to deprecate the grid and remove the frontproxy altogether.

+1 I like that a lot. I also think that deprecating the frontproxy (and possibly eventually moving to cert-manager after that's happened) might be a good future.

The new Toolforge Jobs Framework requires client TLS in order to call the k8s API on behalf of the original user.

So the idea would be to have the cli tooling for this send the tool's k8s client cert to an API that then replays that cert somehow when talking to the k8s API? Would that work even with passthrough? X.509 auth requires an active challenge-response, which would need to happen at the k8s API, wouldn't it? Or am I misunderstanding, and the authn needed here is only to the intermediate API, which then talks to the k8s API using a service account that can operate in all of the tool namespaces?

The new Toolforge Jobs Framework requires client TLS in order to call the k8s API on behalf of the original user.

So the idea would be to have the cli tooling for this send the tool's k8s client cert to an API that then replays that cert somehow when talking to the k8s API? Would that work even with passthrough? X.509 auth requires an active challenge-response, which would need to happen at the k8s API, wouldn't it? Or am I misunderstanding, and the authn needed here is only to the intermediate API, which then talks to the k8s API using a service account that can operate in all of the tool namespaces?

A bit simpler: the API receives client TLS auth and validates the client cert against the k8s CA. That cert includes the user in the CN. With that information, the API can then load the same certs (pub/priv) from the user's home on NFS. With the loaded cert, our API then contacts the k8s API.

Does this diagram help?

image.png (486×1 px, 108 KB)
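
To make that concrete, here is a minimal sketch of how the API could map the verified client cert to the tool's own k8s credentials. This is an illustration, not the framework's code; it assumes the verified subject DN reaches the API (e.g. via an X-Client-Dn style header as in option 2), that the CN is the tool name, and that per-tool certs live under /data/project/<tool>/.toolskube/:

# hypothetical sketch of the flow described above, not the framework's code
from pathlib import Path

from kubernetes import client


def k8s_client_for(dn_header: str, api_url: str, ca_path: str):
    # e.g. "CN=mytool,O=toolforge" -> "mytool" (assumed DN layout)
    attrs = dict(p.strip().split("=", 1) for p in dn_header.split(","))
    toolname = attrs["CN"]

    keydir = Path("/data/project") / toolname / ".toolskube"

    # Re-use the tool's own cert/key so the k8s API enforces the tool's own
    # RBAC; the jobs API never needs a cluster-wide service account.
    conf = client.Configuration()
    conf.host = api_url                       # k8s API endpoint (placeholder)
    conf.ssl_ca_cert = ca_path                # k8s CA, to verify the API server
    conf.cert_file = str(keydir / "client.crt")
    conf.key_file = str(keydir / "client.key")
    return client.BatchV1Api(client.ApiClient(conf))

The important property is that the intermediate API always talks to the k8s API with the tool's own certificate, so authorization stays exactly as restrictive as it is for the tool itself; no cluster-wide service account is involved.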

Change 662941 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: add ingress configuration for jobs.toolforge.org

https://gerrit.wikimedia.org/r/662941

Does this diagram help?

image.png (486×1 px, 108 KB)

It does, thank you.

I re-read https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_jobs to remind myself why this api + client approach is being considered. I'm not sure I'm convinced yet that trading added complexity in the ingress for reduced deployment complexity in the Docker images is a fair exchange. But I can see the merits of the argument that, if the front proxy is removed in the future, much of what I currently see as additional complexity at the ingress will actually disappear.

My approach so far has been: we have been bitten by the tools-webservices architecture limitations a few times already. Let's try a different approach, which turns out to have some potential side benefits as well (think of a visual web frontend for the API, so toolforge users could potentially manage their jobs using that).

I don't think the new proxy thing is too complex. It is a major snowflake, but not too complex to maintain (it shouldn't change once introduced if the URI doesn't change).

aborrero triaged this task as Medium priority. Feb 10 2021, 12:34 PM

Mentioned in SAL (#wikimedia-cloud) [2021-02-19T10:25:49Z] <arturo> create DNS zone svc.toolsbeta.eqiad1.wikimedia.cloud (T274139)

Mentioned in SAL (#wikimedia-cloud) [2021-02-19T10:27:47Z] <arturo> create DNS record jobs.svc.toolsbeta.eqiad1.wikimedia.cloud with CNAME to k8s.toolsbeta.eqiad1.wikimedia.cloud (T274139)

Change 662941 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: add haproxy and nginx-ingress configuration for jobs.toolforge.org

https://gerrit.wikimedia.org/r/662941

This is now solved:

  • we decided not to modify the front proxy yet.
  • In toolsbeta, I created jobs.svc.toolsbeta.eqiad1.wikimedia.cloud as a CNAME to k8s.toolsbeta.eqiad1.wikimedia.cloud, i.e. pointing to HAProxy.
  • there is a new HAProxy port (30001/tcp) that contacts the ingress nodes in the k8s cluster (see the sketch below).

This should unblock things for now.
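
For reference, a rough sketch of what that HAProxy piece could look like; this is not the puppetized configuration, and server names, addresses and the backend nodeport are placeholders:

# hypothetical sketch; the real configuration lives in operations/puppet
frontend k8s-ingress-jobs
    bind *:30001
    mode tcp
    option tcplog
    default_backend k8s-ingress-jobs-nodes

backend k8s-ingress-jobs-nodes
    mode tcp
    balance roundrobin
    server ingress-1 192.0.2.11:30001 check   # placeholder ingress node
    server ingress-2 192.0.2.12:30001 check   # placeholder ingress node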