
Create a reusable container to replace nginx ingress anonymizing reverse proxy setups
Open, In Progress, High · Public · Feature

Description

There are a number of tools that would like to provide anonymizing proxy access to a third-party resource in the same spirit as the Toolforge CDNJS and Google Fonts proxies. While helping find a resolution for T250922: MoeData causes visiting browser to load data from 3rd party sites I came up with an Ingress-only solution that would allow creating a reverse proxy attached to a path within a tool's $TOOLNAME.toolforge.org URL space. These Ingress objects use ingress-nginx specific annotations to configure nginx to act as a reverse proxy:

tool-bd808-test.proxy-scdn
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: https
    nginx.ingress.kubernetes.io/proxy-ssl-name: i.scdn.co
    nginx.ingress.kubernetes.io/proxy-ssl-server-name: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/upstream-vhost: i.scdn.co
  name: proxy-scdn
  namespace: tool-bd808-test
spec:
  rules:
  - host: bd808-test.toolforge.org
    http:
      paths:
      - backend:
          service:
            name: i-scdn-co
            port:
              number: 443
        path: /scdn(/|$)(.*)
        pathType: ImplementationSpecific

This pattern has now been used by several tools and is in danger of failing when T392356: Replace ingress-nginx before upstream EOL date is implemented.

It should be possible to create a reusable golang reverse proxy service that tools can use in combination with a custom Ingress that does not rely on ingress-nginx specific features. The proxy in https://gitlab.wikimedia.org/toolforge-repos/gitlab-content can be used as a source of inspiration and implementation for creating this new service container.

See also:

Event Timeline

bd808 triaged this task as High priority.

I have an initial working solution at https://gitlab.wikimedia.org/toolforge-repos/containers-rproxy. I have deployed it to replace the tool-bd808-test.proxy-scdn ingress-only proxy from the task description. That looked something like:

bd808@laptop:~$ ssh dev.toolforge.org
bd808@tools-bastion-14:~$ become bd808-test
tools.bd808-test@tools-bastion-14:~$ kubectl delete ingress proxy-scdn
tools.bd808-test@tools-bastion-14:~$ kubectl delete service i-scdn-co
tools.bd808-test@tools-bastion-14:~$ toolforge envvars create RPROXY_UPSTREAM_URL 'https://i.scdn.co'
tools.bd808-test@tools-bastion-14:~$ toolforge envvars create RPROXY_PATH_REGEX '/scdn(/|$)(.*)'
tools.bd808-test@tools-bastion-14:~$ toolforge envvars create RPROXY_PATH_TEMPLATE '/$2'
tools.bd808-test@tools-bastion-14:~$ toolforge envvars create GO_LOG debug
tools.bd808-test@tools-bastion-14:~$ toolforge jobs run \
    --image tool-containers/rproxy:latest \
    --command web \
    --continuous \
    --port 8000 \
    --health-check-http '/healthz' \
    rproxy-scdn
tools.bd808-test@tools-bastion-14:~$ kubectl apply --validate=true -f - << EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rproxy-scdn
spec:
  rules:
    - host: bd808-test.toolforge.org
      http:
        paths:
          - path: /scdn
            pathType: Prefix
            backend:
              service:
                name: rproxy-scdn
                port:
                  number: 8000
EOF
tools.bd808-test@tools-bastion-14:~$ toolforge jobs logs -f rproxy-scdn
2026-01-21T16:53:21Z [rproxy-scdn-7cb649754c-ph55j] [job] {"time":"2026-01-21T16:53:21.307769884Z","level":"INFO","msg":"Creating reverse proxy","upstream":"https://i.scdn.co"}
2026-01-21T16:53:32Z [rproxy-scdn-7cb649754c-ph55j] [job] {"time":"2026-01-21T16:53:32.127887145Z","level":"DEBUG","msg":"Proxying request","upstream":{"Scheme":"https","Opaque":"","User":null,"Host":"i.scdn.co","Path":"","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"oldPath":"/scdn/image/ab67616d0000b27373dc2eca0656689869d88ae9","newPath":"/image/ab67616d0000b27373dc2eca0656689869d88ae9","headers":{"Accept":["text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"],"Accept-Language":["en-US,en;q=0.9,ar;q=0.8,he;q=0.7,hi;q=0.6,fj;q=0.5"],"Dnt":["1"],"Priority":["u=0, i"],"Sec-Fetch-Dest":["document"],"Sec-Fetch-Mode":["navigate"],"Sec-Fetch-Site":["none"],"Sec-Fetch-User":["?1"],"Sec-Gpc":["1"],"Upgrade-Insecure-Requests":["1"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:147.0) Gecko/20100101 Firefox/147.0"],"X-Forwarded-For":["192.168.166.0"],"X-Forwarded-Host":["bd808-test.toolforge.org"],"X-Forwarded-Port":["443"],"X-Forwarded-Proto":["https"],"X-Forwarded-Scheme":["https"],"X-Original-Forwarded-Host":["bd808-test.toolforge.org"],"X-Original-Uri":["/scdn/image/ab67616d0000b27373dc2eca0656689869d88ae9"],"X-Real-Ip":["192.168.166.0"],"X-Request-Id":["e09350e87aefc426c1530a9481aa8f4f"],"X-Scheme":["https"]}}

Looking at P87548 I just realized that the current rproxy solution I have built really only works to provide a single reverse proxy per Toolforge tool. This is because the envvar-based configuration only allows one set of config data to be supplied. I need to think a bit harder about reasonable ways to support N deployments per tool. It would be relatively simple to add support for a config file, but I would like a solution that avoids using NFS if I can dream one up.

I thought a bit about how to adjust envvar lookups so that multiple instances of the reverse proxy job could run in parallel. I think that could be made possible by adding a CLI flag giving a prefix or suffix to use when looking up the envvars. It seems kind of wasteful to run N containers where 1 could do the work, but maybe that's ok?

Making it possible for one reverse proxy job to handle multiple path-to-upstream configs in parallel seems preferable to requiring multiple jobs. The tool already assumes a mapping from a request path to an upstream via a regex match and path template. Configuring N of these (path regex, upstream URL, upstream path template) tuples could be handled in a few different ways:

  • A single envvar could store a list of tuples using YAML/JSON/whatever encoding.
  • The RPROXY_PATH_REGEX, RPROXY_UPSTREAM_URL, RPROXY_PATH_TEMPLATE variables could each encode an array of N values. Whitespace separation should work for encoding lists of all three settings.
  • A new RPROXY_COUNT envvar could be introduced to tell the tool how many different RPROXY_{WHATEVER}_N settings to look for in the environment. For ease of use in the presumably more common single-upstream case, the default for RPROXY_COUNT could be 1 and the first reverse proxy could use the envvars without an index suffix.

The N upstreams case may or may not require the tool maintainer to set up multiple Ingress objects. In a new design I would recommend making all the upstream paths use a common root (/rproxy/ for example) that can be mapped by a single pathType: Prefix Ingress rule. For legacy tools with diverse existing roots, a separate Ingress can be provisioned for each root path as needed.
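Under the common-root recommendation, the whole tool would need only one Ingress rule no matter how many upstreams are configured. A sketch, with a hypothetical tool name:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rproxy
spec:
  rules:
    - host: example-tool.toolforge.org   # hypothetical tool
      http:
        paths:
          - path: /rproxy                # common root for all proxied upstreams
            pathType: Prefix
            backend:
              service:
                name: rproxy
                port:
                  number: 8000
```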

I'm a bit torn here on which implementation to go after. I can think of roughly equal-length lists of pros and cons for each idea. Optimizing for ease of debugging and usage growth seems like a reasonable plan.

  • I think that ease of debugging knocks the space separated envvar idea out of the running.
  • Making a YAML file and loading it into a single envvar with cat $YAML | toolforge envvars create RPROXY_CONFIG is actually not too horrible as a workflow. toolforge envvars show $VAR preserves whitespace including newlines in the output for reviewing the config.
  • The "normal to exceptional" upgrade path of the RPROXY_COUNT idea somehow feels like the least complicated for users at first. Things get a bit weird when thinking about how that pattern might work over time if a tool reduced the number of reverse proxies in use. Removing anything other than the tail of the list would require renumbering all of the envvars, which is not something we have tooling support for.

I think I'm talking myself into the YAML in an envvar option.

@Premeditated Your Tool-MoeData project is what inspired me to invent the ingress-only reverse proxy pattern. I think that this new container is ready to replace that solution for you. Here's what I think that might look like:

$ ssh dev.toolforge.org
$ become moedata
$ kubectl delete ingress proxy-musicbrainz
$ kubectl delete service musicbrainz
$ kubectl delete ingress proxy-scdn
$ kubectl delete service i-scdn-co
$ kubectl delete ingress proxy-spotify
$ kubectl delete service spotify
$ kubectl delete ingress proxy-tatsumo
$ kubectl delete service tatsumo
$ toolforge envvars create RPROXY_CONFIG << '_EOF'
---
routes:
  - path: /musicbrainz(/|$)(.*)
    upstream: https://musicbrainz.org
    template: /$2
  - path: /scdn(/|$)(.*)
    upstream: https://i.scdn.co
    template: /$2
  - path: /spotify(/|$)(.*)
    upstream: https://api.spotify.com
    template: /$2
  - path: /tatsumo(/|$)(.*)
    upstream: https://tatsumo.pythonanywhere.com
    template: /$2
_EOF
$ toolforge jobs run \
    --image tool-containers/rproxy:latest \
    --command web \
    --continuous \
    --port 8000 \
    --health-check-http '/healthz' \
    rproxy
$ kubectl apply --validate=true -f - << '_EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rproxy
spec:
  rules:
    - host: moedata.toolforge.org
      http:
        paths:
          - path: /musicbrainz
            pathType: Prefix
            backend: {service: {name: rproxy, port: {number: 8000}}}
          - path: /scdn
            pathType: Prefix
            backend: {service: {name: rproxy, port: {number: 8000}}}
          - path: /spotify
            pathType: Prefix
            backend: {service: {name: rproxy, port: {number: 8000}}}
          - path: /tatsumo
            pathType: Prefix
            backend: {service: {name: rproxy, port: {number: 8000}}}
_EOF

This replaces the four custom Ingress objects and their associated Service objects that you are using now with one job running the new rproxy container and one Ingress that maps the four paths from the prior Ingress objects to the new rproxy service.

I have this same RPROXY_CONFIG running at https://bd808-test.toolforge.org/. If you can make time to come up with test URLs for the various upstream servers, you or I can test them against the bd808-test tool. I have been using /scdn/image/ab67616d0000b27373dc2eca0656689869d88ae9 as one test case as I've been developing things.

Do you have time to help test and switch things over? @taavi has not given me a timeline for finishing T414674: Remove remaining uses of ingress-nginx specific annotations, but I'd like to wrap up my help with it in the next week or two if possible.

bd808 changed the task status from Open to In Progress. Thu, Feb 12, 12:49 AM
bd808 moved this task from To Do to In Dev/Progress on the User-bd808 board.

First of all, this is a very smart and easy Ingress-only solution, @bd808! 🥳

I have followed the steps you suggested, with one exception: since tatsumo is deprecated, I replaced it with the Deezer API (https://api.deezer.com).

Here are the results with some example links:

musicbrainz (problems):
https://musicbrainz.org/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?fmt=json
https://moedata.toolforge.org/musicbrainz/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?fmt=json

https://musicbrainz.org/ws/2/release?label=47e718e1-7ee4-460c-b1cc-1192a841c6e5&fmt=json
https://moedata.toolforge.org/musicbrainz/ws/2/release?label=47e718e1-7ee4-460c-b1cc-1192a841c6e5&fmt=json

https://musicbrainz.org/ws/2/release?artist=65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab&status=bootleg&type=live&fmt=json
https://moedata.toolforge.org/musicbrainz/ws/2/release?artist=65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab&status=bootleg&type=live&fmt=json


scdn (works):
https://moedata.toolforge.org/scdn/image/ab67616d0000b27373dc2eca0656689869d88ae9
https://moedata.toolforge.org/scdn/image/ab67616d0000b273bbd45c8d36e0e045ef640411
https://moedata.toolforge.org/scdn/image/ab676161000051749175feab318f624eb4cc0bfe
https://moedata.toolforge.org/scdn/image/ab67616d0000b2731622b92a310a2e33e679f6bf

Deezer (works):
https://moedata.toolforge.org/deezer/album/upc:886447863329
https://moedata.toolforge.org/deezer/album/104290392
https://moedata.toolforge.org/deezer/album/104290392/image
https://moedata.toolforge.org/deezer/artist/145

Spotify (works, requires an API key):
https://moedata.toolforge.org/spotify/v1/albums/4aawyAB9vmqN3uQ7FjRGTy?market=en
https://moedata.toolforge.org/spotify/v1/artists/0TnOYISbd1XYRBk9myaseg

For MusicBrainz (MB), however, I am getting the error: Webservice is unreachable.

I have set MB to:

- path: /musicbrainz(/|$)(.*)
  upstream: https://musicbrainz.org
  template: /$2

The API root URL is https://musicbrainz.org/ws/2/, but it should work anyway.

I didn't have any luck in testing musicbrainz with arbitrary urls before I had pinged you either. Do you still have the prior ingress + service configuration that was working to proxy musicbrainz? I'm hoping I can stare at that log enough to realize what my service needs to do differently to work as well.

I’m not aware of any conflicting Ingress and Service configuration.

I mean the objects that kubectl delete ingress proxy-musicbrainz and kubectl delete service musicbrainz would have removed. I'm wondering if you have the YAML that would have created them, and/or if the live objects are in the moedata Kubernetes namespace still.

I have checked kubectl get ingress; moedata-subdomain and rproxy are the only ones. I do have the YAML files with the old config in my home dir, but they are not applied.

@Premeditated I think we are talking past each other. I'll try a reset. I read your comments in T414836#11612398 as stating that 3 of 4 reverse proxies worked using my new service. The musicbrainz one however is failing. I assumed that the prior solution works (or worked if uninstalled) for musicbrainz and was hoping that you had that configuration around still so I could try and reverse engineer a solution from it.

I'll look at the moedata-musicbrainz-ingress.yaml and see what I can figure out.

moedata-musicbrainz-ingress.yaml
#
# WARNING: this file has been edited by WMCS admins
# see https://phabricator.wikimedia.org/T294547 for more information
#
---
apiVersion: v1
kind: Service
metadata:
  name: musicbrainz
spec:
  type: ExternalName
  externalName: musicbrainz.org
...
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: proxy-musicbrainz
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/upstream-vhost: musicbrainz.org
    nginx.ingress.kubernetes.io/backend-protocol: http
spec:
  rules:
    - host: moedata.toolforge.org
      http:
        paths:
          - backend:
              serviceName: musicbrainz
              servicePort: 80
            path: /musicbrainz(/|$)(.*)
...

When I run the rproxy golang binary directly from my laptop with no upstream Toolforge ingress proxies I can get proxied responses back from musicbrainz.

bd808@mbp03:~/projects/wmf/toolforge-tools/containers-rproxy$ ./containers-rproxy &
[1] 34930
{"time":"2026-02-12T18:54:06.636275-07:00","level":"INFO","msg":"Compiled route","route":{"Path":"/musicbrainz(/|$)(.*)","Upstream":"https://musicbrainz.org","Template":"/$2","PathRegexp":"/musicbrainz(/|$)(.*)","UpstreamURL":{"Scheme":"https","Opaque":"","User":null,"Host":"musicbrainz.org","Path":"","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}}
mbp03:~/projects/wmf/toolforge-tools/containers-rproxy  (git main)
bd808$ curl -v 'localhost:8000/musicbrainz/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?fmt=json'
* Host localhost:8000 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8000...
* Connected to localhost (::1) port 8000
> GET /musicbrainz/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?fmt=json HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
{"time":"2026-02-12T18:54:25.827966-07:00","level":"DEBUG","msg":"Proxying request","upstream":"https://musicbrainz.org","oldPath":"/musicbrainz/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da","newPath":"/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da","headers":{"Accept":["*/*"],"User-Agent":["curl/8.7.1"]}}
< HTTP/1.1 200 OK
< Access-Control-Allow-Origin: *
< Content-Type: application/json; charset=utf-8
< Date: Fri, 13 Feb 2026 01:54:26 GMT
< Etag: W/"c5f75fe2aac03ee6afa2fd7a3c28bcb5"
< Server: Plack::Handler::Starlet
< Vary: Accept-Encoding
< X-Cache-Status: STALE
< X-Ratelimit-Limit: 1200
< X-Ratelimit-Remaining: 1017
< X-Ratelimit-Reset: 1770947668
< X-Runtime: 0.025985
< Transfer-Encoding: chunked
<
* Connection #0 to host localhost left intact
{"id":"5b11f4ce-a62d-471e-81fc-a69a8278c7da","country":"US","ipis":[],"gender":null,"gender-id":null,"disambiguation":"1980s–1990s US grunge band","sort-name":"Nirvana","name":"Nirvana","type-id":"e431f5f6-b5d2-343d-8b36-72607fffb74b","life-span":{"begin":"1987","end":"1994-04-05","ended":true},"isnis":["0000000123486830","0000000123487390"],"end-area":null,"type":"Group","begin-area":{"sort-name":"Aberdeen","disambiguation":"","name":"Aberdeen","type":null,"type-id":null,"id":"a640b45c-c173-49b1-8030-973603e895b5"},"area":{"id":"489ce91b-6658-3307-9877-795b68554c98","type-id":null,"name":"United States","disambiguation":"","sort-name":"United States","type":null,"iso-3166-1-codes":["US"]}}

This works when the requesting User-Agent is Firefox instead of curl too.

When I am running the same rproxy config behind the Toolforge ingress there are a number of headers added to the request by the ingress proxies, and I think that one or more of them are triggering the upstream to return a 403 or similar error status. I will poke at this more and see if I can prove that.

After much poking I am still unable to recreate the error locally. I have figured out a little bit more about what is happening, however. When running on Toolforge from the build service managed container, the https://bd808-test.toolforge.org/musicbrainz/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?fmt=json request triggers an EOF error at some point inside ServeHTTP's call to transport.RoundTrip(outreq). That error is handed to the default error handler for the ReverseProxy which then returns a 502 response. Something in the Toolforge front proxies (I think the nginx-ingress layer?) sees the 502 and decides to hand off to the fourohfour tool to respond.

The part of all of this that is most mysterious to me at the moment is that I cannot recreate the EOF error locally. Writing this up just made me realize that I have not tried running the build service container locally as a reproduction environment. I'll try that.

Still no joy. Running the container locally works just as well as running the compiled binary locally does. Things that could still be different:

  • musicbrainz.org resolving to IPv6 locally, IPv4 on Toolforge?
  • The nginx-ingress inbound reverse proxy
  • The Toolforge HAProxy front proxy

My current assumption has been that the EOF error comes from golang trying to read a chunked response from the upstream, but I'm not easily seeing how the addition of front proxies would cause that at the moment.