MoeData causes visiting browser to load data from 3rd party sites
Closed, Resolved (Public)

Description

Eventually, T130748 (Add Content-Security-Policy header enforcing 3rd party web interaction restrictions to proxy responses) may be changed from report-only mode to enforcement mode, and then these requests will break. As these interactions are core to the tool's functionality, adding a reverse proxy with a restrictive allow list for proxied URLs to the tool itself is probably the "best" way to present the desired content without exposing the user to direct interaction with 3rd party hosting and potential tracking. This could be done with a PHP script that does the proxying.

Recent CSP violation reports can be seen at https://csp-report.toolforge.org/search?ft=moedata

Event Timeline

adding a reverse proxy with a restrictive allow list for proxied URLs to the tool itself is probably the "best" way to present the desired content without exposing the user to direct interaction with 3rd party hosting and potential tracking.

After a bit of chat on IRC I realized that this description makes sense to me, but probably is not completely clear to everyone. One of the dangers of being a backend nerd for a long time is losing track of what is and is not easily acquired knowledge. I will try to add some additional details to help explain my meaning.

A reverse proxy in this context is an HTTP endpoint that somehow receives a description of the content that is desired, fetches that content itself from some other HTTP endpoint, and then returns the response to the original requestor. Anyone accessing a Toolforge tool actually interacts with at least one reverse proxy, the tools.wmflabs.org & *.toolforge.org 'dynamicproxy' service. Our dynamicproxy is built using Nginx, Lua, and Redis, but could be rebuilt using other technologies. Regardless of how it is actually built, a reverse proxy with a restriction on what content it will process does basically the same things:

  1. The proxy receives an incoming HTTP request from a web browser or other user agent.
  2. The proxy examines the request to decide if it is allowed to answer it.
  3. If the request is not allowed, the proxy returns an error response.
  4. If the request is allowed, the proxy figures out the URL of the real content being requested.
  5. Once the real URL is determined, the proxy calls that URL itself and waits for the 'upstream' server to return a response.
  6. When the response is received from the upstream server, it is passed on to the client that is waiting for a response to its original request.

In the case of this tool, right now the proxy should only allow requests for resources from the https://tatsumo.pythonanywhere.com/ or https://i.scdn.co/ upstream servers. This simplifies the problem of authorization. The tool only needs to proxy a limited number of upstream servers rather than arbitrary upstreams. The upstream servers can be hardcoded into the proxy.
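As a rough illustration of the six steps with a hardcoded allow list, here is a minimal Python sketch. It is not the tool's code and not the PHP script mentioned in the description. The `/tatsumo` path prefix and the function names are made up for the example, query strings are ignored, and the upstream call is passed in as `fetch` so the logic can be read (and tried) without touching the network.

```python
from typing import Callable, Optional, Tuple

# Hardcoded allow list: path prefix on the tool -> upstream base URL.
# "/tatsumo" is a hypothetical prefix chosen for this example.
UPSTREAMS = {
    "/scdn": "https://i.scdn.co",
    "/tatsumo": "https://tatsumo.pythonanywhere.com",
}


def resolve_upstream(path: str) -> Optional[str]:
    """Steps 2-4: return the real URL for an allowed path, else None."""
    for prefix, base in UPSTREAMS.items():
        if path == prefix or path.startswith(prefix + "/"):
            return base + (path[len(prefix):] or "/")
    return None


def handle(path: str, fetch: Callable[[str], bytes]) -> Tuple[int, bytes]:
    """Steps 1-6 for one request; `fetch` performs the upstream call."""
    url = resolve_upstream(path)
    if url is None:
        return 403, b"upstream not allowed"  # step 3: refuse
    return 200, fetch(url)  # steps 5 and 6: call upstream, pass it on


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP client; echoes the URL it was asked for."""
    return url.encode()


print(handle("/scdn/image/abc", fake_fetch))
print(handle("/elsewhere/image/abc", fake_fetch))
```

Because the client only ever supplies a path under a known prefix, it cannot steer the proxy to an arbitrary host; anything outside the two prefixes is refused before any upstream call is made.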

I can think of two different ways to implement reverse proxies for these hosts in the MoeData tool. One is writing a PHP script and the other is using Service and Ingress objects in the 2020 Kubernetes cluster. There are various libraries and tutorials that can be found online for building a basic proxy script in PHP. Unfortunately, many of them do not have great security or performance. Figuring out how to do this with the Kubernetes objects is pretty interesting to me, so I thought I would poke at that first.

A bit of web searching led me to a blog post by Elvin Efendiev which had enough clues to get me started. The blog post explains that the Kubernetes ingress layer needs a Service object to tell Kubernetes how to find the upstream server and an Ingress object to route inbound requests to the Service. I came up with this proof of concept using my bd808-test tool:

T250922-proxy.yaml
---
# Service object for routing requests to i.scdn.co
apiVersion: v1
kind: Service
metadata:
  name: i-scdn-co
  namespace: tool-bd808-test
spec:
  type: ExternalName
  externalName: i.scdn.co
...
---
# Ingress object for routing requests to i.scdn.co
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: proxy-scdn
  namespace: tool-bd808-test
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/upstream-vhost: i.scdn.co
    nginx.ingress.kubernetes.io/backend-protocol: https
    nginx.ingress.kubernetes.io/server-snippet: |
      proxy_ssl_name i.scdn.co;
      proxy_ssl_server_name on;
spec:
  rules:
    - host: bd808-test.toolforge.org
      http:
        paths:
          - backend:
              serviceName: i-scdn-co
              servicePort: 443
            path: /scdn(/|$)(.*)
...
$ kubectl apply --validate=true -f T250922-proxy.yaml
service/i-scdn-co created
ingress.networking.k8s.io/proxy-scdn created

With these Kubernetes objects running on the cluster, https://bd808-test.toolforge.org/scdn/image/ab67616d0000b27373dc2eca0656689869d88ae9 returns the image from https://i.scdn.co/image/ab67616d0000b27373dc2eca0656689869d88ae9!
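The path mapping comes from the Ingress `path` pattern `/scdn(/|$)(.*)` combined with the `rewrite-target` annotation `/$2`. The Python sketch below only imitates that rule to show which upstream path a request ends up with. It assumes the pattern is anchored at the start of the request path, and Python's `re` module is not the regex engine Nginx uses, so treat it as an approximation rather than a description of the ingress controller's exact behavior.

```python
import re
from typing import Optional

# Same pattern as the Ingress `path`, anchored at the start of the path.
PATH_PATTERN = re.compile(r"^/scdn(/|$)(.*)")


def rewrite(path: str) -> Optional[str]:
    """Return the upstream path for `path`, or None if the rule does not match."""
    match = PATH_PATTERN.match(path)
    if match is None:
        return None
    # rewrite-target "/$2": a slash followed by the second capture group.
    return "/" + match.group(2)


print(rewrite("/scdn/image/ab67616d0000b27373dc2eca0656689869d88ae9"))
print(rewrite("/scdnfoo"))
```

The `(/|$)` group is what keeps a path like `/scdnfoo` from matching: after `/scdn` the path must either end or continue with a slash.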

There was one step I had to do to make this work that I didn't describe yet. The 2020 Kubernetes cluster places quotas on each namespace. The default quota for each tool only allows one Service object. This is fine for normal usage, but to use this reverse proxy technique we need to add another Service for each upstream server. I increased the quota on Service objects for the bd808-test tool from 1 to 5 to test this. To implement this solution for the MoeData tool a similar change would be needed there.

$ kubectl describe quota tool-moedata
Name:                   tool-moedata
Namespace:              tool-moedata
Resource                Used   Hard
--------                ----   ----
configmaps              1      10
limits.cpu              500m   2
limits.memory           512Mi  8Gi
persistentvolumeclaims  0      3
pods                    1      4
replicationcontrollers  0      1
requests.cpu            150m   2
requests.memory         256Mi  6Gi
secrets                 1      10
services                1      3
services.nodeports      0      0

I have now added https://tatsumo.pythonanywhere.com/ and https://i.scdn.co/. If it's possible to get one more for Musicbrainz.org, I would be satisfied.

I'm also using the Wikidata API under the toolforge.org domain. But I get the following error message:

Access to fetch at 'https://www.wikidata.org/w/api.php?action=wbgetentities&props=aliases|labels|claims&languages=en&redirects=no&format=json&ids=Q82050732' from origin 'https://moedata.toolforge.org' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

Mentioned in SAL (#wikimedia-cloud) [2020-04-25T22:31:51Z] <bd808> Bumped Service quota from 3 to 5 (T250922, T246592)

I have now added https://tatsumo.pythonanywhere.com/ and https://i.scdn.co/. If it's possible to get one more for Musicbrainz.org, I would be satisfied.

{{done}}. I bumped the quota to 5 so you will have a small amount of room for growth.

I'm also using the Wikidata API under the toolforge.org domain. But I get the following error message:

Access to fetch at 'https://www.wikidata.org/w/api.php?action=wbgetentities&props=aliases|labels|claims&languages=en&redirects=no&format=json&ids=Q82050732' from origin 'https://moedata.toolforge.org' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

https://www.mediawiki.org/wiki/API:Cross-site_requests#CORS_usage -- Add an origin=* parameter to the Action API calls you are making to tell MediaWiki to add the needed Access-Control-Allow-Origin header to the response.
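To make the change concrete, here is a small Python sketch that rebuilds the query string from the error message above with `origin=*` appended. The tool itself makes this call from the browser with `fetch`, so this only shows what the final URL should look like; the function name is made up for the example.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def wbgetentities_url(entity_id: str) -> str:
    """Build an anonymous cross-site Action API URL for one entity."""
    params = {
        "action": "wbgetentities",
        "props": "aliases|labels|claims",
        "languages": "en",
        "redirects": "no",
        "format": "json",
        "ids": entity_id,
        "origin": "*",  # asks MediaWiki to send Access-Control-Allow-Origin
    }
    # Keep "*" and "|" unescaped so the URL reads like the one in the report.
    return API + "?" + urlencode(params, safe="*|")


print(wbgetentities_url("Q82050732"))
```

Note that `origin=*` is meant for anonymous requests; the linked page describes the stricter setup needed for authenticated cross-site calls.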

bd808 assigned this task to Premeditated.

Based on my manual inspection of browser activity driven by https://moedata.toolforge.org/album/2ABAeQdTwWlZZj4cW2zOWX and the status of https://csp-report.toolforge.org/search?ft=moedata (on 2020-07-02) I think that @Premeditated has been able to fix the 3rd party interaction leaks.