
Identify requests for media files from logged-in users
Open, Needs Triage, Public

Description

Problem

As part of the work under WE5.4 to protect our infrastructure from abusive scraping, we want to be able to distinguish between requests for media files that come from authenticated and anonymous users.

This will allow us to exclude our logged-in users, including community bots, from any filtering we implement for access to media files, in particular for generating thumbnails not currently in cache.

Approach

Enable identification of logged-in users requesting media files by supporting authentication cookies on upload.wikimedia.org. The CDN can then validate the JWT session cookie and use the result as a signal in abuse filtering.

Acceptance criteria

  • JWT session cookies included on requests to upload.wikimedia.org
  • Other cookies are not required to be present, but may be included
  • SRE can configure CDN to validate JWT sessions on media requests

Event Timeline

In general, sending cookies to a different site is hard, as browsers consider it cross-site tracking and implement various countermeasures. So I expect this would be a lot of work and, possibly, even then too unreliable to be useful.

Some implementation options and the problems I can see with them:

  1. Expose an upload.wikimedia.org endpoint that can set cookies somehow. I'd much prefer that not be a MediaWiki endpoint, as that comes with all kinds of complications (MediaWiki PHP code would run with the upload VCL rules applied, i.e. totally different edge behavior). It could be a standalone PHP file, or even just VCL logic (as simple as upload.wikimedia.org/setCookie?value=<...> setting a predefined cookie to the given value and then redirecting back).

We could set the cookie during login (since auth.wikimedia.org is same-site with upload.wikimedia.org), but once the user is on a non-wikimedia.org site, modern browsers will refuse to send the cookie with embedded requests (such as from an <img> tag), since they treat cookies on cross-site requests as third-party tracking unless the user has interacted with the cookie's domain.
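A minimal sketch of the response logic such a setCookie endpoint could implement, assuming a hypothetical cookie name and query format (the real thing would be VCL or a standalone PHP file, as noted above):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical cookie name; the real name would be whatever the CDN checks for.
COOKIE_NAME = "uploadSession"

def handle_set_cookie(request_url: str,
                      return_to: str = "https://en.wikipedia.org/"):
    """Sketch of upload.wikimedia.org/setCookie?value=<...>: set a predefined
    cookie to the given value, then redirect back.
    Returns (status, headers)."""
    qs = parse_qs(urlparse(request_url).query)
    value = qs.get("value", [""])[0]
    if not value:
        return 400, []
    headers = [
        # SameSite=None + Secure so the cookie can accompany cross-site <img>
        # loads -- subject to the browser tracking protections discussed above.
        ("Set-Cookie", f"{COOKIE_NAME}={value}; Secure; HttpOnly; SameSite=None; Path=/"),
        ("Location", return_to),
    ]
    return 302, headers
```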

  2. Set the cookie on the wikimedia.org parent domain while the user is on auth.wikimedia.org. Such cookies are insecure (non-MediaWiki apps on various *.wikimedia.org subdomains can read and write them), but that's fine as long as the cookie content cannot be abused (i.e. it is not a session / authentication cookie). Since SUL3 ensures that the user interacts with auth.wikimedia.org during login, these cookies will work.

The user-interaction flag expires after a while, though (7 to 45 days depending on the browser), and the cookie then becomes inaccessible again. How do we detect that and recover? In this situation the edge is communicating with MediaWiki solely through the contents of an <img> tag (or SVG, or video, or whatever). Probably the least evil option is to set an onerror handler that retries the image load from JavaScript via a CORS-enabled fetch() to get access to the headers, or calls a dedicated auth.wikimedia.org API endpoint; but it would still be very messy. And what do we do about it, send the user to reauthenticate every week?

I guess we could set a partitioned cookie in an AJAX request, which would mean the cookie works on the given top-level site (e.g. en.wikipedia.org) but not on other domains; this would have to be repeated every time the user changes wikis, with UX implications since images would first fail to load and then have to be loaded again.

Users who have disabled JavaScript would just not see the images, with no indication of what's happening.
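For illustration, here is a sketch of the Set-Cookie header auth.wikimedia.org might emit in this scheme, including the partitioned (CHIPS) variant; the cookie name and the exact attribute choices are assumptions:

```python
def build_wikimedia_cookie(value: str, partitioned: bool = False) -> str:
    """Sketch of a parent-domain Set-Cookie header value.

    Domain=.wikimedia.org makes the cookie visible to upload.wikimedia.org,
    but also to every other *.wikimedia.org app, so the value must not be a
    session/authentication credential -- e.g. a short signed "is logged in"
    assertion instead.
    """
    attrs = [
        f"loggedInHint={value}",  # hypothetical cookie name
        "Domain=.wikimedia.org",
        "Secure",
        "SameSite=None",
        "Path=/",
    ]
    if partitioned:
        # CHIPS: the cookie is keyed to the top-level site under which it was
        # set, so it works on e.g. en.wikipedia.org but must be re-set per wiki.
        attrs.append("Partitioned")
    return "; ".join(attrs)
```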

  3. Serve images via same-site requests (upload.wikidata.org etc.).

Per @Joe this might not be as horrible as it sounds. It does break client-side caching across sites, though, and probably a ton of community tooling as well. In any case, if we take this route, it's mostly SRE work.

  4. Rather than cookies, use a query parameter (similar to T414338: Identify the provenance of image requests).

Simple, but it affects caching (and would probably break some clients, though T414338 will do that anyway). It also seems tricky to get the parameter into image URLs used by gadgets but not into URLs that are meant for sharing.
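A sketch of the URL rewriting this option implies, with a hypothetical `auth` parameter name:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def add_auth_param(media_url: str, token: str) -> str:
    """Append a hypothetical `auth` query parameter to a media URL,
    preserving any existing query string."""
    parts = urlsplit(media_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("auth", token))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Note that every distinct token value becomes a distinct object in the CDN cache, which is the caching impact mentioned above.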

  5. Rather than a cookie or query parameter, use a service worker to intercept resource fetches from upload.wikimedia.org and inject a custom header.

We'd have to horribly abuse a CORS-safelisted request header and fit the payload into 128 bytes; otherwise the request triggers a CORS preflight, which is bad for performance.
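The header injection itself would live in the JavaScript service worker; the payload side can be sketched separately. Here is a hypothetical compact token that stays within the 128-byte value limit the Fetch spec imposes on CORS-safelisted request headers (the key, field layout, and MAC truncation length are all assumptions):

```python
import base64, hashlib, hmac, time

SECRET = b"example-signing-key"  # hypothetical shared key

def compact_token(user_id: int, ttl: int = 3600) -> str:
    """Build a small "logged in" assertion: user id, expiry, truncated HMAC.
    Must stay within 128 bytes so it can ride in a CORS-safelisted header
    value without triggering a preflight."""
    exp = int(time.time()) + ttl
    msg = f"{user_id}.{exp}"
    mac = hmac.new(SECRET, msg.encode(), hashlib.sha256).digest()[:16]
    token = f"{msg}.{base64.urlsafe_b64encode(mac).rstrip(b'=').decode()}"
    assert len(token.encode()) <= 128  # safelisted-header value limit
    return token
```

Truncating the MAC to 16 bytes trades some security margin for size; whether that margin is acceptable for a mere "treat this request more leniently" signal is a design question.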