
Identify the provenance of image requests
Open, Needs Triage, Public

Description

Problem

As part of the work under WE5.4 to protect our infrastructure from abusive scraping, we want to be able to understand the provenance of image requests. This means being able to distinguish when and where a URL to an image was generated.

This will allow us to use this information as a signal in request filtering at the CDN, by helping to determine if a request is coming from a browser session visiting the website, an API query, from dumps or if they are the result of hotlinking.

Approach

Generate signed URLs for image requests, by adding query parameters that contain the provenance information and a signature that can be trivially validated at the CDN. The signature should be an HMAC that includes the URL, source (web, api, dumps), timestamp and a secret.
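As a hedged sketch of this approach (parameter names `prov`, `ts`, `sig` and the payload layout are illustrative, not an agreed format):

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Hypothetical shared secret; in practice this would be provisioned to
# MediaWiki and the CDN, not hardcoded.
SECRET = b"replace-with-shared-secret"

def sign_image_url(url, source, ts=None):
    """Append provenance (source, timestamp) and an HMAC-SHA256 signature.

    source is one of the provenance values discussed above, e.g.
    "web", "api", "dumps".
    """
    ts = int(time.time()) if ts is None else ts
    payload = f"{url}|{source}|{ts}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return url + "?" + urlencode({"prov": source, "ts": ts, "sig": sig})
```

The exact contents of the payload (whole URL vs. filename vs. provenance only) is the open question discussed in the comments below.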

  1. Acceptance criteria
    • Generated image URLs include provenance query parameters
    • Generated image URLs include an HMAC signature
    • Signature contents and HMAC algorithm agreed with SRE
    • SRE can configure the CDN based on the source that generated an image URL
    • SRE can configure the CDN based on the freshness of an image URL
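The last two criteria imply a CDN-side check. A minimal sketch of what that validation could look like, assuming hypothetical `prov`/`ts`/`sig` query parameters and an HMAC-SHA256 over `URL|source|timestamp` as described in the Approach (the freshness window is also an assumption):

```python
import hashlib
import hmac
import time
from urllib.parse import urlsplit, parse_qs

SECRET = b"replace-with-shared-secret"
MAX_AGE = 7 * 24 * 3600  # hypothetical freshness window, in seconds

def verify_image_url(signed_url, now=None):
    """Return True if the URL carries a valid, fresh provenance signature."""
    now = int(time.time()) if now is None else now
    parts = urlsplit(signed_url)
    q = parse_qs(parts.query)
    try:
        source, ts, sig = q["prov"][0], int(q["ts"][0]), q["sig"][0]
    except (KeyError, ValueError):
        return False  # missing or malformed provenance parameters
    if now - ts > MAX_AGE:
        return False  # stale URL; CDN policy can treat this differently
    base = parts.scheme + "://" + parts.netloc + parts.path
    payload = f"{base}|{source}|{ts}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

In production this logic would live in VCL/Lua at the edge rather than in Python; the sketch only illustrates the shape of the check.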

Event Timeline

Needs an exact definition of the provenance parameters. (UTC day + web/api? @Joe said somewhere else that we should just avoid doing this for dumps so they get grouped with other unknown-provenance requests. If we do need provenance info for dumps, not sure we are the right team for that.) Note that there are implications to client-side caching and thus site performance.

I can think of three strictness levels for the signature:

  • Include the exact URL (without query parameters, presumably). Any change to the URL (e.g. resizing, language change for SVG) would require a new API request. Seems to me like a pointless hoop to make clients jump through.
  • Include the filename. Clients need to obtain a legitimate URL but then can manipulate details like size (e.g. currently MediaViewer does this to improve performance). If you wanted to fetch a large batch of files, you'd need to make a bunch of API requests, so this would still generate a bunch of additional API traffic (OTOH it would presumably make image traffic easier to police since we have more controls for API traffic). There are some nuances to how to implement this as the canonical filename and the filename in the URL (which is what the edge has easy access to) differ in some obscure ways.
  • Do not include anything other than the provenance parameters (ie. once you have the signature you can reuse the provenance info for other images). Makes things simpler for clients and reduces API traffic but easier to circumvent. Still enough to identify scrapers unless they are actively malicious and have Commons-specific logic to avoid detection.
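The three strictness levels differ only in what goes into the HMAC payload. A hypothetical sketch (function and parameter names are mine, not an agreed design):

```python
import hashlib
import hmac
from urllib.parse import urlsplit

SECRET = b"replace-with-shared-secret"

def hmac_hex(payload):
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()

def sig_exact_url(url, prov):
    # Strictest: any change to the URL (resize, SVG language) invalidates it.
    base = urlsplit(url)._replace(query="").geturl()
    return hmac_hex(f"{base}|{prov}")

def sig_filename(url, prov):
    # Middle: clients may change size etc. as long as the filename matches.
    # (Glosses over the canonical-vs-URL filename nuances noted above.)
    filename = urlsplit(url).path.rsplit("/", 1)[-1]
    return hmac_hex(f"{filename}|{prov}")

def sig_provenance_only(prov):
    # Loosest: the signature is reusable across images with the same provenance.
    return hmac_hex(prov)
```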

We'd need to decide whether to generate the signed URL for images inside page content during parsing or on the fly. Again, affects client-side caching.

There are two ways to include provenance in the URL: upload.wikimedia.org/.../Blah.jpg?prov=... and upload.wikimedia.org/.../prov=.../Blah.jpg (or similar; ie. query part or path part). Including it in the query is a bit more likely to break clients / callers which expect a URL with no query part (and do something like $url . '?foo=bar'). Including it in the path will probably need adjustments to MediaWiki's own URL parsing, and probably some Apache / other traffic configuration. The latter seems worse.

Probably can be implemented in File::getUrl() / File::getThumbUrl() / File::getArchiveUrl() / File::getArchiveThumbUrl(), which pretty much everything passes through? We'd have to audit callers and see whether they break. Also, there are only hacky ways to differentiate web/API at that point.

My notes from meeting between Jonathan, MW devs, and SRE (Valentin, Chris, Giuseppe):

  • We believe the highest-volume scrapers/bots that hit upload.wikimedia.org aren't scraping Wikipedia or Commons through wiki pages that contain images. And, they (generally) do not manipulate the URL for other widths or file names. Instead, they tend to request a list of images from the Action API (presumably on Commons) and then download the original files directly.
  • When we're under heavy load in terms of intra-DC bandwidth (i.e. due to cache misses on media URLs), we currently try to protect/prioritize some traffic above others based on various heuristics. The heuristics we have are fairly crude and lack context. We can differentiate real browsers from bots with some level of confidence today, but this is not enough by itself because non-abusive bots/tools/apps also download and re-use images and are not browsers. For our main traffic ("text" pageviews and API calls) we have:
    • the x-trusted-request framework to exempt logged-in users/bots (regardless of the client being a browser),
    • a stable Referer header, because first-party API calls and subsequent pageviews are same-origin and thus have a reliable Referer header. This is in contrast to cross-origin requests, where privacy settings in browsers are much more likely to strip it entirely. The default Referrer-Policy of strict-origin-when-cross-origin is fine, but there are enough legitimate users with stricter settings that this is not a strong signal for upload.wikimedia.org.
      • To the extent we have it, we already use Referer as a signal, but it's not good enough, especially for originals, and trivial to bypass.
    • Session cookies. Unlike on wiki domains, we don't store session cookies on upload.wikimedia.org. The request for that is T414337: Identify requests for media files from logged-in users.
    • Edge uniques to start rate limits low for a fresh/cookieless client (and grow as a client gains reputation, thus making it less effective to use many IPs). Unlike on wiki domains, browsers limit the retention of upload.wikimedia.org cookies (treated as third-party cookie), and fragment them between wikis (not shared) which means when a browser that has high reputation on wikipedia.org visits wikinews.org, it starts fresh, including for upload.wikimedia.org requests from there.
      • From Turnilo, text breaks down as 8+ (40%), 0 days (29%), 1-7 (29%), upload breaks down as 8+ (42%), 0 days (40%), 1-7 (17%). A lot more 0-day traffic in the default/baseline scenario.
  • Signed URLs could let us detect URL tampering, as the hash would contain e.g. filename and width in it, so if you modified those the hash would no longer match. Today, clients can request uncached widths that aren't used on-wiki. However, T402792: Consider rate limiting non-standard thumbnail sizes is in progress and will already let us throttle this to a very low rate (i.e. we can satisfy legacy hotlinks, which should be static, easy to cache, and different in behavior from abusive patterns like enumerating all file names through a botnet, or requesting arbitrary widths). Alternatively, we could even redirect to a standard width, after we address first-party edge cases with CSS tracked at T402792.
  • Signed URLs could let us distinguish API calls from first-party embeds, if we include a bit for this in the hash. This would be akin to a more reliable Referer header: something akin to ?ref=wikipedia.org should work regardless of Referrer-Policy and also reliably cover use outside web browsers where Referer headers don't exist, e.g. when a logged-in account uses a bot or tool to interact with page HTML.
    • We have lots of first-party API calls that will make this harder. Such as: Mediaviewer, Images in search suggestions (OpenSearch API), Images in popups (Page Summary API via PCS), Images shown in mobile apps (Action API)
  • Timo asked Google about potential impact of signed URLs on Google Images (recurring Google/Wikimedia partnership meeting, public notes), and they shared the same concern as I had originally, which is that this would likely lead to duplication of results and dilute ranking. And while this may be an emerging trend on other sites (e.g. image attachments on GitHub issues, Facebook post images, etc), they generally do this behind a login wall, or they opt-out Googlebot. Even then the URLs would still emerge elsewhere via the API and on other sites, which is exactly what we rely on for ranking. It seems to me it would also affect keywords, since image ranking relies heavily on host pages for relevant keywords. There is no equivalent of <link rel=canonical> for image files. Google suggested we go with a query string (no path segment) and specifically a UTM parameter, which, for better or worse, is de facto stripped for purposes of determining the canonical URL.
  • MediaWiki by default does not generate arbitrary thumbnails on-demand. Only thumbnails used in articles, at widths used, are generated on disk. Anything else is a 404. To reliably get a thumbnail URL, it must come from a page, or from an API that creates it (and thus has a natural place to detect abuse, with APIs falling under "text" where the above mitigations work already). The 404-handler approach is specifically what we use in WMF production as an optimization. It might be worth long-term reconsidering this. This is effectively the same as what we're trying to do with signed URLs, but without requiring any complexity in URL variants or tracking. The files just wouldn't exist. This isn't trivial, as we'll need to think about async generation, and what we serve in the interim, especially for multi-page files or multilingual files where there isn't an obvious fallback that we serve while a thumbnail is being generated.

As a minimum starting point from where we can learn how well it works, and what we might need more, we will (@Krinkle, @matmarex):

  • Add a basic provenance parameter to first-party traffic to original files. That is, we'll append something like ?ref=wikipedia.org to non-thumb media URLs served on-site, and leave the rest as-is. No modification, breaking down, or tracking of external traffic. This is where we expect the biggest gain, because originals are meant to be rare, except where an image is so small that the original serves as the thumbnail, which this parameter will let us distinguish.
    • Scope: skinned pageviews, logged-in API calls, mw-api-int calls.
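The MVP step above is essentially just appending a query parameter. A minimal sketch, assuming the ?ref=wikipedia.org name mentioned above, that preserves any existing query string (addressing the breakage concern about callers doing naive string concatenation):

```python
from urllib.parse import urlsplit, urlunsplit, urlencode, parse_qsl

def add_ref(url, ref="wikipedia.org"):
    """Append a ref= provenance parameter to an original-file URL,
    preserving any query string the URL already has."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query)
    query.append(("ref", ref))
    return urlunsplit(parts._replace(query=urlencode(query)))
```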

Open questions that we'll try to answer after this:

  • What about API calls? We might not need it if organic traffic naturally hits the cache, in which case we can use the lack of provenance as a signal on cache misses. Revisit based on MVP.
  • What about video files? Treat transcode as original or as thumb? In video player? Does it use the API?
  • What about download buttons? I.e. the implicit "original" link on the File description page, and the download link in Mediaviewer. They might be rare enough and fine as-is, or we may need to tag them.

There is no equivalent of <link rel=canonical> for image files.

FWIW there is, the Link header, but I doubt anything understands it.

FWIW there is, the Link header, but I doubt anything understands it.

I think that might work actually.

I was familiar with it, and the header is part of an HTTP standard (RFC 8288), but the individual relationships are not standardised at the HTTP level. For example, rel=canonical is standardised in the WHATWG HTML spec (both the tag and header). There is no standard for rel=canonical on other resources, so I didn't consider it.

Google supports the Link header, but I don't think we should adopt it for HTML because fewer crawlers will support it and it's less portable (headers are easily lost once the HTML is saved or passed through application layers).

Google Web Search does index formats beyond HTML, such as PDF and DOCX resources. Their support page actually demonstrates the Link header on a DOCX resource, pointing to a PDF (despite being non-standard). This is still within Web Search, though, so it might be limited to Googlebot (vs Googlebot-Image for Google Images Search). I will ask them in the next partnership meeting.

If this works, it might also solve the thumbnail-size problem where Google Images often points to small thumbnails (due to Wikipedia articles embedding those) instead of the medium-size previews from the Commons file description page, creating a poor experience (I can't reproduce this in Google or Bing today, maybe something solved this, e.g. JSON-LD for Commons). The Link header could nudge catalogs like Google Images to a higher resolution (e.g. 1024px instead of 250px), akin to what we present in Mediaviewer and on file description pages.