Problem
As part of the work under WE5.4 to protect our infrastructure from abusive scraping, we want to be able to understand the provenance of image requests. This means being able to distinguish when and where a URL to an image was generated.
This information can then be used as a signal for request filtering at the CDN, helping to determine whether a request comes from a browser session visiting the website, an API query, the dumps, or hotlinking.
Approach
Generate signed URLs for image requests by adding query parameters that carry the provenance information, plus a signature that can be cheaply validated at the CDN. The signature should be an HMAC, keyed with a shared secret, computed over the URL, the source (web, api, dumps) and the timestamp.
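As a rough illustration of the approach, here is a minimal Python sketch of signing and validating such a URL. Everything specific in it is an assumption made for the example and not an agreed design: the parameter names `src`, `ts` and `sig`, the use of HMAC-SHA256, signing only the URL path, the newline-separated message layout, and the placeholder secret. The actual signature contents and algorithm are still to be agreed with SRE.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical placeholder; a real key would come from secure configuration.
SECRET = b"example-secret"


def _signature(path, source, timestamp, secret):
    """HMAC-SHA256 over path, source and timestamp (assumed message layout)."""
    message = f"{path}\n{source}\n{timestamp}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(url, source, secret=SECRET, now=None):
    """Append provenance parameters (src, ts) and a signature (sig) to url."""
    timestamp = int(time.time()) if now is None else now
    parts = urlsplit(url)
    sig = _signature(parts.path, source, timestamp, secret)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [("src", source), ("ts", str(timestamp)), ("sig", sig)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def verify_url(url, secret=SECRET, now=None):
    """Return (source, age_in_seconds) if the signature is valid, else None."""
    now = int(time.time()) if now is None else now
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    try:
        source = params["src"]
        timestamp = int(params["ts"])
        sig = params["sig"]
    except (KeyError, ValueError):
        return None  # missing or malformed provenance parameters
    expected = _signature(parts.path, source, timestamp, secret)
    # Constant-time comparison to avoid leaking signature bytes via timing.
    if not hmac.compare_digest(expected.encode("utf-8"), sig.encode("utf-8")):
        return None
    return source, now - timestamp
```

Returning the source and the age separately, instead of rejecting stale URLs outright, leaves the filtering policy (which sources to allow, how old a URL may be) to whoever configures the validating side.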
Acceptance criteria
- Generated image URLs include provenance query parameters
- Generated image URLs include an HMAC signature
- Signature contents and HMAC algorithm agreed with SRE
- SRE can configure the CDN based on the source that generated an image URL
- SRE can configure the CDN based on the freshness of an image URL