Followup actionable from https://wikitech.wikimedia.org/wiki/Incident_documentation/20200511-thumbor
During the incident, it became clear that if we had a very short cache lifetime (e.g. 5-10 minutes) for thumbnail 404s, far fewer requests would have reached the service, which would have greatly mitigated the incident.
That, however, would open a path for cache pollution attacks. One that easily comes to mind is the following:
- Race condition, e.g. thumbnails being requested before the original has been uploaded, so that the thumbnail takes some time to be generated. With request coalescing at the edge (which we currently have) and a short cache period (even 5-10 minutes might be too much), that race condition would probably be mitigated before it became enough of a nuisance.
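
To make the trade-off concrete, here is a minimal single-process sketch of short-lived negative caching. It is only an illustration under stated assumptions: `NegativeCache`, `backend`, `ttl_seconds` and `clock` are hypothetical names, not our actual edge cache configuration, and request coalescing is not modelled because the sketch is single-threaded.

```python
import time
from typing import Callable, Dict


class NegativeCache:
    """Hypothetical sketch: cache only 404 responses, for a short TTL."""

    def __init__(
        self,
        backend: Callable[[str], int],
        ttl_seconds: float = 300.0,  # assumed value, roughly the 5-minute end of the range
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.backend = backend          # function mapping a URL to an HTTP status code
        self.ttl = ttl_seconds
        self.clock = clock              # injectable so the behaviour can be tested
        self.backend_hits = 0           # number of requests that reached the backend
        self._expiry: Dict[str, float] = {}  # URL -> time at which the cached 404 expires

    def get(self, url: str) -> int:
        now = self.clock()
        expiry = self._expiry.get(url)
        if expiry is not None and now < expiry:
            # A 404 is still cached: answer without touching the backend.
            return 404
        # No cached 404, or it has expired: drop it and ask the backend.
        self._expiry.pop(url, None)
        self.backend_hits += 1
        status = self.backend(url)
        if status == 404:
            # Remember the miss, but only for a short period.
            self._expiry[url] = now + self.ttl
        return status
```

With this sketch, repeated requests for a missing thumbnail reach the backend once per TTL window instead of once per request. The flip side is the race described above: if the original is uploaded while a 404 is cached, clients keep getting the 404 until the TTL expires, which is why the TTL has to stay short.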