RFC: Use content hash based image / thumb URLs
Open, NormalPublic

Description

This task was split out of T66214, as establishing an API for thumbnails is more pressing than moving to content hash based thumb identifiers. The thumbnail API can accommodate either without too much trouble, which lets us tackle the move to content hash based addressing in a second phase.

Identifying thumbs by content hash instead of human-readable names

Content hash based URLs for media files and thumbnails have some advantages over the current pretty names:

  • automatic cache busting
  • consistency between HTML revisions and the media they reference, in particular for old revisions (important for HTML storage and Parsoid)
  • natural content-based deduplication
  • content-based image blocking (bad image lists etc)
  • media renames don't trigger HTML updates
  • simplifies a potential migration of all media content to commons
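
A minimal sketch of the first and third points, assuming SHA-1 as the hash and a made-up URL layout (the base URL and the `content_url` helper are hypothetical; this RFC does not fix either choice). Because the URL is derived only from the bytes, a new upload gets a new URL, and identical bytes always map to the same URL:

```python
import hashlib


def content_url(data: bytes, ext: str,
                base: str = "https://media.example.org") -> str:
    """Build a URL that depends only on the file's bytes.

    `base` and the path layout are illustrative assumptions,
    not an existing MediaWiki URL scheme.
    """
    digest = hashlib.sha1(data).hexdigest()
    return f"{base}/{digest}.{ext}"


original = b"first upload"
reupload = b"second upload with edits"

# Identical bytes yield the identical URL (deduplication).
same = content_url(original, "jpg") == content_url(original, "jpg")

# Changed bytes yield a different URL, so stale cache entries
# are simply never requested again (cache busting).
busted = content_url(original, "jpg") != content_url(reupload, "jpg")
```

Note that the file name plays no part in the URL, which is why a rename would not require touching the HTML.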

There are also some disadvantages:

  • need to use the Content-Disposition header to suggest a pretty name when saving an image
  • need to think about quick image purging for copyvio cases, as cache busting is not enough there
  • applying access restrictions is more complicated, as it requires querying all image revisions that refer to the hash and choosing which restriction to apply (likely "least-restrictive restriction wins")
  • media edits (i.e. uploading a new version) do trigger HTML updates
  • use of hash collisions for vandalism, should the chosen hash mechanism turn out to be susceptible to practical preimage attacks and reuploads of the same content are allowed (which may be desirable to allow easily fixing data corruption)
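
For the first point, the header could be built roughly as follows. This is only a sketch: the `content_disposition` helper is hypothetical, and it assumes the server can look up a pretty name for a hash. It emits a plain `filename` fallback plus the percent-encoded `filename*` form for non-ASCII names:

```python
from urllib.parse import quote


def content_disposition(pretty_name: str, inline: bool = True) -> str:
    """Return a Content-Disposition value suggesting `pretty_name`.

    Hypothetical helper for illustration; not part of MediaWiki.
    """
    kind = "inline" if inline else "attachment"
    # ASCII-only fallback for older clients; non-ASCII characters
    # become "?", and quote/backslash are replaced to keep the
    # quoted-string well formed.
    fallback = (
        pretty_name.encode("ascii", "replace").decode("ascii")
        .replace('"', "_").replace("\\", "_")
    )
    # UTF-8, percent-encoded form for clients that support filename*.
    encoded = quote(pretty_name, safe="")
    return f"{kind}; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


header = content_disposition("Munich_subway_station_Westfriedhof2.jpg")
```

If several file names share the same hash, the server would still have to pick one of them, which this sketch does not address.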

Event Timeline

GWicke removed brion as the assignee of this task. Nov 2 2016, 9:32 PM
GWicke updated the task description.

FYI, I implemented a partial solution for this on Vagrant a while ago for the current thumbnail URI scheme, by replacing the second instance of the file name in the URI with the SHA1 of the original.

E.g. http://127.0.0.1:6081/images/thumb/d/d7/Munich_subway_station_Westfriedhof2.jpg/800px-Munich_subway_station_Westfriedhof2.jpg becomes http://127.0.0.1:6081/images/thumb/d/d7/Munich_subway_station_Westfriedhof2.jpg/800px-2zeso4ug3i3dsai23sd7thu7kdnndiq.jpg

It only causes minor breakage for code in the wild that read the second name when it needed the original's file name; such code has to be updated to read the first occurrence instead.

The option that enables this is $supportsSha1URLs on the FileRepo.
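
The rewrite described above can be sketched like this. It is not the Vagrant implementation: the helper names are hypothetical, and it assumes the hash is the SHA-1 of the original file written in base 36 and zero-padded to 31 characters, which matches the length of the hash in the example URL.

```python
import hashlib
import posixpath
import string

BASE36_DIGITS = string.digits + string.ascii_lowercase


def sha1_base36(data: bytes, width: int = 31) -> str:
    """SHA-1 of `data` in base 36, zero-padded to `width` characters."""
    n = int.from_bytes(hashlib.sha1(data).digest(), "big")
    out = ""
    while n:
        n, rem = divmod(n, 36)
        out = BASE36_DIGITS[rem] + out
    return out.rjust(width, "0")


def hashed_thumb_url(url: str, original: bytes) -> str:
    """Replace the file name in the last path segment with the hash.

    Expects a last segment shaped like "800px-Some_name.jpg"; the
    first occurrence of the name earlier in the path is left alone.
    """
    head, _, last = url.rpartition("/")
    size_prefix, _, name = last.partition("-")   # "800px", "Some_name.jpg"
    ext = posixpath.splitext(name)[1]            # ".jpg" (or "" if none)
    return f"{head}/{size_prefix}-{sha1_base36(original)}{ext}"
```

Consumers that need the human-readable name would then read it from the second-to-last path segment, which this function does not modify.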

ema moved this task from Triage to Caching on the Traffic board. Nov 7 2016, 11:36 AM
daniel moved this task from Inbox to Backlog on the TechCom-RFC board. Nov 30 2016, 9:34 PM
Arlolra moved this task from Backlog to Non-Parsoid Tasks on the Parsoid board. Dec 2 2016, 7:34 PM

Note to future selves: we'd probably want the filename to contain the human-readable File: name so that people and machines can generate search results more easily.

Krinkle renamed this task from Use content hash based image / thumb URLs to RFC: Use content hash based image / thumb URLs. Mar 21 2018, 9:03 PM
MaxSem removed a project: Zero. Jan 3 2019, 11:16 PM