RFC: Use content hash based image / thumb URLs
Open, Medium, Public

Description

  • Affected components: TBD.
  • Engineer for initial implementation: TBD.
  • Code steward: TBD.

Motivation

(Define the problem you are seeking to solve.)

Requirements

(Specify the requirements that a proposal should meet.)


Exploration

This task was split out of T66214, as establishing an API for thumbnails is more pressing than moving to content hash based thumb identifiers. The thumbnail API can accommodate either without too much trouble, which lets us tackle the move to content hash based addressing in a second phase.

Identifying thumbs by content hash instead of human-readable names

Content hash based URLs for media files and thumbnails have some advantages over the current pretty names:

  • automatic cache busting
  • consistency between HTML revisions and the media they reference, in particular for old revisions (important for HTML storage and Parsoid)
  • natural content-based deduplication
  • content-based image blocking (bad image lists etc)
  • media renames don't trigger HTML updates
  • simplifies a potential migration of all media content to commons
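As a rough illustration of the cache-busting and deduplication points above, the sketch below derives a URL from the file's bytes. The base-36 SHA-1 form (31 characters, matching the length of the hash in the example URL quoted later in this task) is an assumption, and both `content_url` and the `/images/<hash><ext>` layout are hypothetical names chosen for illustration, not an existing scheme.

```python
import hashlib

_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(data: bytes) -> str:
    """Return the SHA-1 of `data` as a zero-padded, 31-character base-36 string."""
    n = int.from_bytes(hashlib.sha1(data).digest(), "big")
    chars = []
    while n:
        n, remainder = divmod(n, 36)
        chars.append(_DIGITS[remainder])
    # A 160-bit value needs at most 31 base-36 digits.
    return "".join(reversed(chars)).rjust(31, "0")


def content_url(data: bytes, ext: str, base: str = "/images") -> str:
    """Hypothetical URL layout: the path depends only on the content and extension."""
    return f"{base}/{sha1_base36(data)}{ext}"


# Identical bytes share one URL (deduplication); a new version of the file
# gets a new URL, so stale cache entries are never served for it.
original = b"first version of the image"
reupload = b"first version of the image"
new_version = b"second version of the image"
same_url = content_url(original, ".jpg") == content_url(reupload, ".jpg")
changed_url = content_url(original, ".jpg") != content_url(new_version, ".jpg")
```

Because the name no longer appears in the URL, a rename leaves the URL untouched, while an upload of new content changes it.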

There are also some disadvantages:

  • need to use the Content-Disposition header to suggest a pretty name when saving an image
  • need to think about quick image purging for copyvio cases, as cache busting is not enough there
  • applying access restrictions is more complicated, as it requires querying all image revisions referring to the hash and choosing which restriction to apply (likely "least-restrictive restriction wins")
  • media edits (i.e. uploading a new version) do trigger HTML updates
  • hash collisions could be used for vandalism, should the chosen hash mechanism turn out to be susceptible to practical preimage attacks and re-uploads of the same content be allowed (which may be desirable, as it allows data corruption to be fixed easily)
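For the Content-Disposition point, a minimal sketch of building the header value might look like the following. The function name `content_disposition` is hypothetical; the `filename*=UTF-8''...` form follows the general convention for non-ASCII file names in HTTP headers, with a plain ASCII `filename` as a fallback for older clients.

```python
from urllib.parse import quote


def content_disposition(pretty_name: str, disposition: str = "inline") -> str:
    """Build a Content-Disposition value suggesting `pretty_name` when saving."""
    # ASCII-only fallback: non-ASCII characters become "?", and characters that
    # would break the quoted string are replaced with "_".
    fallback = pretty_name.encode("ascii", "replace").decode("ascii")
    fallback = fallback.replace("\\", "_").replace('"', "_")
    # Percent-encode the UTF-8 bytes of the full name for the extended parameter.
    encoded = quote(pretty_name, safe="")
    return f"{disposition}; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


header_value = content_disposition("Munich_subway_station_Westfriedhof2.jpg")
```

The server would send this value alongside the hash-addressed response, so that "save image as" still proposes the human-readable name.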

Event Timeline

GWicke updated the task description.

FYI, I sort of implemented a solution for this on Vagrant a while ago for the current thumbnail URI scheme, by replacing the second instance of the file name in the URI with the SHA1 of the original.

E.g. http://127.0.0.1:6081/images/thumb/d/d7/Munich_subway_station_Westfriedhof2.jpg/800px-Munich_subway_station_Westfriedhof2.jpg becomes http://127.0.0.1:6081/images/thumb/d/d7/Munich_subway_station_Westfriedhof2.jpg/800px-2zeso4ug3i3dsai23sd7thu7kdnndiq.jpg

It only causes minor breakage for code in the wild that consumed the second name when it needed the original's file name, which has to be updated to read the first occurrence instead.

The option that enables this is $supportsSha1URLs on the FileRepo.
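The URL rewrite described above can be sketched roughly as follows. This is not the Vagrant patch itself: `to_sha1_thumb_url` is a hypothetical helper, and it only handles the simple case where the thumbnail name ends with the original file name (as in the example), not thumbnails whose extension differs from the original's.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def to_sha1_thumb_url(thumb_url: str, sha1_base36: str) -> str:
    """Replace the second occurrence of the file name in a thumb URL with a hash."""
    parts = urlsplit(thumb_url)
    directory, thumb_name = posixpath.split(parts.path)
    # The first occurrence of the file name is the last directory component.
    original_name = posixpath.basename(directory)
    if not original_name or not thumb_name.endswith("-" + original_name):
        raise ValueError("thumbnail name does not end with the original file name")
    # Keep the size prefix (e.g. "800px-") and the original extension.
    prefix = thumb_name[: len(thumb_name) - len(original_name)]
    _, ext = posixpath.splitext(original_name)
    new_path = posixpath.join(directory, f"{prefix}{sha1_base36}{ext}")
    return urlunsplit(parts._replace(path=new_path))


rewritten = to_sha1_thumb_url(
    "http://127.0.0.1:6081/images/thumb/d/d7/"
    "Munich_subway_station_Westfriedhof2.jpg/"
    "800px-Munich_subway_station_Westfriedhof2.jpg",
    "2zeso4ug3i3dsai23sd7thu7kdnndiq",
)
```

Consumers that need the original file name would read `original_name`, i.e. the first occurrence in the path, which is the change to client code mentioned above.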

Note to future selves: we'd probably want the filename to contain the human-readable File: name so people and machines can generate search results more easily.

Krinkle renamed this task from Use content hash based image / thumb URLs to RFC: Use content hash based image / thumb URLs. Mar 21 2018, 9:03 PM
Krinkle moved this task from Old to P1: Define on the TechCom-RFC board.
BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!