[S] Explore how many rows in the file tables have duplicated SHA-1 values on WMF wikis
Closed, Resolved, Public

Description

Files in the image, oldimage, and filearchive tables can share the same SHA-1 value. This task is to determine how often such duplication occurs and, as a result, what safeguards need to be in place for duplicated files.

To do this, queries will be run to find the maximum number of duplicates for a single SHA-1 on commonswiki.
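
As a rough illustration of the kind of query meant here, the sketch below counts rows per SHA-1 and returns the most duplicated value. It is not the query that was actually run for this task. It uses an in-memory SQLite database with toy data, and the helper name `max_sha1_duplicates` is hypothetical. The `image` table and `img_sha1` column are assumed to match the MediaWiki schema, and empty values are assumed to mean "no SHA-1".

```python
import sqlite3

def max_sha1_duplicates(conn, table, sha1_column):
    """Return (sha1, row_count) for the most duplicated non-empty SHA-1, or None."""
    # Table and column names are interpolated, so only pass trusted constants.
    query = (
        f"SELECT {sha1_column}, COUNT(*) AS n FROM {table} "
        f"WHERE {sha1_column} IS NOT NULL AND {sha1_column} != '' "
        f"GROUP BY {sha1_column} ORDER BY n DESC, {sha1_column} LIMIT 1"
    )
    return conn.execute(query).fetchone()

# Toy data standing in for a real wiki database (assumption, not real results).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (img_name TEXT, img_sha1 TEXT)")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [("A.png", "abc"), ("B.png", "abc"), ("C.png", "def"), ("D.png", "")],
)
result = max_sha1_duplicates(conn, "image", "img_sha1")
print(result)
```

The same helper could be pointed at the oldimage and filearchive tables by passing their table and SHA-1 column names, assuming those follow the same pattern.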

Acceptance criteria
  • Run these queries and explore what the results mean for the code that reads images by SHA-1 value.

Event Timeline

Results:

From this, code that fetches files by SHA-1 should expect that a single SHA-1 can match thousands of file rows, and should therefore limit the number of rows it selects.
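
One way to apply such a limit is sketched below, again against a toy SQLite database rather than the real code. The function name `files_for_sha1` and the cap `MAX_FILES_PER_SHA1` are hypothetical, and the `image` / `img_sha1` names are assumed from the MediaWiki schema. The sketch fetches one row more than the cap so the caller can tell whether results were cut off.

```python
import sqlite3

MAX_FILES_PER_SHA1 = 3  # hypothetical cap; choose a value suited to the caller

def files_for_sha1(conn, sha1, limit=MAX_FILES_PER_SHA1):
    """Return (names, truncated) with at most `limit` file names for a SHA-1."""
    # Ask for limit + 1 rows so truncation can be detected without a COUNT(*).
    rows = conn.execute(
        "SELECT img_name FROM image WHERE img_sha1 = ? "
        "ORDER BY img_name LIMIT ?",
        (sha1, limit + 1),
    ).fetchall()
    names = [row[0] for row in rows[:limit]]
    truncated = len(rows) > limit
    return names, truncated

# Toy data: five files sharing one SHA-1, plus one unrelated file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (img_name TEXT, img_sha1 TEXT)")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [(f"File{i}.png", "abc") for i in range(5)] + [("Other.png", "def")],
)
names, truncated = files_for_sha1(conn, "abc")
print(names, truncated)
```

Reporting the truncation flag lets the caller decide whether to show a "more results exist" notice instead of silently dropping rows.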

It should also handle rows that have no SHA-1. A missing SHA-1 seems to mean that no image is associated with the row, so such rows should be ignored.
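
A minimal sketch of ignoring such rows follows. It assumes rows are plain dictionaries with a `sha1` key and that a missing SHA-1 may appear as either `None` or an empty string. Both the row shape and the helper name `rows_with_sha1` are illustrative and not taken from the actual code.

```python
def rows_with_sha1(rows):
    """Yield only the rows that carry a usable SHA-1 value."""
    for row in rows:
        sha1 = row.get("sha1")
        if not sha1:  # covers a missing key, None, and ""
            continue
        yield row

# Toy rows (assumption): two have no usable SHA-1 and should be skipped.
rows = [
    {"name": "A.png", "sha1": "abc"},
    {"name": "B.png", "sha1": ""},
    {"name": "C.png", "sha1": None},
    {"name": "D.png", "sha1": "def"},
]
kept = [row["name"] for row in rows_with_sha1(rows)]
print(kept)
```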