
Warn the user if they're uploading a non-exact duplicate image via UploadWizard
Open, Needs Triage, Public

Description

We'll need to

  • create an implementation of the pHash algorithm that can be run inside MediaWiki (or find an open source implementation we can use)
  • run this for all existing images, and store the hashes
  • create a new hash and store it for every new upload

The above is covered by T167947, and will allow the community to create tools to find duplicates.
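For illustration only, here is a minimal sketch of a DCT-based perceptual hash in the style of pHash. It is written in pure Python rather than as MediaWiki code, and it assumes the image has already been decoded, converted to grayscale, and downscaled to a 32x32 grid elsewhere. The function names, the 32x32 input size, and the 8x8 low-frequency block are assumptions made for this sketch, not details taken from T167947 or from any particular pHash library.

```python
import math
from statistics import median

SIZE = 32  # assumed side length of the downscaled grayscale input
LOW = 8    # assumed side length of the low-frequency block kept for the hash


def dct_1d(vec):
    """Unnormalised DCT-II of a sequence of numbers."""
    n = len(vec)
    return [
        sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(vec))
        for k in range(n)
    ]


def dct_2d(pixels):
    """2D DCT: transform every row, then every column of the result."""
    rows = [dct_1d(row) for row in pixels]
    cols = [dct_1d(col) for col in zip(*rows)]
    # cols is indexed [horizontal][vertical]; transpose back to [y][x].
    return [list(r) for r in zip(*cols)]


def perceptual_hash(pixels):
    """Return a 64-bit integer hash for a SIZE x SIZE grayscale grid."""
    coeffs = dct_2d(pixels)
    low = [coeffs[y][x] for y in range(LOW) for x in range(LOW)]
    # Skip the DC term (overall brightness) when choosing the threshold.
    threshold = median(low[1:])
    bits = 0
    for c in low:
        bits = (bits << 1) | (1 if c > threshold else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Because each bit only records whether a low-frequency coefficient is above the median, uniformly scaling the pixel values leaves the hash unchanged, which is the kind of robustness that makes this family of hashes useful for finding non-exact duplicates.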

If we also gathered deleted images and generated and stored hashes for those, this would be even more useful. Storing hashes for images on other wikis would be useful too: if those images are fair use, then they're likely to be copyrighted.

Once that's done, we'll also need to generate the hash for stashed uploads in UW, match it against the existing stored hashes, and alert the user if there's a match (with a warning if the matching image has already been deleted).


Some implementation details

  • Trying to figure out how to store hashes so that they can be searched by Hamming distance is what made a previous version of this run aground (see T121797). According to User:Fæ's work on this, we'll catch ~90% of duplicates just by checking for identical hashes, so let's check only for identical hashes
  • We can possibly store the hashes in a MySQL table with a flag indicating whether the original image has been deleted (the flag gets updated when the image is deleted/undeleted)
    • Ideally we'd store the pHash in the new file table, see T28741
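
As a rough sketch of the storage and lookup described above, the following uses Python's built-in sqlite3 module as a stand-in for a MySQL table. The table name image_phash, its columns, and the helper function names are hypothetical and chosen only for this example; the real schema would depend on T28741. The hash is stored as a 16-character hex string because a 64-bit unsigned value does not fit in SQLite's signed integer type, and the lookup is an exact match only, as proposed above.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the (hypothetical) hash table, creating it if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image_phash ("
        " file_name TEXT PRIMARY KEY,"
        " phash TEXT NOT NULL,"
        " deleted INTEGER NOT NULL DEFAULT 0)"
    )
    # Index on the hash so exact-match lookups stay cheap.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_phash ON image_phash (phash)"
    )
    return conn


def store_hash(conn, file_name, phash):
    """Record the hash for a new upload (not deleted)."""
    conn.execute(
        "INSERT OR REPLACE INTO image_phash (file_name, phash, deleted)"
        " VALUES (?, ?, 0)",
        (file_name, format(phash, "016x")),
    )


def set_deleted(conn, file_name, deleted):
    """Update the flag when an image is deleted or undeleted."""
    conn.execute(
        "UPDATE image_phash SET deleted = ? WHERE file_name = ?",
        (int(deleted), file_name),
    )


def check_upload(conn, phash):
    """Return (file_name, status) for every stored image with an identical hash."""
    rows = conn.execute(
        "SELECT file_name, deleted FROM image_phash"
        " WHERE phash = ? ORDER BY file_name",
        (format(phash, "016x"),),
    ).fetchall()
    return [
        (name, "previously deleted" if deleted else "existing")
        for name, deleted in rows
    ]
```

In this sketch, a stashed upload in UW would be hashed and passed to check_upload; an empty result means no warning, an "existing" match means a likely duplicate, and a "previously deleted" match would trigger the stronger warning.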