
Warn the user if they're uploading a non-exact duplicate image via UploadWizard
Open · Needs Triage · Public · Feature

Description

We'll need to

  • create an implementation of the pHash algorithm that can be run inside MediaWiki (or find an open source implementation we can use)
  • create a new hash and store it for every new upload (not just via UW)
  • maybe also do this for all existing images, and store the hashes

The above is covered by T167947, and will allow the community to create tools to find duplicates.

If we also gathered deleted images and generated and stored hashes for those, this would be even more useful. Storing hashes for images on other wikis would be useful too: if those images are fair-use then they're likely to be copyrighted.

Once that's done, we'll also need to generate the hash for stashed uploads in UW, match it against the existing stored hashes, and alert the user if there's a match (and warn them if that match has already been deleted).


Some implementation details

  • Trying to figure out how to store hashes that can be searched by Hamming distance is what ran a previous version of this aground (see T121797). According to User:Fæ's work on this, we'll catch ~90% of duplicates just by checking for identical hashes, so let's just check for identical hashes
  • Hashes will need to be updated if a new version of an image is uploaded
  • We can possibly store the hashes in a mysql table with a flag to indicate if the original image has been deleted (that gets updated when the image is deleted/undeleted)
    • Or better still we'd store the pHash in the new file table, see T28741
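The storage idea above (exact-hash lookup plus a deleted flag) can be sketched in a few lines of Python. This is a minimal in-memory stand-in for the proposed table; the class and method names are invented for illustration, and lookup is exact-match only, per the "identical hashes" decision:

```python
# Hypothetical sketch of the proposed hash store: phash -> file record,
# with a flag that is flipped when the original image is deleted/undeleted.
class PerceptualHashStore:
    def __init__(self):
        self._by_hash = {}  # 64-bit int phash -> {"title": str, "deleted": bool}

    def add(self, phash, title):
        # Called for every new upload (and again when a new file version replaces it)
        self._by_hash[phash] = {"title": title, "deleted": False}

    def mark_deleted(self, phash, deleted=True):
        # Called when the image is deleted or undeleted
        self._by_hash[phash]["deleted"] = deleted

    def find_duplicate(self, phash):
        # Exact-match lookup only, per the ~90% figure above; returns None if no match
        return self._by_hash.get(phash)

store = PerceptualHashStore()
store.add(0xd1c2b3a495867788, "File:Example.jpg")
match = store.find_duplicate(0xd1c2b3a495867788)
# match is the stored record; a deleted match would trigger the stronger warning
```

In the real implementation this would be a MariaDB table (or a column in the new file table per T28741) rather than a dict, but the upload-time flow is the same: hash the stashed file, look it up, and branch on the deleted flag.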

Event Timeline

Cparle updated the task description.

It would be interesting to have such an API, for instance for mass uploading tools like OpenRefine. This could potentially enable us to check for close duplicates before upload.
It would be ideal if the hashing algorithm used could be sort of standardized, meaning that it could also be implemented client side. But I understand if this is too much of a stability guarantee to offer.

Basically, the ideal situation for us would be:

  • we compute the perceptual hash locally (with a Java implementation of the hashing algorithm)
  • we query Commons to see if this perceptual hash already exists there
  • if it does, we report it as a duplicate to the user, before they even upload the file to Commons via OpenRefine
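A client-side lookup along these lines could look as follows (in Python rather than Java, for consistency with the rest of this task). Note that the `phashsearch` list name and its parameter are entirely invented — no such Commons API exists yet — so this only illustrates the shape of the request the workflow above would need:

```python
from urllib.parse import urlencode

def build_lookup_url(phash_hex):
    # Hypothetical endpoint: hash computed locally, then checked against
    # Commons *before* the file is uploaded via OpenRefine.
    base = "https://commons.wikimedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "phashsearch",   # invented module name, for illustration only
        "phash": phash_hex,
        "format": "json",
    }
    return base + "?" + urlencode(params)

url = build_lookup_url("d1c2b3a495867788")
# The client would fetch this URL and report any hit as a duplicate
```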

Some comments and thoughts

The currently common libraries (e.g., Python Imagehash, PHP Imagehash, OpenCV, JImageHash) use the pHash function to generate their hash values, but the values are library-specific: differences arise because implementation details and bit lengths vary. The performance and accuracy of the implementations may also differ.

Ideally, the target would be to write or document a reference pHash version that could be easily ported to other programming languages. The hash values could also be used outside of the MediaWiki core, and hash values would be metadata for Wikimedia Commons images. This would allow external apps to calculate hash values and compare them to values in Wikimedia Commons without transferring actual photos. Pywikibot and OpenRefine are clear candidates for utilizing hashes, but use cases are similar for all uploading tools, and there is a need to confirm similarities in machine learning using secondary methods, too.

Other hashes than pHash

The requirement for the image-hashing function is that it be not just accurate but also generate unique hashes. For example, Blockhash is designed to be portable, but it is not very resistant to scaling/compression, so it would require a Hamming-distance search instead of a simple check for identical hashes. Wavelet hashing has the same accuracy as pHash, but it produces many more hash collisions, so it doesn't work with 100M photos. dHash has a low hash-collision rate, and its accuracy is at the same level as pHash when it is calculated in both directions, doubling the bit length of the hash. One consideration is that more than 64 bits of hash resolution is needed to detect differences between nearly identical but unique photos (for example, a time series with small movements between frames). However, it is unknown how increasing the pHash length would affect the false-negative rate.
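The "both directions" idea for dHash can be sketched in pure Python. This assumes the image has already been converted to grayscale and resized to an 8x9 (horizontal) or 9x8 (vertical) grid of pixel values; the function names are illustrative, not any library's API:

```python
def dhash_bits(pixels, horizontal=True):
    # pixels: 2-D list of grayscale values. Each bit records whether the
    # gradient between two adjacent pixels is positive, which is what makes
    # dHash robust to uniform brightness changes.
    bits = []
    if horizontal:
        for row in pixels:                      # compare left -> right
            for x in range(len(row) - 1):
                bits.append(1 if row[x] < row[x + 1] else 0)
    else:
        for y in range(len(pixels) - 1):        # compare top -> bottom
            for x in range(len(pixels[0])):
                bits.append(1 if pixels[y][x] < pixels[y + 1][x] else 0)
    return bits

def dhash_both_directions(pixels_h, pixels_v):
    # Concatenating the horizontal and vertical hashes doubles the bit
    # length (2 x 64 = 128 bits for the usual 8x8 output), as described above.
    return dhash_bits(pixels_h, True) + dhash_bits(pixels_v, False)
```

The Python imagehash library exposes the same idea as separate `dhash` and `dhash_vertical` functions.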

Querying hashes in Mariadb

Imagehash.toolforge.org uses the Python imagehash library's 64-bit pHash + 64-bit dHash combination, so the values fit in MariaDB's unsigned BIGINT field. The search is done like this:

SELECT i1.page_id, i2.page_id
FROM imagehash AS i1, imagehash AS i2
WHERE i1.page_id = PAGE_ID
AND   i1.page_id != i2.page_id
AND ( ( i1.phash = i2.phash AND BIT_COUNT(i1.dhash ^ i2.dhash) < 4 )
   OR ( i1.dhash = i2.dhash AND BIT_COUNT(i1.phash ^ i2.phash) < 4 ) )
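The `BIT_COUNT(a ^ b)` expression in the query is a 64-bit Hamming distance. The same check can be sketched in Python (reading the two parenthesised groups as alternatives: one hash identical, the other within distance 4):

```python
def hamming64(a, b):
    # Equivalent of MariaDB's BIT_COUNT(a ^ b) on unsigned BIGINT hash values:
    # XOR leaves a 1 bit wherever the hashes differ, then count the 1 bits.
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")

def is_near_duplicate(phash1, dhash1, phash2, dhash2, max_dist=4):
    # Sketch of the query's predicate: pHash identical and dHash close,
    # or dHash identical and pHash close.
    return ((phash1 == phash2 and hamming64(dhash1, dhash2) < max_dist)
            or (dhash1 == dhash2 and hamming64(phash1, phash2) < max_dist))
```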

After comparing Imagehash.py and the OpenCV imagehash

There are at least the following differences in implementation:

1.) Resizing order difference

  • Imagehash converts the image to grayscale and then resizes it
  • OpenCV resizes the image and then converts it to grayscale

2.) Grayscale conversion has a float-to-integer rounding difference

  • Imagehash floors the values to integers
  • OpenCV rounds the values to the nearest integer. To get the same behaviour in OpenCV as in Imagehash:

import cv2
import numpy

cv_image = cv2.imread('test.jpg')
# Convert to float before the grayscale conversion so nothing is rounded yet
float32_image = cv_image.astype(numpy.float32)
grayscale_float32 = cv2.cvtColor(float32_image, cv2.COLOR_BGR2GRAY)
resized_grayscale_float32 = cv2.resize(grayscale_float32, (32, 32), interpolation=cv2.INTER_NEAREST_EXACT)
# Casting to uint8 truncates (floors) the values, matching Imagehash
resized_grayscale_uint8 = resized_grayscale_float32.astype(numpy.uint8)
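The practical effect of floor vs round can be shown without any imaging library, using the ITU-R BT.601 luma weights that both Pillow and OpenCV use for grayscale conversion:

```python
import math

def luma(r, g, b):
    # ITU-R BT.601 grayscale weights, as used by both Pillow ('L' mode)
    # and OpenCV (COLOR_BGR2GRAY).
    return 0.299 * r + 0.587 * g + 0.114 * b

v = luma(1, 0, 4)   # 0.755: a value where the two strategies disagree
floored = math.floor(v)  # Imagehash-style truncation -> 0
rounded = round(v)       # OpenCV-style rounding -> 1
```

A single such one-level pixel difference can flip a comparison against the midpoint and so flip a bit of the final hash, which is why the two libraries' hashes diverge.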

3.) Resizing image method

  • Imagehash uses Image.Resampling.LANCZOS (Image.Resampling.NEAREST would be the same as in OpenCV)
  • OpenCV uses INTER_LINEAR_EXACT

4.) Discrete Cosine Transform

  • OpenCV's cv2.dct function and SciPy's scipy.fftpack.dct use different normalisation methods. The workaround in Imagehash would be to use orthogonal normalisation, so the result is nearly the same as in OpenCV:
dat = fftpack.dct(fftpack.dct(pixels, axis=0, norm='ortho'), axis=1, norm='ortho')
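For reference, the orthonormal DCT-II that norm='ortho' selects can be written out in pure Python. This is a naive O(N²) sketch for checking values, not something to run on real images:

```python
import math

def dct2_ortho(x):
    # Orthonormal 1-D DCT-II, matching scipy.fftpack.dct(x, norm='ortho').
    # The 2-D version above applies this along each axis in turn.
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N))
        # Orthonormal scaling: sqrt(1/N) for the DC term, sqrt(2/N) otherwise
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * s)
    return out
```

Because the transform is orthonormal it preserves energy (Parseval), which is a quick way to check any port against this reference.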

5.) When calculating the midpoint of the value range, OpenCV uses the average and Imagehash uses the median.

6.) OpenCV sets pixel (0,0) to zero before calculating the midpoint.

There is also a difference in how the result bitstring is generated from the 8x8 DCT array and the midpoint. However, up to this point it was possible to modify the OpenCV and Imagehash algorithms so that the data behind the printed value is identical.
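The shared final step — thresholding the low-frequency 8x8 DCT block against a midpoint — can be sketched as follows (illustrative helper, not either library's code; point 5 above is exactly the choice between median and average here):

```python
def bits_from_dct(dct8x8, use_median=True):
    # dct8x8: the top-left 8x8 block of low-frequency DCT coefficients.
    # Each output bit records whether a coefficient is above the midpoint;
    # Imagehash uses the median, OpenCV the average.
    flat = [v for row in dct8x8 for v in row]
    if use_median:
        mid = sorted(flat)[len(flat) // 2]
    else:
        mid = sum(flat) / len(flat)
    return [1 if v > mid else 0 for v in flat]
```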

iNaturalistReviewer currently uses the Python imagehash library's phash function to perform fuzzy comparison, with a maximum Hamming distance of 4. I found this gave adequate results in detecting different versions of the same image without false positives, though I haven't done a significant amount of tuning. In practice I've found it's effective against scaling, but has very little crop tolerance.

libfastimagehash is an imagehash replacement for C/C++ ... The methods used to compute the image hashes are identical to the imagehash Python library; however, due to some slight differences in the way OpenCV vs Pillow resize images, the final image hashes are not always exactly the same.

I tested the JavaScript version and it generated the hash in reversed order, i.e. it needed to be printed using hash.toHexStringReversed() to get the same format as with the Python library. There were a couple of bits' difference when comparing to the Python hashes, because of scaling differences.

So it would require an implementation of Pillow's Lanczos to make it exact. Code for Pillow's Lanczos was linked on Reddit, if somebody wants to try writing a separate standalone resizing function.

Aklapper changed the subtype of this task from "Task" to "Feature Request".Mar 18 2025, 4:11 PM