
Change IA upload disallow on duplicate to a challenge
Closed, Duplicate | Public

Description

IA-upload refuses to upload a second copy of a file from the Internet Archive, based on the unique URI fragment for the work. This used to be a reasonable way to manage duplicates. However, now that Fae has mass-uploaded PDFs from IA, and these PDFs are often of problematic quality or are problematic scans to proofread, the restriction is unreasonable and cannot easily be managed by general users.

I propose that the restriction be changed to 1) (best option) a warning/challenge that says the file has already been uploaded, and in what form, but still allows an upload of a different file type if forced. If that is too hard to achieve, then 2) please remove the restriction completely where the file type differs from what has already been uploaded. If checking the file type is too hard, then 3) please disable the restriction.

I still think that it is reasonable to reject an upload of the same file type, as the user can easily upload a new version over the top.
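
The proposal above (challenge on a different file type, reject on the same one) can be read as a single decision rule. The following is only a minimal Python sketch of that rule, assuming the tool can tell which file type was already uploaded for an IA item. `DuplicateAction`, `decide_duplicate_action`, and the `force` flag are hypothetical names for illustration, not part of IA-upload.

```python
from enum import Enum
from typing import Optional


class DuplicateAction(Enum):
    ALLOW = "allow"    # nothing uploaded yet, or a forced upload of a different type
    WARN = "warn"      # a different file type exists: challenge the user first
    REJECT = "reject"  # same file type exists: upload a new version over it instead


def decide_duplicate_action(
    existing_type: Optional[str], new_type: str, force: bool = False
) -> DuplicateAction:
    """Sketch of option 1: challenge instead of refusing outright.

    existing_type is the file type already uploaded for this IA item,
    or None if there is no earlier upload (hypothetical input).
    """
    if existing_type is None:
        return DuplicateAction.ALLOW
    if existing_type.lower() == new_type.lower():
        # Same file type: the user can overwrite the existing file.
        return DuplicateAction.REJECT
    # Different file type: only proceed once the user has confirmed.
    return DuplicateAction.ALLOW if force else DuplicateAction.WARN
```

Under these assumptions, an existing PDF plus a requested DjVu gives `WARN` on the first attempt and `ALLOW` once the user forces the upload. Option 2 corresponds to treating every different-type case as `ALLOW`, and option 3 to always returning `ALLOW`.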

Event Timeline

The very simplest way would be to just change pageForIAItem() to always return an empty string.

The next simplest would be to add a checkbox ("Ignore duplicates") to the UI that pageForIAItem() checks and, if it is ticked, returns the empty string.

With the latter approach the solution could eventually be expanded so that the checkbox is hidden by default, and when pageForIAItem() detects a duplicate it sends the user to a form where that checkbox is made visible (and possibly checked by default). That would give us some protection against unintended dupes, but may not be worth the effort in practice.
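
The checkbox variant and its possible extension into a confirmation step could look roughly like the sketch below. It is a Python stand-in rather than the tool's actual code: the real pageForIAItem() lives in the IA-upload codebase, and the signature, the `find_existing_page` lookup, and the `next_step` helper here are all assumptions made for illustration.

```python
from typing import Callable, Optional


def page_for_ia_item(
    item_id: str,
    find_existing_page: Callable[[str], Optional[str]],
    ignore_duplicates: bool = False,
) -> str:
    """Return the existing page for an IA item, or "" if there is none.

    With ignore_duplicates set (the hypothetical "Ignore duplicates"
    checkbox), always report no duplicate by returning "".
    """
    if ignore_duplicates:
        return ""
    return find_existing_page(item_id) or ""


def next_step(
    item_id: str,
    find_existing_page: Callable[[str], Optional[str]],
    ignore_duplicates: bool = False,
) -> str:
    """Decide which form to show next (hypothetical flow).

    "confirm" stands for the form where the checkbox is made visible;
    "upload" stands for proceeding with the upload as usual.
    """
    existing = page_for_ia_item(item_id, find_existing_page, ignore_duplicates)
    return "confirm" if existing else "upload"
```

For example, with a lookup that knows about one existing file, the first attempt is routed to the confirmation form and the resubmission with the checkbox ticked goes through:

```python
lookup = {"item1": "File:Item1.pdf"}.get
next_step("item1", lookup)                          # "confirm"
next_step("item1", lookup, ignore_duplicates=True)  # "upload"
```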

The first two variants should be doable within roughly "straightforward bugfix" amounts of time. The third probably isn't prohibitively hard to do, but that would depend on what facilities are already available, and it would in any case take non-trivially more effort than the two dumbest approaches. But maybe it could piggyback on the functionality that returns information about the duplicate file to the user?

In any case, the dupe check has as a fundamental assumption that the existence of duplicates will be an exception state. After the bulk upload of significant portions of IA's works, that assumption no longer holds true: not having a dupe is now the exception, and having a dupe is the expected state. In other words, so long as this check is in place as-is, the tool is functionally broken from an end-user perspective.

Samwilson subscribed.

I've merged this with T269518; hopefully that's okay. I think that task captures all of this.