IA Upload: Permit duplicate IA identifier if of a different format
Closed, Resolved · Public

Description

https://commons.wikimedia.org/wiki/File:Duke_University_Libraries_(IA_carysnewitinerar01cary).pdf has over-compressed scans.

For Wikisource purposes, a good-quality scan is needed, and in places the PDF scan of this is not reliably intelligible.

So I would like to use https://iw.toolforge.org/ia-upload to attempt a regeneration of the file from the JP2 scans (into either PDF or DjVu).

The tool, however, won't let me do this because a file with the relevant IA identifier already exists (albeit as a PDF).

The tool should allow me to regenerate the relevant scan regardless of the presence of the existing PDF. (The warning, whilst appreciated, is not helpful if it prevents me from doing something that was a deliberate choice to resolve a specific technical issue.)

Event Timeline

ShakespeareFan00 renamed this task from "Presence of PDf file with given filename or IA identifer blocks attempts to regenerate file" to "Presence of PDf file with given filename or IA identifer blocks attempts to regenerate alternate file for the associated identifier." (Dec 6 2020, 9:53 AM)
ShakespeareFan00 updated the task description.
Reedy renamed this task from "Presence of PDf file with given filename or IA identifer blocks attempts to regenerate alternate file for the associated identifier." to "Presence of PDF file with given filename or IA identifer blocks attempts to regenerate alternate file for the associated identifier" (Dec 6 2020, 3:25 PM)

I had not seen that you had reported the same problem (I reported https://phabricator.wikimedia.org/T270928). It turns out that the user Fæ (a bot?) uploaded PDF files from the Internet Archive (https://commons.wikimedia.org/wiki/Special:Contributions/F%C3%A6), and this now blocks DjVu uploads with this tool, even though the PDF versions are of poorer quality than the DjVu files. DjVu files are the priority for Wikisource, for OCR and transcription.

Samwilson added a subscriber: Inductiveload.
Samwilson subscribed.

I've merged the two tasks above into this one (the first that was created). They're not all identical, but I think they can be fixed together by changing the tool to just show a prominent warning when an IA identifier is found to already be on Commons, instead of prohibiting the upload. This would mean additional PDFs or DjVus could be uploaded. Does that sound okay?

As a precursor to this, it's probably worth updating some dependencies: https://github.com/wikisource/ia-upload/pull/48

Samwilson renamed this task from "Presence of PDF file with given filename or IA identifer blocks attempts to regenerate alternate file for the associated identifier" to "IA Upload: Permit duplicate IA identifier if of a different format". (Feb 22 2021, 11:44 PM)

PR 48 is merged and deployed to the test site.

It sounds like we want to continue preventing duplicate IA IDs where the filetype is the same, and only show a warning when it's different. Something like this?:

An existing file is already linked to the IA identifier 'in.ernet.dli.2015.3711': File:Under the Greenwood Tree.pdf. Please only upload a new file if you're certain that it's not duplicating that file.

Does that sound okay?
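
To make the rule proposed in this comment concrete (block a second upload in the same format, only warn when the format differs), here is a minimal sketch in Python. It is an illustration only, not the tool's actual code; the function name `check_existing` and its return values are hypothetical.

```python
from pathlib import PurePosixPath


def check_existing(existing_title: str, requested_format: str) -> str:
    """Return 'block' for a same-format file, 'warn' for a different format.

    Hypothetical helper: existing_title is the Commons file already linked
    to the IA identifier; requested_format is e.g. 'pdf' or 'djvu'.
    """
    # Take the extension after the last dot and compare case-insensitively.
    existing_format = PurePosixPath(existing_title).suffix.lstrip(".").lower()
    if existing_format == requested_format.lower():
        return "block"
    return "warn"


print(check_existing("File:Under the Greenwood Tree.pdf", "djvu"))
print(check_existing("File:Under the Greenwood Tree.pdf", "pdf"))
```

With the example file from the message above, a DjVu request would get the warning and a second PDF request would be refused.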

@Samwilson I would go like this: ...Please only upload a new file if you're certain that it's not duplicating that file in the same format. Uploading a duplicate as a different file type is OK.

My suggested wording would be "An existing file is already linked to the IA identifier 'foo00foo': File:bar.pdf. It is not necessary to upload a duplicate file, unless it is to resolve a technical issue, missing or illegible content." This enumerates the reasons for which uploading apparent 'duplicates' would be permitted.

@Jan.Kamenicek re "if you're certain that it's not duplicating that file in the same format" — the patch above still prohibits same-format duplicates, so this wouldn't quite fit.

Do we want to allow duplicates of the same format?

@ShakespeareFan00 that's a good idea to list the reasons for uploading a dupe, but "missing or illegible content" sounds to me like times when we'd want to overwrite an existing file with a better one (of course, not if it's a different edition, but I assume we're talking about a matching IA ID).

@Samwilson Ah... based on the name of this task I supposed that its aim is to enable uploading duplicate files in different formats, so now I am quite confused why the suggested solution does not enable it... I have also explicitly asked for it in [[ T272167 ]] and explained the reasons there. You have closed it as a duplicate of this task, so I supposed this task is going to solve it. Does it not?

@Samwilson I am sorry, now I see you were asking about THE SAME formats. No, that is not usually desirable, but my wording did not suggest it either.

@ShakespeareFan00 There can be many more reasons why a file could be useful in both PDF and DjVu, and I often upload books in both formats. While DjVu is better for further processing in Wikisource, ordinary non-Wikisource users of files from Commons usually prefer PDF, as PDF readers are much more widespread than DjVu readers. Many people are not able to open DjVu files.

Do we want to allow duplicates of the same format?

Yes. As I outlined in the other task, preventing dups only makes sense if we assume that to be an exceptional state; but post-bulk-upload that's now the rule. Cutting down on unnecessary dups would still be nice, but that's a warning not a hard limit.

Also, keep in mind, when you write "duplicate" you're really describing "happens to have the same IA identifier". There is no guarantee that this means bit-for-bit identical, which means it could have differences that are anywhere from trivial to significant. A regenerated text layer using a newer OCR engine being a prime example.

Remember, this only happens when uploading from the same IA ID via IA-Upload, not for user-generated re-OCRd files.

It's highly unlikely that the same format from the same ID will be different on the second upload (unless the file owner re-derived it in the meantime), so the upload would fail due to identical file hash anyway.
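
As an illustration of the identical-file check mentioned here: it amounts to comparing content hashes. The sketch below uses SHA-1, which MediaWiki records for uploaded files; the helper names and the byte strings are made up for the example and merely stand in for real file contents.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Hex SHA-1 digest of a file's raw bytes."""
    return hashlib.sha1(data).hexdigest()


def is_exact_duplicate(new_file: bytes, existing_file: bytes) -> bool:
    """True only when both files are bit-for-bit identical."""
    return sha1_hex(new_file) == sha1_hex(existing_file)


# Made-up stand-ins for file contents derived from the same IA item.
first_upload = b"%PDF-1.4 example bytes"
same_again = b"%PDF-1.4 example bytes"
regenerated = b"%PDF-1.4 example bytes with a new OCR layer"

print(is_exact_duplicate(same_again, first_upload))
print(is_exact_duplicate(regenerated, first_upload))
```

A re-derived file with the same IA identifier changes the bytes, and therefore the hash, so it would not be caught as an exact duplicate.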

And obviously uploading a different format is always fine to do.

So just warn that the ID appears to exist, let the user proceed, and if the file was identical, catch the failure and report.

Also, tell the user which file is the duplicate so they can go and check it out.

Please be wary of the usage of "duplicate", as it has a jargon meaning at Commons. If we are talking about works of the same file type, from the same source, at the same scale or less, then the file will typically be considered a duplicate. If it is of a different file type, or of a higher quality, then it isn't a duplicate. Of course, if it is of the same source and file type, and of better quality, then it should overwrite the existing file. Just want to ensure that we are keeping the jargon aligned.

That's a good point about 'duplicate' not being an actual duplicate. Which actually isn't being handled very well at the moment: it shows a generic 'unable to upload' error; I'll fix it to show a custom duplicate message with a link to the existing file.

So, forgive me if I'm getting confused, but I think we want these two cases:

  1. When the IA ID is already on a file, show this on the metadata page and allow the upload to continue:

An existing file is already linked to the IA identifier '$1': $2. Please only upload a new file if you're sure that it's required. A different format of the same IA item is OK.

  2. After submitting the metadata page (because that's the only point we actually know it's an exact duplicate):

Unable to upload, because an exact duplicate file already exists on Wikimedia Commons: $1

Is this messaging correct?
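
For reference, the two cases above could be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the tool's real implementation: the function names, the `$n` substitution helper, and the hash lookup table are all hypothetical.

```python
from typing import Dict, Optional

ID_EXISTS_WARNING = (
    "An existing file is already linked to the IA identifier '$1': $2. "
    "Please only upload a new file if you're sure that it's required. "
    "A different format of the same IA item is OK."
)
EXACT_DUPLICATE_ERROR = (
    "Unable to upload, because an exact duplicate file already exists "
    "on Wikimedia Commons: $1"
)


def fill(template: str, *params: str) -> str:
    """Substitute $1, $2, ... placeholders, highest index first."""
    for index in range(len(params), 0, -1):
        template = template.replace(f"${index}", params[index - 1])
    return template


def metadata_page_notice(ia_id: str, existing_file: Optional[str]) -> Optional[str]:
    """Case 1: warn on the metadata page, but never block the upload."""
    if existing_file is None:
        return None
    return fill(ID_EXISTS_WARNING, ia_id, existing_file)


def submit(new_sha1: str, commons_hashes: Dict[str, str]) -> str:
    """Case 2: only after submission can an exact duplicate be detected.

    commons_hashes is a hypothetical lookup from content hash to file title.
    """
    duplicate = commons_hashes.get(new_sha1)
    if duplicate is not None:
        return fill(EXACT_DUPLICATE_ERROR, duplicate)
    return "uploaded"


print(metadata_page_notice("in.ernet.dli.2015.3711", "File:Under the Greenwood Tree.pdf"))
print(submit("abc123", {"abc123": "File:Under the Greenwood Tree.pdf"}))
```

The point of the split is that the first message is advisory and the second is a hard failure, raised only once the file contents are known to match.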

Concur, as much as my brain can get to anything today

The above change is merged, and deployed to both the test and production sites.

Is it all good?

@Samwilson I have just tried it and unfortunately it failed :-(. There is a PDF file on Commons, https://commons.wikimedia.org/wiki/File:Bohemia_under_Hapsburg_misrule_(1915).pdf, which I uploaded from the Internet Archive (https://archive.org/details/bohemiaunderhaps00capeiala/mode/2up) some time ago. Now I wanted to upload the DjVu of the same item. First I received the message

An existing file is already linked to the IA identifier 'bohemiaunderhaps00capeiala': File:Bohemia under Hapsburg misrule (1915).pdf. Please only upload a new file if you're sure that it's required. A different format of the same IA item is OK.

That still seemed OK, so I clicked "upload", which was followed by another message: "File failed to upload". I tried it several times with the same result.

Besides that, I also wanted to overwrite an existing file, https://commons.wikimedia.org/wiki/File:Copy_of_the_Will_of_Augustine_Herrman,_of_Bohemia_Manor.pdf, with the same file and file type, only without the first page, which I had forgotten to exclude. I understood from the discussion above that such overwriting of files is going to be possible too, but it failed as well.

The IA tool did not warn me when creating duplicates yesterday, leading to duplicate indexes. I caught them by chance, but I want to flag this as an issue. If we want to permit the creation of duplicate files, albeit in different formats, then the warning needs to be in place and require confirmation to override.

The IA tool did not warn me when creating duplicates yesterday, leading to duplicate indexes. I caught them by chance, but I want to flag this as an issue. If we want to permit the creation of duplicate files, albeit in different formats, then the warning needs to be in place and require confirmation to override.

Please give examples rather than commentary that doesn't allow any further examination of the issue.

https://www.mediawiki.org/wiki/How_to_report_a_bug