Page MenuHomePhabricator

FileImporter cannot import files with missing sha1 field
Closed, ResolvedPublic5 Estimated Story Points

Description

There is currently no consensus to do so, but I wanted to see whether it was possible to import nl:Bestand:Wiki.png to Commons. When I click on the export link, I get the following error message:

FileImporter\Data\FileRevision: Missing sha1 field on construction

This limitation isn't listed on the project page on Meta.

Event Timeline

Bdijkstra created this task.Mar 3 2020, 2:56 PM
Restricted Application added a project: archived--TCB-Team. · View Herald TranscriptMar 3 2020, 2:56 PM

It seems one file version is missing a sha1, see api https://nl.wikipedia.org/w/api.php?action=query&titles=Bestand:Wiki.png&prop=imageinfo&iiprop=sha1|timestamp&iilimit=max
version is "timestamp": "2004-07-23T09:10:05Z"

I am not sure if the maintenance script populateImageSha1.php can fix it or how to request a run.

Bdijkstra renamed this task from FileImporter cannot import really old files to FileImporter cannot import files with missing sha1 field.Mar 20 2020, 10:17 PM
Bdijkstra updated the task description. (Show Details)

There is indeed one file revision without a sha1 in this file's history:
https://nl.wikipedia.org/wiki/Speciaal:ApiSandbox#action=query&prop=imageinfo&titles=Bestand:Wiki.png&iiprop=timestamp%7Csha1&iilimit=500

I had a quick look at some other code, and it appears like we need to set the sha1 field to null or an empty string then:
https://codesearch.wmflabs.org/search/?q=(isset%7Carray_key_exists)%5C(%5B%5E()%5D*%5B%22%27%5Dsha1%5B%22%27%5D

To me this sounds like a bug we should fix. I don't see a reason to block imports because of a missing sha1 field.

However, we can not blindly change this code!

  • ImportDetails class requires the field, and might behave bad if it contains null. Proposed solution: Fall back to the timestamp.
  • DuplicateFileRevisionChecker can't work without a sha1. Proposed solution: Skip the check if the sha1 is null or an empty string.
Lena_WMDE set the point value for this task to 5.
Andrew-WMDE moved this task from Sprint Backlog to Doing on the WMDE-QWERTY-Sprint-2020-05-27 board.

I took a look at the file history and this time also included the url iiprop:
https://nl.wikipedia.org/wiki/Speciaal:ApiSandbox#action=query&format=json&prop=imageinfo&titles=Bestand%3AWiki.png&iiprop=timestamp%7Csha1%7Curl&iilimit=500

The url for the revision which is missing a sha1 is the following:
https://upload.wikimedia.org/wikipedia/nl/archive/b/bc/20040724095936%21Wiki.png

When trying to access this file the following is returned:

File not found: /v1/AUTH_mw/wikipedia-nl-local-public/archive/b/bc/20040724095936%21Wiki.png

So naturally when importing this particular file while ignoring the missing sha1 (by setting it equal to an empty string) you still get the following error:

<Error, collected 1 error(s) on the way, integer value set> +------+---------------------------+------------------------------------------+ | 1 | http-bad-status | 404 Not Found | +------+---------------------------+------------------------------------------+

Now if I understand https://www.mediawiki.org/wiki/API:Imageinfo correctly, then the sha1 should be present unless a filehidden property is returned:

Adds SHA-1 hash for the file. If the file has been revision deleted, a filehidden property will be returned.

To me it It feels like this is a weird edge case, where the file is actually gone/deleted but the api acts as if it still exists (minus the sha1)?
As a file doesn't technically exist, we probably want to to treat this in the same way we do a filehidden, or?

@thiemowmde what are your thoughts?

FWIW, looking at the edits made at the Village Pump and by the uploader on that day, it seems that there was nothing wrong with the upload at the time.

Excellent analysis! In my opinion, it's worth doing two things:

  • Let's add the code that fixes a missing SHA1. I mean, confronting the user with an exception is clearly a bug and should be fixed. Maybe this helps making a few more imports possible. I don't think it can hurt.
  • We currently check for filehidden as well as filemissing in the API response. In both cases, we block the entire import. It might be possible to skip older file revisions instead. But this sounds like it should be a separate task – or never be done. For now, let's handle the edge case above the same way. We can even reuse the existing error message: "Can't import file because at least one of its revisions is missing an image file." (fileimporter-filemissinginrevision)

Change 599866 had a related patch set uploaded (by Andrew-WMDE; owner: Andrew-WMDE):
[mediawiki/extensions/FileImporter@master] Allow import when image info is missing "sha1"

https://gerrit.wikimedia.org/r/599866

Change 601637 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: Thiemo Kreuz (WMDE)):
[mediawiki/extensions/FileImporter@master] Do not require file revision sha1 anywhere

https://gerrit.wikimedia.org/r/601637

Change 601637 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@master] Do not require file revision sha1 anywhere

https://gerrit.wikimedia.org/r/601637

Change 602352 had a related patch set uploaded (by Andrew-WMDE; owner: Andrew-WMDE):
[mediawiki/extensions/FileImporter@master] Improve error message shown when an image url leads to a 404

https://gerrit.wikimedia.org/r/602352

Change 602393 had a related patch set uploaded (by Andrew-WMDE; owner: Andrew-WMDE):
[mediawiki/extensions/FileImporter@master] [WIP] Do not require text revision sha1 anywhere

https://gerrit.wikimedia.org/r/602393

Change 602393 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@master] Do not require text revision sha1 anywhere

https://gerrit.wikimedia.org/r/602393

Change 602672 had a related patch set uploaded (by Andrew-WMDE; owner: Andrew-WMDE):
[mediawiki/extensions/FileImporter@master] Add "missing sha1" test for duplicate checker

https://gerrit.wikimedia.org/r/602672

Change 602672 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@master] Add "missing sha1" test for duplicate checker

https://gerrit.wikimedia.org/r/602672

Change 599866 abandoned by Thiemo Kreuz (WMDE):
Allow import when image info is missing "sha1"

Reason:
I tried to rebase this, but it turns out everything is already done in other patches now.

https://gerrit.wikimedia.org/r/599866

Change 602352 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@master] Improve error message shown when an image url leads to a 404

https://gerrit.wikimedia.org/r/602352

Lena_WMDE closed this task as Resolved.Jun 11 2020, 10:45 AM
Lena_WMDE moved this task from Demo to Done on the WMDE-QWERTY-Sprint-2020-05-27 board.