Old file versions may include duplicate content, which breaks import
Closed, Resolved · Public · 5 Estimated Story Points

Description

If I try to import https://et.wikipedia.org/wiki/Fail:Image002.gif, it gives an error because one of its older versions is identical to https://commons.wikimedia.org/wiki/File:Maks.gif, which was imported to Commons earlier. It occasionally happens that a new version is accidentally uploaded under the wrong title, or a file is intentionally overwritten, but it then gets reverted and the new version is uploaded under a new title. I believe it should still be possible to import files under both titles along with their entire file histories.

I recently wrote about other kinds of checks on old versions (T216516#5349251) which seem unwanted to me. Maybe there is some generic solution to this.

Event Timeline

Pikne created this task. Jul 28 2019, 2:04 PM
Restricted Application added a project: archived--TCB-Team. Jul 28 2019, 2:04 PM
awight renamed this task from "Old file versions are checked for duplicates" to "Old file versions may include duplicate content, which breaks import". Sep 10 2019, 10:26 PM
awight triaged this task as Lowest priority.
awight updated the task description.
Pikne added a subscriber: awight. Sep 11 2019, 10:25 AM

Update: We haven't seen this error yet.

@awight, do you mean that you can't reproduce this? I still run into an error if I try to import "Image002.gif" mentioned above. On Grafana this shows up as a "duplicateFiles" error. For July 28th, the day another user reported this on wiki and I created this ticket, I see 12 errors of this type recorded.

awight raised the priority of this task from Lowest to Medium. Sep 11 2019, 1:44 PM

I somehow confused this task with T185734. Reverting my changes; thank you for noticing!

Pikne updated the task description. May 23 2020, 7:27 PM
Lena_WMDE set the point value for this task to 5.
Lena_WMDE added a subscriber: Lena_WMDE.
  • It should be possible to upload files with duplicates in the older revisions
  • It should not be possible to upload files where the latest revision is a duplicate

Isn't this already the current implementation? See https://phabricator.wikimedia.org/diffusion/EFLI/browse/master/src/Services/ImportPlanValidator.php$280-282
@thiemowmde @awight
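
For illustration only, the two acceptance criteria quoted above can be sketched roughly as follows. This is not the code in ImportPlanValidator.php; the function name `is_import_allowed` and the idea of comparing SHA-1 strings against a set of hashes already present on the target wiki are assumptions made for the example.

```python
from typing import Sequence, Set


def is_import_allowed(revision_sha1s: Sequence[str], existing_sha1s: Set[str]) -> bool:
    """Hypothetical check: only the latest file revision may block an import.

    revision_sha1s is ordered oldest to newest; existing_sha1s holds the
    hashes of files already present on the target wiki (an assumption).
    """
    if not revision_sha1s:
        raise ValueError("a file needs at least one revision")
    latest = revision_sha1s[-1]
    # Older revisions are deliberately ignored, even if they are duplicates.
    return latest not in existing_sha1s


# Duplicate only in an older revision: import goes through.
print(is_import_allowed(["dup", "new"], {"dup"}))
# Latest revision is a duplicate: import is refused.
print(is_import_allowed(["old", "dup"], {"dup"}))
```

Everything here hinges on correctly identifying which revision is the latest one.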

Change 602677 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: Thiemo Kreuz (WMDE)):
[mediawiki/extensions/FileImporter@master] Fix incomplete "latest file revision" calculation

https://gerrit.wikimedia.org/r/602677

Indeed.

I started digging into the code and could reproduce the issue locally. Import https://et.wikipedia.org/wiki/Fail:Image002.gif first. Then try to import https://commons.wikimedia.org/wiki/File:Maks.gif. That's blocked.

Look at the history of https://et.wikipedia.org/wiki/Fail:Image002.gif. The file revision with the latest timestamp is not the 1st, but the 2nd in the history. That's because the 1st is a revert. Unfortunately our code was exclusively looking at the timestamp to find the latest revision. I uploaded a patch to fix this.
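
A minimal sketch of the difference, not the actual patch: it assumes that old file versions carry an `archivename` value while the current version does not, which is my reading of how the file history is reported. The `FileRevision` class, both function names, and the sample data are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FileRevision:
    timestamp: str                     # ISO 8601, so string comparison orders correctly
    sha1: str
    archivename: Optional[str] = None  # assumed to be set only on old versions


def latest_by_timestamp(revisions: Sequence[FileRevision]) -> FileRevision:
    """The incomplete approach: trust the timestamp alone."""
    return max(revisions, key=lambda r: r.timestamp)


def latest_by_archive_flag(revisions: Sequence[FileRevision]) -> FileRevision:
    """Treat the one revision without an archivename as the current one."""
    current = [r for r in revisions if r.archivename is None]
    if len(current) != 1:
        raise ValueError("expected exactly one current revision")
    return current[0]


# Made-up history: the current version is a revert and carries the older timestamp.
history = [
    FileRevision("2007-01-01T00:00:00Z", "sha1-of-reverted-to-content"),
    FileRevision("2008-01-01T00:00:00Z", "sha1-of-overwrite",
                 archivename="20080101000000!Example.gif"),
]

print(latest_by_timestamp(history).sha1)     # picks the overwritten old version
print(latest_by_archive_flag(history).sha1)  # picks the actual current version
```

With the timestamp-only approach, the duplicate check runs against the old overwritten content, which is how an old duplicate could end up blocking the import.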

Change 603396 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: Thiemo Kreuz (WMDE)):
[mediawiki/extensions/FileImporter@master] Add missing test cases for revisions marked with "archivename"

https://gerrit.wikimedia.org/r/603396

Change 602677 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@master] Fix incomplete "latest file revision" calculation

https://gerrit.wikimedia.org/r/602677

Change 603396 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@master] Add missing test cases for revisions marked with "archivename"

https://gerrit.wikimedia.org/r/603396

WMDE-Fisch closed this task as Resolved. Jun 9 2020, 11:53 AM
WMDE-Fisch claimed this task.
WMDE-Fisch moved this task from Demo to Done on the WMDE-QWERTY-Sprint-2020-05-27 board.