
404 for old file versions on Wikimedia Commons with empty archive name
Open, Low, Public

Description

Hi all;

I'm trying to download Wikimedia Commons, but I have found some errors. For
example:

  • oi_archive_name is empty for this file

http://commons.wikimedia.org/wiki/File:Nl-scheikundig.ogg#filehistory

  • link is broken and you get an empty file

http://commons.wikimedia.org/wiki/File:SMS_Bluecher.jpg#filehistory

Are you aware of these errors in old files? Is this going to be fixed?

Regards,
emijrp


Version: unspecified
Severity: major

Details

Reference
bz35367

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 12:19 AM
bzimport set Reference to bz35367.
bzimport added a subscriber: Unknown Object (MLST).
Emijrp created this task. Mar 20 2012, 9:15 PM
Reedy added a comment. Mar 20 2012, 9:29 PM

(In reply to comment #0)

It can only be fixed if the files in question still exist in some backup or similar.

aaron added a comment. Mar 20 2012, 9:31 PM

It may still be on NFS; I've seen this in various places.

(In reply to comment #1)

There are more errors like those; I didn't make a comprehensive list.

There are more bugs like this.

Bawolff, do you have suggestions on how to break this bug down into actionable items?
We probably need the following:

  1. some maintenance script to list the files affected by each of the problems in question (oi_archive_name empty, archived versions whose links return "404 Not Found", etc.),
  2. scripts or similar to correct the wrong metadata (where that's the problem), or to look for the missing files on NFS and restore them,
  3. a bug to track what needs to be done about the leftovers.
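The first item above could be sketched roughly as follows. This is only an illustration, not an existing maintenance script: `OldVersion`, `classify`, `triage` and the `http_status` callback are hypothetical names, and how the HTTP status of an archived version is actually obtained is left to the caller.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple


class OldVersion(NamedTuple):
    """One archived file version; fields mirror columns of the oldimage table."""
    oi_name: str
    oi_timestamp: str
    oi_archive_name: str


def classify(version: OldVersion,
             http_status: Callable[[OldVersion], int]) -> str:
    """Return a problem label for a single archived version.

    http_status is a caller-supplied (hypothetical) function that returns
    the HTTP status code obtained when fetching the archived file.
    """
    if not version.oi_archive_name:
        # Broken metadata: there is no archive name to build a link from.
        return "empty_archive_name"
    status = http_status(version)
    if status == 404:
        return "missing_file"
    if status != 200:
        return "other_error"
    return "ok"


def triage(versions: Iterable[OldVersion],
           http_status: Callable[[OldVersion], int]) -> Dict[str, List[OldVersion]]:
    """Group archived versions by problem label."""
    report: Dict[str, List[OldVersion]] = {}
    for version in versions:
        report.setdefault(classify(version, http_status), []).append(version)
    return report
```

Keeping the status lookup as a parameter means the same grouping logic can run against a live check, a cached list of suspects, or a test stub.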

I'm downloading all the Commons files with emijrp's script, so we already have huge lists of suspects, e.g. https://archive.org/download/wikimediacommons-201208/2012-08-check.txt

(Data loss -> critical.)

Well, the easiest ones to find would be everything returned by:

  select oi_name, oi_timestamp from oldimage where oi_archive_name = '';

This could be done by anyone with Labs access.
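To show what that query returns, here is a minimal sketch that runs it against a toy table. It assumes an in-memory SQLite database as a stand-in for the real replica; the three-column table is a simplification of oldimage, the rows are made up, and `make_toy_db` and `find_empty_archive_names` are hypothetical helper names.

```python
import sqlite3

# The query from the comment above, with an ORDER BY added for stable output.
QUERY = (
    "SELECT oi_name, oi_timestamp FROM oldimage "
    "WHERE oi_archive_name = '' "
    "ORDER BY oi_name, oi_timestamp"
)


def make_toy_db() -> sqlite3.Connection:
    """Build an in-memory table with made-up rows (not real Commons data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE oldimage ("
        "oi_name TEXT NOT NULL, "
        "oi_archive_name TEXT NOT NULL DEFAULT '', "
        "oi_timestamp TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO oldimage (oi_name, oi_archive_name, oi_timestamp) "
        "VALUES (?, ?, ?)",
        [
            ("Example_A.ogg", "", "20080101000000"),
            ("Example_B.jpg", "20090101000000!Example_B.jpg", "20090101000000"),
            ("Example_C.png", "", "20100101000000"),
        ],
    )
    return conn


def find_empty_archive_names(conn: sqlite3.Connection) -> list:
    """Return (oi_name, oi_timestamp) for rows with an empty archive name."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for name, timestamp in find_empty_archive_names(make_toy_db()):
        print(name, timestamp)
```

Against the real database, only the connection would change; the query itself is the one given in the comment.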

After that, one can look in the thumbnail log. From what I've seen of it, it's full of lines about thumbnails failing due to a missing src path (this seems to be the main cause of failing PNG thumbnails now that vips has removed the size limit on that format).

As an aside, it'd be nice if we graphed the number of missing files somewhere in ganglia. Anecdotally, it seems like there are more of them than there used to be. It would be good to get real stats on this very scary problem.

Btw, one probable cause of the recent incidents may have been fixed; see bug 54736.

See also related bug 54776

aaron added a comment. May 14 2014, 9:26 PM

*** Bug 60766 has been marked as a duplicate of this bug. ***

*** Bug 41320 has been marked as a duplicate of this bug. ***

*** Bug 56218 has been marked as a duplicate of this bug. ***

Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 5:57 PM
Restricted Application added subscribers: Matanya, Aklapper. Sep 4 2015, 5:57 PM