PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid
Open, Needs Triage, Public

Description

I went through a Wikimedia Commons dump, checked for all invalid PDF and DjVu files (those with no thumbnails, 0x0 size, and zero pages), and tested them. Those that were really invalid I tried to replace with a fixed version; if I could not find a fixed version, I marked them for speedy deletion.

But I also found some files that look invalid on Commons yet seem to load fine (at least in Firefox for PDF, and ddjvu for DjVu files). Maybe there is some issue with how they are processed on the backend?

Here is the list:

https://commons.wikimedia.org/wiki/File:Arheograficheskaya_komissiya_Letopis_zanyatij_01_1861.pdf (processing of thumbnails started, but then it died)
https://commons.wikimedia.org/wiki/File:CADAL08001216_文選樓叢書_疇人傳:卷十二.djvu
https://commons.wikimedia.org/wiki/File:CADAL08011455_清代学术丛书·第一集·颜氏学记:卷七至卷八.djvu
https://commons.wikimedia.org/wiki/File:Niva_1891-05.djvu
https://commons.wikimedia.org/wiki/File:Кирилова_книга_часть_8.djvu
https://commons.wikimedia.org/wiki/File:Русский_биографический_словарь._Том_15_(1910)_—_с._24-25.djvu
https://commons.wikimedia.org/wiki/File:Томские_губернские_ведомости,_1900_№_38_(28_сентября).djvu
https://commons.wikimedia.org/wiki/File:Указатель_статей_морского_сборника_1848_-_1872_г._1875(2).djvu
https://commons.wikimedia.org/wiki/File:Congressional_Research_Service_Reports_R45148_-_U.S._Trade_Policy_Primer_-_Frequently_Asked_Questions.pdf
https://commons.wikimedia.org/wiki/File:EUR_2014-1209.pdf
https://commons.wikimedia.org/wiki/File:%E8%AE%80%E6%9B%B8%E5%A0%82%E7%B6%B5%E8%A1%A3%E5%85%A8%E9%9B%86%E5%9B%9B%E5%8D%81%E5%85%AD%E5%8D%B7_%E6%B8%85%E5%BA%B7%E7%86%99%E5%88%BB%E6%9C%AC_%E7%AC%AC21%E5%86%8A.pdf

See also (and possibly duplicate with): T297942, T298417, T299521
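
The criteria used above (0x0 size, zero pages) can be checked per file through the MediaWiki API. What follows is a minimal sketch, not the method used to build the list: it assumes the `imageinfo` response carries `width`, `height` and, for multi-page files, `pagecount`. The helper names `imageinfo_url` and `looks_unprocessed` are hypothetical, and the network call is left out so the classifier can be tried offline.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def imageinfo_url(title: str) -> str:
    """Build a query URL asking for size and MIME type of one file page."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "size|mime",
        "titles": title,
    }
    return f"{API}?{urlencode(params)}"


def looks_unprocessed(info: dict) -> bool:
    """True when the backend reports a 0x0 size or no pages.

    `info` is one entry of the `imageinfo` list for a PDF or DjVu file.
    Treating a missing `pagecount` as zero pages is an assumption of
    this sketch, so it should not be applied to single-page formats.
    """
    zero_size = info.get("width", 0) == 0 and info.get("height", 0) == 0
    no_pages = info.get("pagecount", 0) == 0
    return zero_size or no_pages
```

Fetching `imageinfo_url("File:Niva_1891-05.djvu")` and passing the first `imageinfo` entry to `looks_unprocessed` would then say whether the backend currently reports the file as empty.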

Event Timeline

What is this wikimirror.org? Why change links to that?

So this list is exhaustive, not just a random sample: I went through all PDF and DjVu files on Wikimedia Commons as of last week. If we fix these, then all of them will be fixed. :-)

No, this one seems just a slightly broken PDF. I just fixed it.

That's odd, I saved the PDF file starting from a Word document. (OK, on second thought that's not odd at all :-) ) Thanks!

So I fixed it using mutool clean. But the ones I listed above cannot be fixed this way, and that is what I am reporting. mutool clean does not fix them, their MediaBox values show reasonable page sizes (including the first page), and even the metadata has the page size available. Example for the first file above:

{
    "name": "pdf-PageSize",
    "value": [
        {
            "name": 0,
            "value": "612 x 792 pts (letter)"
        },
        {
            "name": 1,
            "value": "697 x 855 pts"
        }
    ]
}

But MediaWiki does not show width and height, so something is wrong.
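
To compare what the metadata knows with what the file page shows, the page sizes can be pulled out of a `pdf-PageSize` entry like the one quoted above. This is only a sketch: the `page_sizes` helper is hypothetical and it assumes every value has the form `"<width> x <height> pts"`, optionally followed by a paper-size name.

```python
import json
import re

# Matches values such as "612 x 792 pts (letter)" or "697 x 855 pts".
SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*pts")


def page_sizes(entry: dict) -> dict:
    """Map page index -> (width, height) in points for a pdf-PageSize entry."""
    if entry.get("name") != "pdf-PageSize":
        return {}
    sizes = {}
    for item in entry.get("value", []):
        match = SIZE_RE.match(str(item["value"]))
        if match:
            sizes[item["name"]] = (float(match.group(1)), float(match.group(2)))
    return sizes


sample = json.loads("""
{"name": "pdf-PageSize", "value": [
  {"name": 0, "value": "612 x 792 pts (letter)"},
  {"name": 1, "value": "697 x 855 pts"}
]}
""")
print(page_sizes(sample))  # {0: (612.0, 792.0), 1: (697.0, 855.0)}
```

A non-empty result next to a 0x0 size on the file page would point at the step that derives width and height, not at the PDF itself.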

@mau If you made this PDF yourself, could I recommend removing the first blank page? Because otherwise the first thumbnail does not show anything.

@Mitar it's probably even better to replace the first page with the actual cover of the book, indeed. I'll proceed :-)

Mitar updated the task description.

I ran into the same problem. I don't know if this can be considered a solution, because these steps have to be done on the server side, but this is how I solved my problem:

  1. step – repair thumbnails for files in core MediaWiki
php maintenance/refreshImageMetadata.php --verbose --mime image/vnd.djvu --force
  2. step – do a null edit of the index pages of Extension:Proofread_Page (needed to update the page-count information shown on the special page)
php maintenance/refreshLinks.php --namespace 252
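
The two steps above could be wrapped in a small script for anyone with shell access to their own wiki. This is a sketch only: the function names and the `mediawiki_root` parameter are hypothetical, the commands and flags are copied from the steps as written, and it defaults to a dry run that only prints them.

```python
import subprocess
from pathlib import Path


def repair_commands(djvu_mime: str = "image/vnd.djvu", index_ns: int = 252) -> list:
    """Return the two maintenance invocations from the steps above, in order."""
    return [
        ["php", "maintenance/refreshImageMetadata.php",
         "--verbose", "--mime", djvu_mime, "--force"],
        ["php", "maintenance/refreshLinks.php", "--namespace", str(index_ns)],
    ]


def run_repair(mediawiki_root: Path, dry_run: bool = True) -> None:
    """Print each command; execute it from the wiki root unless dry_run."""
    for cmd in repair_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing script.
            subprocess.run(cmd, cwd=mediawiki_root, check=True)
```

Calling `run_repair(Path("/path/to/mediawiki"))` prints both commands without running them; passing `dry_run=False` executes them in order from that directory.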