
PDF and DjVu files on Commons fail to be processed (no thumbnails, zero pages) but are otherwise valid
Open, Needs Triage, Public

Description

I went through a Wikimedia Commons dump, checked for all invalid PDF and DjVu files (those with no thumbnails, 0x0 size, and zero pages), and tested them. Those that were really invalid I tried to replace with a fixed version or, if I could not find one, I marked them for speedy deletion.

But I have found some files which look invalid on Commons but seem to load fine (at least in Firefox for PDF files, and ddjvu for DjVu files). Maybe there is some issue with how they are processed on the backend?

Here is the list:

https://commons.wikimedia.org/wiki/File:Arheograficheskaya_komissiya_Letopis_zanyatij_01_1861.pdf (processing of thumbnails started, but then it died)
https://commons.wikimedia.org/wiki/File:CADAL08001216_文選樓叢書_疇人傳:卷十二.djvu
https://commons.wikimedia.org/wiki/File:CADAL08011455_清代学术丛书·第一集·颜氏学记:卷七至卷八.djvu
https://commons.wikimedia.org/wiki/File:Niva_1891-05.djvu
https://commons.wikimedia.org/wiki/File:Кирилова_книга_часть_8.djvu
https://commons.wikimedia.org/wiki/File:Русский_биографический_словарь._Том_15_(1910)_—_с._24-25.djvu
https://commons.wikimedia.org/wiki/File:Томские_губернские_ведомости,_1900_№_38_(28_сентября).djvu
https://commons.wikimedia.org/wiki/File:Указатель_статей_морского_сборника_1848_-_1872_г._1875(2).djvu
https://commons.wikimedia.org/wiki/File:Congressional_Research_Service_Reports_R45148_-_U.S._Trade_Policy_Primer_-_Frequently_Asked_Questions.pdf
https://commons.wikimedia.org/wiki/File:EUR_2014-1209.pdf
https://commons.wikimedia.org/wiki/File:%E8%AE%80%E6%9B%B8%E5%A0%82%E7%B6%B5%E8%A1%A3%E5%85%A8%E9%9B%86%E5%9B%9B%E5%8D%81%E5%85%AD%E5%8D%B7_%E6%B8%85%E5%BA%B7%E7%86%99%E5%88%BB%E6%9C%AC_%E7%AC%AC21%E5%86%8A.pdf

See also (and possibly a duplicate of): T297942, T298417, T299521

Event Timeline

What is this wikimirror.org? Why change links to that?

So this list is exhaustive. I went through all PDF and DjVu files on Wikimedia Commons as of last week, not just a random sample. If we fix these, then all of them will be fixed. :-)

No, this one seems to be just a slightly broken PDF. I just fixed it.

That's odd, I saved the PDF file starting from a Word document. (OK, on second thought that's not odd at all :-) ) Thanks!

So I fixed it using mutool clean. But the ones I listed above cannot be fixed this way, and that is what I am reporting. mutool clean does not fix them, the MediaBox values show reasonable page sizes (including for the first page), and even the metadata shows that the page size is available. Example for the first file above:

{
    "name": "pdf-PageSize",
    "value": [
        {
            "name": 0,
            "value": "612 x 792 pts (letter)"
        },
        {
            "name": 1,
            "value": "697 x 855 pts"
        }
    ]
}

But MediaWiki does not show the width and height. So something is wrong.
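To compare the metadata with the 0x0 size shown on the file page, the pdf-PageSize strings can be turned into plain numbers. The following is only a sketch: `parse_page_size` is a hypothetical helper, not part of MediaWiki, and it assumes the values always have the "<width> x <height> pts" shape seen in the example above.

```shell
#!/bin/sh
# parse_page_size VALUE: print "<width> <height>" (in points) for a
# pdf-PageSize string such as "612 x 792 pts (letter)".
# Returns non-zero if VALUE does not have the assumed shape.
parse_page_size() {
  printf '%s\n' "$1" |
    sed -n -E 's/^([0-9]+(\.[0-9]+)?) x ([0-9]+(\.[0-9]+)?) pts.*$/\1 \3/p' |
    grep .   # fail when sed printed nothing, i.e. the string did not match
}
```

For example, `parse_page_size "697 x 855 pts"` prints `697 855`. A non-empty result for the first page alongside a 0x0 size on the file page would suggest that the size is lost somewhere after metadata extraction.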

@mau If you made this PDF yourself, could I recommend removing the first blank page? Because otherwise the first thumbnail does not show anything.

@Mitar It's probably even better to replace the first page with the actual cover of the book, indeed. I'll proceed :-)

Mitar updated the task description.

I ran into the same problem. I don't know if this can be considered a solution, because these steps have to be done on the server side, but I solved my problem:

  1. Step 1 (MediaWiki core): repair the thumbnails of the files:
php maintenance/refreshImageMetadata.php --verbose --mime image/vnd.djvu --force
  2. Step 2: do a null edit of the index pages of Extension:Proofread_Page (needed to update the page count shown on the special page):
php maintenance/refreshLinks.php --namespace 252
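The two steps above could be wrapped so that they can be reviewed before running them on a server. This is a minimal sketch: `refresh_steps` and the install path passed to it are hypothetical, and the function only prints the commands quoted above rather than executing them.

```shell
#!/bin/sh
# refresh_steps MW_ROOT NAMESPACE: print the two maintenance commands from the
# steps above without running them. MW_ROOT is a hypothetical install path
# (paths containing spaces are not handled in this sketch).
refresh_steps() {
  mw_root=$1
  namespace=$2
  printf 'php %s/maintenance/refreshImageMetadata.php --verbose --mime image/vnd.djvu --force\n' "$mw_root"
  printf 'php %s/maintenance/refreshLinks.php --namespace %s\n' "$mw_root" "$namespace"
}
```

Running `refresh_steps /var/www/mediawiki 252` prints both commands; piping the output to `sh` would execute them in order.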

The listed PDF files seem to have been fixed now? Maybe the problem only remains for DjVu files.

MediaWiki no longer supports unbundled DjVu files. All non-bundled (indexed) files need to be converted with:

djvmcvt -b "$DJVU_PATH" "$BUNDLED_FILE"
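Before converting, it can help to check whether a file is already bundled. The sketch below classifies the text printed by `djvudump`; `djvu_layout` is a hypothetical helper, and the assumption that the DIRM line contains "(bundled," or "(indirect," is based on typical DjVuLibre output and should be verified against your version.

```shell
#!/bin/sh
# djvu_layout: read `djvudump` output on stdin and print "bundled",
# "indirect", or "unknown" (no matching DIRM line, e.g. a single-page file).
# Assumption: the DIRM line looks like
#   DIRM [59]  Document directory (bundled, 3 files 3 pages)
djvu_layout() {
  awk '
    /DIRM/ && /\(bundled,/  { print "bundled";  found = 1; exit }
    /DIRM/ && /\(indirect,/ { print "indirect"; found = 1; exit }
    END { if (!found) print "unknown" }
  '
}
```

It would be used as `djvudump "$DJVU_PATH" | djvu_layout`, converting only the files reported as `indirect`.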

See https://github.com/WolfgangFahl/djvu-viewer for a tool that can help with the mass migration of your wiki and can also help you create migration scripts.

Script example:

#!/bin/bash
# DjVu bundling script
# Generated for: /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1948_Hermsdorf.djvu
# Date: 2026-01-05T09:24:30.031857

set -e  # Exit on error

# Define variables
DJVU_PATH=/var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1948_Hermsdorf.djvu
DJVU_DIR=/var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08
BASENAME=Stadtroda-Kreis-AB-1948_Hermsdorf.djvu
BACKUP_FILE=/var/www/mediawiki/sites/genwiki.genealogy.net/djvu/backup/Stadtroda-Kreis-AB-1948_Hermsdorf.zip
BUNDLED_FILE=/var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1948_Hermsdorf_bundled.djvu

# Step 1: Create backup ZIP
cd "$DJVU_DIR"
echo 'Creating backup ZIP...'
zip -j "$BACKUP_FILE" "$BASENAME" \
  Stadtroda-Kreis-AB-1942-010_0001.djbz \
  Stadtroda-Kreis-AB-1942-001_0001.djvu \
  Stadtroda-Kreis-AB-1942-002_0001.djvu \
  Stadtroda-Kreis-AB-1942-003_0001.djvu \
  Stadtroda-Kreis-AB-1942-004_0001.djvu \
  Stadtroda-Kreis-AB-1942-005_0001.djvu \
  Stadtroda-Kreis-AB-1942-006_0001.djvu \
  Stadtroda-Kreis-AB-1942-007_0001.djvu \
  Stadtroda-Kreis-AB-1942-008_0001.djvu \
  Stadtroda-Kreis-AB-1942-009_0001.djvu \
  Stadtroda-Kreis-AB-1942-010_0001.djvu \
  Stadtroda-Kreis-AB-1942-020_0001.djbz \
  Stadtroda-Kreis-AB-1942-011_0001.djvu \
  Stadtroda-Kreis-AB-1942-012_0001.djvu \
  Stadtroda-Kreis-AB-1942-013_0001.djvu \
  Stadtroda-Kreis-AB-1942-014_0001.djvu \
  Stadtroda-Kreis-AB-1942-015_0001.djvu \
  Stadtroda-Kreis-AB-1942-016_0001.djvu \
  Stadtroda-Kreis-AB-1942-017_0001.djvu \
  Stadtroda-Kreis-AB-1942-018_0001.djvu \
  Stadtroda-Kreis-AB-1942-019_0001.djvu \
  Stadtroda-Kreis-AB-1942-020_0001.djvu \
  Stadtroda-Kreis-AB-1942-030_0001.djbz \
  Stadtroda-Kreis-AB-1942-021_0001.djvu \
  Stadtroda-Kreis-AB-1942-022_0001.djvu \
  Stadtroda-Kreis-AB-1942-023_0001.djvu \
  Stadtroda-Kreis-AB-1942-024_0001.djvu \
  Stadtroda-Kreis-AB-1942-025_0001.djvu \
  Stadtroda-Kreis-AB-1942-026_0001.djvu \
  Stadtroda-Kreis-AB-1942-027_0001.djvu \
  Stadtroda-Kreis-AB-1942-028_0001.djvu \
  Stadtroda-Kreis-AB-1942-029_0001.djvu \
  Stadtroda-Kreis-AB-1942-030_0001.djvu \
  Stadtroda-Kreis-AB-1942-039_0001.djbz \
  Stadtroda-Kreis-AB-1942-031_0001.djvu \
  Stadtroda-Kreis-AB-1942-032_0001.djvu \
  Stadtroda-Kreis-AB-1942-033_0001.djvu \
  Stadtroda-Kreis-AB-1942-034_0001.djvu \
  Stadtroda-Kreis-AB-1942-035_0001.djvu \
  Stadtroda-Kreis-AB-1942-036_0001.djvu \
  Stadtroda-Kreis-AB-1942-037_0001.djvu \
  Stadtroda-Kreis-AB-1942-038_0001.djvu \
  Stadtroda-Kreis-AB-1942-039_0001.djvu

# Step 2: Verify backup was created
if [ ! -f "$BACKUP_FILE" ]; then
  echo 'Error: Backup ZIP not created'
  exit 1
fi
echo "Backup created: $BACKUP_FILE"

# Step 3: Convert to bundled format
echo 'Converting to bundled format...'
djvmcvt -b "$DJVU_PATH" "$BUNDLED_FILE"

# Step 4: Verify bundled file was created
if [ ! -f "$BUNDLED_FILE" ]; then
  echo 'Error: Bundled file not created'
  exit 1
fi
echo "Bundled file created: $BUNDLED_FILE"

# Step 5: Sleep for CIFS sync (if needed)
sleep 1

# Step 6: Remove original files
echo 'Removing original files...'
rm -f "$DJVU_PATH"
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-010_0001.djbz
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-001_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-002_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-003_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-004_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-005_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-006_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-007_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-008_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-009_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-010_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-020_0001.djbz
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-011_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-012_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-013_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-014_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-015_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-016_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-017_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-018_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-019_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-020_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-030_0001.djbz
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-021_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-022_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-023_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-024_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-025_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-026_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-027_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-028_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-029_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-030_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-039_0001.djbz
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-031_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-032_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-033_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-034_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-035_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-036_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-037_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-038_0001.djvu
rm -f /var/www/mediawiki/sites/genwiki.genealogy.net/images/0/08/Stadtroda-Kreis-AB-1942-039_0001.djvu

# Step 7: Move bundled file to original location
echo 'Moving bundled file to original location...'
mv "$BUNDLED_FILE" "$DJVU_PATH"

echo 'Bundling complete!'
echo "Backup saved at: $BACKUP_FILE"
docker exec genwiki39-mw php maintenance/refreshImageMetadata.php --force --mime=image/vnd.djvu --start=Stadtroda-Kreis-AB-1948_Hermsdorf.djvu --end=Stadtroda-Kreis-AB-1948_Hermsdorf.djvu
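Step 6 of the generated script uses one explicit rm per component file. A more compact variant could read the component names from a list, as sketched below. `remove_components` and the list file are hypothetical; how the list is produced (for example, from the same names passed to zip in step 1) is left open here.

```shell
#!/bin/sh
# remove_components DIR LISTFILE: delete each file named in LISTFILE
# (one name per line, relative to DIR). Missing files are ignored,
# matching the behaviour of the `rm -f` lines in the generated script.
remove_components() {
  dir=$1
  list=$2
  while IFS= read -r name; do
    [ -n "$name" ] || continue   # skip blank lines
    rm -f -- "$dir/$name"
  done < "$list"
}
```

As in the original script, this should only run after the backup ZIP and the bundled file have both been verified.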

Regarding https://commons.wikimedia.org/wiki/File:Niva_1891-05.djvu: I'm offering to help fix all DjVu files on Commons. Whom do I need to contact?

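Some files just seem to be broken. For this one, djvudump reports a corrupt IFF file with an invalid chunk ID (the error message below is in German):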

wget https://upload.wikimedia.org/wikipedia/commons/2/25/%D0%A3%D0%BA%D0%B0%D0%B7%D0%B0%D1%82%D0%B5%D0%BB%D1%8C_%D1%81%D1%82%D0%B0%D1%82%D0%B5%D0%B9_%D0%BC%D0%BE%D1%80%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D1%81%D0%B1%D0%BE%D1%80%D0%BD%D0%B8%D0%BA%D0%B0_1848_-_1872_%D0%B3._1875%282%29.djvu
mv "Указатель_статей_морского_сборника_1848_-_1872_г._1875(2).djvu" "morskoy_sbornik.djvu"
file morskoy_sbornik.djvu 
morskoy_sbornik.djvu: DjVu multiple page document
djvudump morskoy_sbornik.djvu
*** [1-15108] Fehlerhafte IFF-Datei (ungültige Abschnitts-ID)
*** (IFFByteStream.cpp:248)
*** 'int DJVU::IFFByteStream::get_chunk(GUTF8String &, int *, int *)'
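The session above shows `file` reporting a multi-page DjVu document while djvudump fails. One plausible reading is that type detection only inspects the header, whereas djvudump walks every chunk. The sketch below reproduces such a header-only check; `djvu_kind` is a hypothetical helper, and it relies on DjVu files starting with "AT&TFORM", a 4-byte length, and then the form type.

```shell
#!/bin/sh
# djvu_kind FILE: classify a file by its first 16 bytes only.
# A "multi-page" result says nothing about whether the rest of the
# file is intact.
djvu_kind() {
  magic=$(head -c 8 "$1")                 # "AT&TFORM" for any DjVu file
  form=$(tail -c +13 "$1" | head -c 4)    # IFF form type at byte offset 12
  if [ "$magic" != "AT&TFORM" ]; then
    echo "not-djvu"
    return
  fi
  case $form in
    DJVM) echo "multi-page" ;;
    DJVU) echo "single-page" ;;
    *)    echo "other" ;;
  esac
}
```

A file that passes this check but fails in djvudump, like the one above, is probably damaged somewhere after the header rather than mislabeled.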

This problem still persists with PDFs. If there is a working solution for PDFs (that I missed in the above discussion), I would love to know which steps I can take to fix them!

I think I found a working solution using mutool. I will update if the thumbnail breaks again after a few days; if not, I will apply this solution to the other files.

Not sure how relevant this might be, but we just encountered some weird issues after a user uploaded new versions to try to fix some PDFs that had the 0x0, no-thumbnail problem.
https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard#c-RoyZuo-20260121203300-T%C3%BArelio-20260121180300

I think this task should be split: one subtask for the PDF problem and one subtask for the DjVu problem, with only the common ground discussed here.