PDF file has 0x0 image size in Commons
Open, Needs TriagePublic

Description

It has happened to me 3 times recently. I uploaded a dummy file and reverted to the original, and that fixed the issue.
https://commons.wikimedia.org/wiki/File:Annales_de_chimie,_tome_13-14,_1792.pdf
https://commons.wikimedia.org/wiki/File:Annales_de_chimie,_tome_11-12,_1791-1792.pdf
https://commons.wikimedia.org/wiki/File:Annales_de_chimie,_tome_15-16,_1792-1793.pdf

So this seems to be systematic and reproducible:

  1. Upload a large PDF file.
  2. If the size shown is 0 x 0 pixels, upload another file and then revert to the original; this fixes the issue.

Event Timeline

In File:Sabah, Sarawak and Singapore (State Constitutions) Order in Council 1963.pdf, no page thumbnails appear after page 1 of the index (example). Similar problems happen with this other file.

When entering the affected pages on Wikisource, it displays "Failed to initialize OpenSeadragon, no image found." OCR is also disabled. Purging the Commons and Wikisource description pages is useless, and Yann's solution is not effective for my files.

This happens to all PDF files, but if the files are not too big, a simple purge fixes the issue.

I hope this helps.

This issue seems to have disappeared. Any change lately?

I have (or had) the same issue with several files in https://commons.wikimedia.org/wiki/Category:Folkekalender_for_Danmark. It seems that for some files the problem disappears suddenly, while other files seem to work at first but then stop working.
For some files the thumbnails are the issue, and for some files OpenSeadragon on Wikisource fails.
I can find no logic in what happens.
I tried purging the files, which worked for some files but not for others.
I tried deleting and undeleting, but that did not work.

This issue seems to have disappeared. Any change lately?

No change. Purge is useless for my files. Also, is there any way to find other technical users for help? This task has been stuck for weeks.

This issue seems to have disappeared. Any change lately?

Actually, it didn't. For some small files, thumbnails appear directly, but for most files, they do not.

@Yann fixed https://commons.wikimedia.org/wiki/File:Folkekalender_for_Danmark_1862.pdf some days ago and I saw that it worked. Today it does not work (at least for me).
So there's something strange in the neighborhood.... So time to call the Ghostbusters?

Following a suggestion, I changed page 1 to a "plain" image, and that seems to have made https://commons.wikimedia.org/wiki/File:Folkekalender_for_Danmark_1862.pdf work, at least for now. So perhaps part of the problem is/was that there was some code included in page 1.

Here is an example of a persistent issue, no matter what I do: https://commons.wikimedia.org/wiki/File:Gide_-_Le_Journal_des_Faux-monnayeurs.pdf
And it is a small file, so the issue is not the size.

I have another one with the 0 × 0 problem: https://commons.wikimedia.org/wiki/File:Giuseppe_Fraccaroli_-_L%27Isola_dei_ciechi,_Milano,_Arnaldo_De_Mohr_e_C.,_1907.pdf This was working fine when I uploaded it last year, then recently I came upon it and it's gone. Purging doesn't help.

Also, I'm not sure if it's related, but this file: https://commons.wikimedia.org/wiki/File:Viaggio_in_Sicilia,_1831_-_volume_I_-_BEIC_IE4742922.djvu seems to be working fine in Commons, but on Wikisource it shows the 0 × 0 pixel problem: https://it.wikisource.org/wiki/File:Viaggio_in_Sicilia,_1831_-_volume_I_-_BEIC_IE4742922.djvu

As other people have reported, the problem is increasing. Many files which were OK at the time of upload now have a problem. See screenshot.

PDF display issue.jpg (879×1 px, 597 KB)

And now we have PDF files that have totally normal page thumbnails on Commons, yet have thumbnail problems on English Wikisource—see this index and the source file. The problem only worsens.

0x0 means that metadata extraction failed, which probably means it took too much time or too much memory.

This can't be the case here, as the file is 841 × 1,650 pixels, 21 pages, 7.42 MB. It looks OK now. This erratic behavior is really weird and frustrating.

For big files, this explanation is not convincing either. Why would uploading a dummy file lessen the server load?

Problem still here: https://it.wikisource.org/wiki/File:Viaggio_in_Sicilia,_1831_-_volume_I_-_BEIC_IE4742922.djvu (521 × 2,000 pixels, 47.78 MB, 306 pages) and here https://commons.wikimedia.org/wiki/File:Folkekalender_for_Danmark_1870.pdf (733 × 954 pixels, 40.99 MB, 147 pages). This is even weirder, as it displays fine here: https://da.wikisource.org/wiki/Fil:Folkekalender_for_Danmark_1870.pdf

I looked into the code behind this and here's what I think is happening:

When a PDF is uploaded, the system runs pdfinfo to extract page dimensions. If that step fails or times out (more likely with larger files, but can happen with any file), the dimensions get saved as 0/0. This bad data then gets cached for up to a month, which is why purging often doesn't help.

The workaround (upload dummy and revert) works because it forces a fresh metadata extraction attempt.

I believe a relatively safe fix would be to make PdfHandler::isFileMetadataValid() check that stored dimensions are actually non-zero. If they're 0/0, it would return METADATA_BAD, triggering an automatic re-read and essentially automating the current workaround.
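The proposed check could look roughly like the following. This is a minimal Python model of the idea, not the PdfHandler PHP code: the metadata layout (a `pages` mapping from page number to `width`/`height`) and the constant values are assumptions made for illustration.

```python
# Illustrative model only; names and data layout are assumptions.
METADATA_GOOD = "good"
METADATA_BAD = "bad"


def is_file_metadata_valid(metadata):
    """Treat missing page data or any 0 x 0 page as bad metadata."""
    pages = metadata.get("pages")
    if not pages:
        return METADATA_BAD
    for size in pages.values():
        # A failed extraction is modelled as False or as zero dimensions.
        if not size or size.get("width", 0) <= 0 or size.get("height", 0) <= 0:
            return METADATA_BAD
    return METADATA_GOOD


print(is_file_metadata_valid({"pages": {1: {"width": 841, "height": 1650}}}))
print(is_file_metadata_valid({"pages": {1: {"width": 0, "height": 0}}}))
```

A result of `METADATA_BAD` would then be the signal to re-read the metadata, which is what the dummy-upload workaround achieves by hand.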

I have another one with the 0 × 0 problem: https://commons.wikimedia.org/wiki/File:Giuseppe_Fraccaroli_-_L%27Isola_dei_ciechi,_Milano,_Arnaldo_De_Mohr_e_C.,_1907.pdf This was working fine when I uploaded it last year, then recently I came upon it and it's gone. Purging doesn't help.

Also, I'm not sure if it's related, but this file: https://commons.wikimedia.org/wiki/File:Viaggio_in_Sicilia,_1831_-_volume_I_-_BEIC_IE4742922.djvu seems to be working fine in Commons, but on Wikisource it shows the 0 × 0 pixel problem: https://it.wikisource.org/wiki/File:Viaggio_in_Sicilia,_1831_-_volume_I_-_BEIC_IE4742922.djvu

Your DJVU example seems to be a separate but related issue. I confirmed it:

Commons:        width=521, height=2000  
it.wikisource:  width=0,   height=0

The file has correct metadata on Commons, but Wikisource sees 0/0. This is because DjVuHandler::getDimensionInfo() caches dimensions per-wiki with TTL_INDEFINITE. If the first lookup returned bad data, it stays cached forever on Wikisource, even after Commons corrects it. Might be worth a separate ticket for this caching bug.
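To illustrate why the two wikis can disagree, here is a small Python model of a per-wiki cache without expiry. It is a sketch under assumptions, not the DjVuHandler code: the class, the wiki identifiers and the file name are hypothetical.

```python
class PerWikiDimensionCache:
    """Model of a per-wiki cache whose entries never expire."""

    def __init__(self):
        self._entries = {}  # (wiki, file name) -> (width, height)

    def get_dimensions(self, wiki, name, lookup):
        key = (wiki, name)
        if key not in self._entries:
            # The first lookup wins, even if it returned bad data.
            self._entries[key] = lookup()
        return self._entries[key]


cache = PerWikiDimensionCache()
name = "Example.djvu"  # hypothetical file name

# The first lookup on Wikisource fails and yields 0 x 0.
first = cache.get_dimensions("itwikisource", name, lambda: (0, 0))
# Commons has its own cache entry and gets the right answer.
commons = cache.get_dimensions("commonswiki", name, lambda: (521, 2000))
# A later Wikisource request never reaches the lookup, so it stays 0 x 0.
later = cache.get_dimensions("itwikisource", name, lambda: (521, 2000))
print(first, commons, later)
```

Because the key includes the wiki, fixing the value on Commons does nothing for the Wikisource entry in this model.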

This can't be the case here, as the file is 841 × 1,650 pixels, 21 pages, 7.42 MB.

It's possible there are other causes, but small files can still sometimes require a lot of CPU or memory. File size is correlated, but it's not a hard rule.

I'm just speculating here. It would probably be good for someone to go through the logs and verify what the cause of the failure actually is.

Why would uploading a dummy file lessen the server load?

Because it triggers a redo of processing the file.

I believe a relatively safe fix would be to make PdfHandler::isFileMetadataValid() check that stored dimensions are actually non-zero. If they're 0/0, it would return METADATA_BAD, triggering an automatic re-read and essentially automating the current workaround.

We probably want some level of caching here because if the file is permanently broken redoing this on every view would be very bad for performance. Probably a middle ground is needed.
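One possible middle ground is to cache failed extractions for a much shorter time than successful ones, so a permanently broken file is not re-processed on every view but a transient failure heals quickly. The Python sketch below only models that idea; the TTL values and all names are assumptions, not anything taken from MediaWiki.

```python
TTL_GOOD_SECONDS = 30 * 24 * 3600  # roughly a month for valid dimensions
TTL_BAD_SECONDS = 3600             # hypothetical: retry failures after an hour


def ttl_for(width, height):
    """Pick a cache lifetime based on whether extraction succeeded."""
    if width > 0 and height > 0:
        return TTL_GOOD_SECONDS
    return TTL_BAD_SECONDS


class DimensionCache:
    def __init__(self):
        self._entries = {}  # name -> ((width, height), expires_at)

    def get(self, name, extract, now):
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]
        width, height = extract()
        self._entries[name] = ((width, height), now + ttl_for(width, height))
        return (width, height)


results = iter([(0, 0), (927, 1462)])
cache = DimensionCache()
at_start = cache.get("Example.pdf", lambda: next(results), now=0)
soon_after = cache.get("Example.pdf", lambda: next(results), now=10)
an_hour_later = cache.get("Example.pdf", lambda: next(results), now=3600)
print(at_start, soon_after, an_hour_later)
```

In this model the failed result is served from cache for an hour, after which the next request retries the extraction.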

Historically ?action=purge would re-trigger metadata extraction, but that was removed. As a first step I would suggest bringing that back.

That seems a sensible thing to do.

What I really do not understand is why books which were OK at the time of upload fail some time later (sometimes weeks later).

Historically ?action=purge would re-trigger metadata extraction, but that was removed. As a first step I would suggest bringing that back.

Agreed that bringing back purge-triggered metadata re-extraction would be a good first step.

What I really do not understand is why books which were OK at the time of upload fail some time later (sometimes weeks later).

PdfHandler uses split metadata, so per-page dimensions for large PDFs are stored in BlobStore. The dimension cache has TTL_MONTH. When it expires and BlobStore has a transient failure during re-population, the per-page data silently fails to load while the page count loads fine (it's small enough to stay inline). The callback then returns a valid-looking array with false for every page's dimensions, and that gets cached for another month.

So: upload OK → cache expires after ~30 days → re-population hits a BlobStore hiccup → bad data is cached for another 30 days → file appears broken.
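The failure mode described above can be modelled in a few lines of Python. This is a sketch under assumptions, not the PdfHandler code: the function names, the entry layout and the use of `OSError` for a transient storage failure are all hypothetical.

```python
def build_dimension_entry(page_count, load_page_sizes):
    """Model of the cache callback: count is inline, sizes come from a blob."""
    try:
        sizes = load_page_sizes()
    except OSError:
        sizes = {}  # the failure is swallowed silently
    return {
        "count": page_count,
        "pages": {n: sizes.get(n, False) for n in range(1, page_count + 1)},
    }


def is_safe_to_cache(entry):
    """A possible guard: only cache when every page has real dimensions."""
    return entry["count"] > 0 and all(entry["pages"].values())


def failing_blob_store():
    raise OSError("transient blob store failure")


broken = build_dimension_entry(3, failing_blob_store)
healthy = build_dimension_entry(1, lambda: {1: (733, 954)})
print(broken, is_safe_to_cache(broken))
print(healthy, is_safe_to_cache(healthy))
```

The broken entry still looks like a normal array with a page count, which is why it would be cached unless something like the guard rejects it.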

Historically ?action=purge would re-trigger metadata extraction, but that was removed. As a first step I would suggest bringing that back.

That seems a sensible thing to do.

@aaron Looks like this was removed as part of T132921 / 9120ee007ae32 . Do you have thoughts on reinstating the metadata clearing on purge?

Edit: I may be wrong about this. It looks like metadata still gets updated on purge for PDFs if the metadata is invalid. Sometimes, anyway, but not always.

What I really do not understand is why books which were OK at the time of upload fail some time later (sometimes weeks later).

The problem happened to me less than 20 minutes ago while I was proofreading File:Lacour - Lettres de Laetitia et de Ludovic, 1834.pdf on the French Wikisource.

I believe a relatively safe fix would be to make PdfHandler::isFileMetadataValid() check that stored dimensions are actually non-zero. If they're 0/0, it would return METADATA_BAD, triggering an automatic re-read and essentially automating the current workaround.

We probably want some level of caching here because if the file is permanently broken redoing this on every view would be very bad for performance. Probably a middle ground is needed.

Don't we already effectively do this, without a cache?

		if ( !isset( $data['pages'] ) ) {
			return self::METADATA_BAD;
		}

The dimensions are per page, so without pages there are no dimensions. This also explains why the purge on Commons often DOES work: it marks the metadata as BAD and refetches it, and then getDimensionsInfo only caches the result if there is a page count (the check is if ( !$data || !isset( $data['Pages'] ) )). If there is no page count, the callback returns false, and false is a special value that is never cached.

from fetchAndRegenerate, which feeds getWithSetCallback
// Callback yielded a cacheable value
 $value !== false && $ttl >= 0 ) &&
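The "false is never cached" rule can be modelled as follows. This is a simplified Python sketch of the behaviour described above, not the WANObjectCache implementation; the function signature and the plain dict used as a cache are assumptions.

```python
MONTH = 30 * 24 * 3600


def get_with_set_callback(cache, key, ttl, callback):
    """Return the cached value, or compute it and cache it unless it is False."""
    if key in cache:
        return cache[key]
    value = callback()
    if value is not False and ttl >= 0:
        cache[key] = value
    return value


cache = {}

# No page count: the callback returns False, so nothing is stored.
missing = get_with_set_callback(cache, "a.pdf", MONTH, lambda: False)
retried = get_with_set_callback(cache, "a.pdf", MONTH, lambda: {"count": 1})

# A page count with empty per-page data is not False, so it is stored.
hollow = get_with_set_callback(
    cache, "b.pdf", MONTH, lambda: {"count": 2, "pages": {1: False, 2: False}}
)
print(missing, retried, "b.pdf" in cache)
```

In this model a total failure recovers on the next request, while a partially populated result is kept for the full TTL.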

https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=imageinfo&titles=File%3ABulletin%20de%20la%20société%20des%20bibliophiles%20bretons%20et%20de%20l'histoire%20de%20Bretagne%20(8e%20année)%2C%201885.pdf&formatversion=2&iiprop=timestamp%7Cuser%7Cmetadata%7Cdimensions%7Cmediatype%7Csize

File:Bulletin de la société des bibliophiles bretons et de l'histoire de Bretagne (8e année), 1885.pdf is interesting. It has a page count in the DB, but it doesn't have the pages subarray with all the dimensions. This is very strange, as both are output by one and the same command. This seems non-recoverable at the moment, as the bad-metadata check doesn't account for this possibility (and I honestly don't understand what would cause it).
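A check that would catch this case might compare the page count with the number of per-page entries. The Python sketch below is only a model: the `Pages` and `pages` keys follow the names used in this discussion, but the value types and the function itself are assumptions.

```python
def has_consistent_pages(metadata):
    """True only if a page count exists and matches the pages subarray."""
    page_count = metadata.get("Pages")
    pages = metadata.get("pages")
    if not page_count or not isinstance(pages, dict):
        return False
    return len(pages) == int(page_count)


# Page count present, pages subarray missing (the case described above).
print(has_consistent_pages({"Pages": "2"}))
# Page count and per-page entries agree.
print(has_consistent_pages({"Pages": "2", "pages": {1: {}, 2: {}}}))
```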

https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=imageinfo&titles=File%3AAṛajnord-%20awetaranakan%20ekeghetsʻwoh%20andamnerun%20-%20or%20kě%20khōsi%20Hayastaneaytsʻ%20...%20(IA%20aajnordawetaran01unkngoog).pdf&formatversion=2&iiprop=timestamp%7Cuser%7Cmetadata%7Cdimensions%7Cmediatype%7Csize

File:Aṛajnord- awetaranakan ekeghetsʻwoh andamnerun - or kě khōsi Hayastaneaytsʻ ... (IA aajnordawetaran01unkngoog).pdf does have both Pages (pagecount) and the pages subarray (with dimensions). So not sure why this one isn't showing anything. It will be interesting to see if this one is fixed a month later.
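For anyone who wants to inspect other affected files the same way, the API Sandbox links above correspond to an `imageinfo` query. The Python sketch below only builds the request URL with the parameters visible in those links; the helper name and the example title are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def imageinfo_url(title):
    """Build an imageinfo query URL for one file title."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "imageinfo",
        "titles": title,
        "iiprop": "timestamp|user|metadata|dimensions|mediatype|size",
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(imageinfo_url("File:Example.pdf"))
```

In the response, the metadata section is where the page count and the per-page entries discussed here can be compared.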

Can we somehow actively expunge the cache entry when we purge?

I also notice that if there is no text extraction, both pdf and djvu happily store an empty array entry per page in their metadata. We should probably fix that and make that more efficient. (now T424094: PDFs and DJVU without textcontent pollute metadata with empty entries per page)

And now the dummy file trick doesn't work anymore. This is becoming nasty.
https://commons.wikimedia.org/wiki/File:Gandhi_-_The_Story_of_My_Experiments_With_Truth,_vol._1.pdf

We all realize it's annoying. But stating that over and over again isn't fixing this problem. If someone had the full overview and understanding of what was going on here, it would be fixed already.

Change #1276427 had a related patch set uploaded (by TheDJ; author: TheDJ):

[mediawiki/extensions/PdfHandler@master] Purge PDF dimensions cache when purging page

https://gerrit.wikimedia.org/r/1276427

I'd like to ask what was changed on 23 April. Following up on the change: nothing has changed with respect to this file.

The patch is still under code review and has not been deployed to production yet, so the behavior on Commons is unchanged for now. Once it passes review and gets deployed, purging a PDF file page should also clear the cached dimensions, which should help with recovery from this issue.
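As a rough illustration of the intended behaviour (not the patch itself), the Python model below shows a cache whose entry is dropped on purge so that the next read re-extracts the dimensions. The class, method names and file name are assumptions.

```python
class DimensionCache:
    """Model of a dimensions cache with explicit invalidation on purge."""

    def __init__(self, extract):
        self._extract = extract
        self._entries = {}

    def get(self, name):
        if name not in self._entries:
            self._entries[name] = self._extract(name)
        return self._entries[name]

    def purge(self, name):
        # Drop the cached dimensions so the next get() extracts again.
        self._entries.pop(name, None)


# First extraction fails (0 x 0), the second one succeeds.
results = iter([(0, 0), (927, 1462)])
cache = DimensionCache(lambda name: next(results))

before = cache.get("Example.pdf")
still_cached = cache.get("Example.pdf")
cache.purge("Example.pdf")
after = cache.get("Example.pdf")
print(before, still_cached, after)
```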

Change #1276427 merged by jenkins-bot:

[mediawiki/extensions/PdfHandler@master] Purge PDF dimensions cache when purging page

https://gerrit.wikimedia.org/r/1276427

Thanks a lot. I see that bots are adding automatic notifications here. Please post a note here when it is fully deployed.
We will see whether the patch solves this long-standing problem.

The patch will likely not fully solve the problem. However, it will hopefully allow for easier and more consistent resolution of affected files. If it does, that points to where the problem is located.

Likely there are also multiple problems.

The deploy date can be seen from the label on the ticket. This is the date of the first deploy. Wikimedia has a staggered deploy schedule running from Tuesday to Thursday every single week. This is called the 'train'. https://wikitech.wikimedia.org/wiki/Deployments/Train

I just purged the page now and all versions are showing correct dimensions (927 x 1,462). Patch seems to be working