
Some DjVu files not being rendered in commons (showing up as "0 × 0 pixels", despite the file size in MB being nonzero)
Closed, Resolved, Public

Description

I searched around and found a couple of similar bugs, but wasn't entirely sure they were describing the same problem.

I came across this issue while dealing with File:Congressional Record Volume 81 Part 3.djvu, but several additional examples have surfaced from various other discussions:

Note: This bug description initially listed several files that also failed to generate thumbnails, but further investigation revealed that in those cases the actual files were corrupt (i.e. truncated or missing pages):

Event Timeline

waldyrious updated the task description.
waldyrious raised the priority of this task to Needs Triage.
waldyrious added a subscriber: waldyrious.
Restricted Application added subscribers: Steinsplitter, Aklapper. Aug 1 2015, 4:46 PM
waldyrious set Security to None.
waldyrious updated the task description. Aug 1 2015, 4:49 PM

Have you tried to locally open the files downloaded from Wikimedia Commons?
If so, with which specific application did you succeed in doing that?

To me all those files look broken:

For https://commons.wikimedia.org/wiki/File:Ten_Years_Later.djvu :

$:andre\> ghostscript Ten_Years_Later.djvu 
GPL Ghostscript 9.16 (2015-03-30)
Error: /undefined in AT&TFORM
Operand stack:

Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1951   1   3   %oparray_pop   1950   1   3   %oparray_pop   1934   1   3   %oparray_pop   1820   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--
Dictionary stack:
   --dict:1183/1684(ro)(G)--   --dict:0/20(G)--   --dict:78/200(L)--
Current allocation mode is local
Current file position is 9
GPL Ghostscript 9.16: Unrecoverable error, exit code 1

For https://commons.wikimedia.org/wiki/File:Congressional_Record_Volume_81_Part_3.djvu :

$:andre\> ghostscript Congressional_Record_Volume_81_Part_3.djvu 
GPL Ghostscript 9.16 (2015-03-30)
Error: /undefined in AT&TFORMC
Operand stack:

Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1951   1   3   %oparray_pop   1950   1   3   %oparray_pop   1934   1   3   %oparray_pop   1820   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--
Dictionary stack:
   --dict:1183/1684(ro)(G)--   --dict:0/20(G)--   --dict:78/200(L)--
Current allocation mode is local
Current file position is 11
GPL Ghostscript 9.16: Unrecoverable error, exit code 1

For https://commons.wikimedia.org/wiki/File:Niva_1906-45.djvu

$:andre\> ghostscript Niva_1906-45.djvu 
GPL Ghostscript 9.16 (2015-03-30)
Error: /undefined in AT&TFORM
Operand stack:
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1951   1   3   %oparray_pop   1950   1   3   %oparray_pop   1934   1   3   %oparray_pop   1820   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--
Dictionary stack:
   --dict:1183/1684(ro)(G)--   --dict:0/20(G)--   --dict:78/200(L)--
Current allocation mode is local
Current file position is 9
GPL Ghostscript 9.16: Unrecoverable error, exit code 1

I don't have tools to manipulate djvu files locally, but I've asked the uploader of "Congressional Record Volume 81 Part 3.djvu" to try it. Meanwhile, I searched for similar issues and found:

I wonder if any of these may be related to the changes made in T96360 (or similar ones). Pinging @aaron, @faidon, @GWicke, @GOIII for possible insights.

I wonder if any of these may be related to the changes made in T96360 (or similar ones).

Does the upload time of the mentioned files correspond with the incident time?

I can only confirm that it's the case for File:Congressional_Record_Volume_81_Part_3.djvu, but considering the upload times, I'd say it's true for File:Ten_Years_Later.djvu too.

Another possibly relevant fact: all of these files, whether they work or not in commons, show up as 0x0 pixels in wikisource. The only one that doesn't is Ten Years Later 2.djvu, but that's clearly a different file (i.e. created with a different process / parameters) than the original one. The uploader doesn't seem to have an account here, so I sent her an email to see if she can enlighten us :)

The uploader doesn't seem to have an account here, so I sent her an email to see if she can enlighten us :)

Any feedback?

Restricted Application added a project: Multimedia. Aug 31 2015, 8:16 AM

@Aklapper, no on both counts :( User:Clockery, in particular, seems to have wound down wiki activity, coming to a complete stall 3 months ago, so for now that path is a dead end. @Ernest-Mtl hasn't replied to my request yet, although he was very responsive throughout the previous steps to upload the Congressional Records document. For convenience, I'll quote my comment to him below:

Can you download the 0x0 djvu file from wikimedia commons and open it?
It seems that it may be corrupt, according to a comment in the bug I linked above.
If your local copy opens, but the one downloaded from commons doesn't, then we'll know it's a problem in the upload step.
Alternatively, you could upload the local version to some other web host (google drive, dropbox, mega...) and share the link so others can do the debugging.

Does that seem like it would be helpful? If so, I can re-ping him (or maybe he'll get a notification for being mentioned in this thread).

Apart from that, is there anything in particular I can try to help debug the problem?

Restricted Application added a subscriber: Matanya. Aug 31 2015, 6:27 PM
Jdforrester-WMF triaged this task as Low priority. Sep 4 2015, 6:49 PM
Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board.
Aklapper changed the task status from Open to Stalled. Sep 13 2015, 9:04 PM

As I explained in T107664#1502732, I do not think there is any bug here on the Wikimedia side.
Changing status to stalled as per last comment.

I've re-pinged @Ernest-Mtl on-wiki. Without access to the file before it is uploaded to commons, I can't test whether something goes wrong during the upload.

Note: the "0 × 0 pixels" message usually means that MediaWiki failed to extract metadata from the file (which could mean the file is corrupt, or that something is wrong with how MediaWiki extracts metadata).

Ghostscript handles PostScript and PDF, not DjVu, which is why you're getting that error.
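
The "AT&TFORM" token in the Ghostscript errors is simply the start of the DjVu container: DjVu files begin with the bytes "AT&T" followed by an IFF "FORM" chunk. As a rough first check before reaching for a DjVu-aware tool, a header sniff along the following lines could tell a DjVu from a PDF and flag an obviously short file. This is a minimal sketch, not anything MediaWiki does; the function names are made up for illustration, and the truncation test assumes the FORM length counts every byte after the 12-byte prefix.

```python
import struct

# Form types that mark a DjVu document inside the IFF container.
FORM_TYPES = {b"DJVU": "single page", b"DJVM": "multi-page"}


def sniff_djvu(header: bytes):
    """Return (kind, declared_length) if the first 16 bytes look like DjVu, else None."""
    if len(header) < 16 or header[:8] != b"AT&TFORM":
        return None
    kind = FORM_TYPES.get(header[12:16])
    if kind is None:
        return None
    # The chunk length is a 4-byte big-endian integer right after "FORM".
    (declared_length,) = struct.unpack(">I", header[8:12])
    return kind, declared_length


def looks_truncated(header: bytes, file_size: int) -> bool:
    """Assumption: the FORM length covers everything after the first 12 bytes."""
    info = sniff_djvu(header)
    return info is not None and file_size < 12 + info[1]
```

For a file on disk, read the first 16 bytes and pass os.path.getsize(path) as file_size. A file that passes this check can still be corrupt internally, so tools such as djvudump or djview remain the real test.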

The Niva_1906-43.djvu file appears corrupt at first glance (although I haven't looked at it in detail).

bawolff@Bawolff-L:~$ evince Niva_1906-43.djvu 

** (evince:5818): WARNING **: DjvuLibre error: [1-15108] Corrupted IFF file (Illegal chunk id).

** (evince:5818): WARNING **: DjvuLibre error: IFFByteStream.cpp:248

** (evince:5818): WARNING **: DjvuLibre error: [1-15108] Corrupted IFF file (Illegal chunk id).

** (evince:5818): WARNING **: DjvuLibre error: IFFByteStream.cpp:248
**
EvinceDocument:ERROR:/build/buildd-evince_2.30.3-2-i386-XSLfOu/evince-2.30.3/./libdocument/ev-document-misc.c:58:ev_document_misc_get_thumbnail_frame: assertion failed: (width_r >= 0 && height_r >= 0)
Aborted

djvudump has a similar complaint about illegal chunk id.

I only looked at that file. The other ones might be failing for different reasons.

I wonder if any of these may be related to the changes made in T96360 (or similar ones).

I would consider that unlikely.

Here are the results I got (empty lines and duplicate lines removed from output, for compactness):

$ evince Congressional_Record_Volume_81_Part_3-dedup.djvu
$ evince Congressional_Record_Volume_81_Part_3.djvu
$ evince Niva_1894-31.djvu
** (evince:13150): WARNING **: DjvuLibre error: Unexpected End Of File.
** (evince:13150): WARNING **: DjvuLibre error: DataPool.cpp:1768
** (evince:13150): WARNING **: DjvuLibre error: DjVuFile.cpp:2249 # <-- this error appears 16 times
$ evince Niva_1906-43.djvu
** (evince:13169): WARNING **: DjvuLibre error: [1-15108] Corrupted IFF file (Illegal chunk id).
** (evince:13169): WARNING **: DjvuLibre error: IFFByteStream.cpp:248
$ evince Niva_1906-45.djvu
** (evince:13222): WARNING **: DjvuLibre error: [1-15109] Corrupted IFF file (Mangled chunk boundaries).
** (evince:13222): WARNING **: DjvuLibre error: IFFByteStream.cpp:243
** (evince:13222): WARNING **: DjvuLibre error: [1-15114] IFFByteStream not ready for reading chunk.
** (evince:13222): WARNING **: DjvuLibre error: IFFByteStream.cpp:177
$ evince Ten_Years_Later.djvu
** (evince:13241): WARNING **: DjvuLibre error: Unexpected End Of File.
** (evince:13241): WARNING **: DjvuLibre error: DataPool.cpp:1768
** (evince:13241): WARNING **: DjvuLibre error: DjVuFile.cpp:2249 # <-- this error appears 129 times

Notes:

  • The Congressional Records file with the "dedup" suffix is the version uploaded 14:38, July 30, 2015 (257.95 MB), which removed some duplicate pages (present in the original Internet Archive uploads), while the unsuffixed one is the revision from 04:02, July 27, 2015.
  • The various versions of the three "Niva" files and of the "Ten Years Later" one were all identical (tested by file size and md5 hash), so they represent simple reuploads attempting to get the file to work with mediawiki.
  • The tests were performed with "GNOME Document Viewer 3.4.0" (output from evince --version)
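
The identity check mentioned in the notes (same file size and same md5 hash) can be reproduced with a few lines of Python. This is only a sketch of that comparison; the helper names are illustrative.

```python
import hashlib


def fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex) for a file, reading it in chunks."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def identical(paths):
    """True if every file in paths has the same size and md5 hash."""
    return len({fingerprint(p) for p in paths}) <= 1
```

Passing the downloaded revisions of one file to identical() reports whether the reuploads were byte-for-byte the same, which is what distinguishes a simple reupload from a regenerated file.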

Conclusions:

  • All the files I listed above do have DjVu errors, except for the Congressional Record one, at least according to evince.
  • It could be that another tool would reveal errors that evince didn't detect. (@Bawolff, do you have any suggestions of tools that could be used to test this?)
  • Or it could be a problem due to the large file size (what can be done to test this?).

Some further info: using djview, the errors seem to be clearer (they show up by scrolling down the thumbnail list):

$ djview -verbose Ten_Years_Later.djvu
djview: INFO: [1-12515] file://localhost/home/waldyrious/Ten_Years_Later.djvu/tenyearslater00duma_0393.djvu FAILED.
djview: ERROR: Unexpected End Of File.
djview: INFO: [1-12515] file://localhost/home/waldyrious/Ten_Years_Later.djvu/tenyearslater00duma_0394.djvu FAILED.
djview: ERROR: Unexpected End Of File.
(...snip...)
djview: INFO: [1-12515] file://localhost/home/waldyrious/Ten_Years_Later.djvu/tenyearslater00duma_0521.djvu FAILED.
djview: ERROR: Unexpected End Of File.
djview: INFO: [1-12515] file://localhost/home/waldyrious/Ten_Years_Later.djvu/tenyearslater00duma_0522.djvu FAILED.
djview: ERROR: Unexpected End Of File.

It appears that the Ten_Years_Later.djvu document contains 521 pages, but pages 393-522 are missing (maybe the file was truncated?), which corresponds precisely to the 129 times the ** (evince:13241): WARNING **: DjvuLibre error: DjVuFile.cpp:2249 appeared in the evince output for this file.

We get the same for the Niva_1894-31.djvu file -- out of 24 pages, pages 8-24 are missing:

$ djview -verbose Niva_1894-31.djvu
djview: INFO: [1-12515] file://localhost/home/waldyrious/Niva_1894-31.djvu/Niva-1894-31_07_0001.djvu FAILED.
djview: ERROR: Unexpected End Of File.
djview: INFO: [1-12515] file://localhost/home/waldyrious/Niva_1894-31.djvu/Niva-1894-31_08_0001.djvu FAILED.
djview: ERROR: Unexpected End Of File.
(...snip...)
djview: INFO: [1-12515] file://localhost/home/waldyrious/Niva_1894-31.djvu/Niva-1894-31_22_0001.djvu FAILED.
djview: ERROR: Unexpected End Of File.
djview: INFO: [1-12515] file://localhost/home/waldyrious/Niva_1894-31.djvu/Niva-1894-31_23_0001.djvu FAILED.
djview: ERROR: Unexpected End Of File.

In Niva_1906-43.djvu, page 15 (of 20) renders incorrectly:

$ djview -verbose Niva_1906-43.djvu 
djview: ERROR: [1-15108] Corrupted IFF file (Illegal chunk id).
djview: ERROR: [1-15108] Corrupted IFF file (Illegal chunk id).
djview: INFO: [1-12515] file://localhost/home/waldyrious/Niva_1906-43.djvu/niva-06-43_14_0001.djvu FAILED.
djview: ERROR: Unexpected End Of File.

Similarly, for Niva_1906-45.djvu, it's pages 12 and 13 that fail:

$ djview -verbose Niva_1906-45.djvu 
djview: ERROR: [1-15109] Corrupted IFF file (Mangled chunk boundaries).
djview: ERROR: [1-15109] Corrupted IFF file (Mangled chunk boundaries).
djview: INFO: [1-12515] file://localhost/home/waldyrious/Niva_1906-45.djvu/niva-06-45_11_0001.djvu FAILED.
djview: ERROR: [1-15413] Corrupted file: JB2 image dimension is zero.
djview: ERROR: [1-15114] IFFByteStream not ready for reading chunk.
djview: ERROR: [1-15114] IFFByteStream not ready for reading chunk.
djview: INFO: [1-12515] file://localhost/home/waldyrious/Niva_1906-45.djvu/niva-06-45_12_0001.djvu FAILED.
djview: ERROR: [1-12526] DejaVu decoder: a DJVU or IW44 image was expected.

In contrast, neither of the Congressional Records files generates any warning (and I scrolled through the thumbnail list to ensure every single one of the ~1200 pages got rendered):

$ djview -verbose Congressional_Record_Volume_81_Part_3.djvu 
$ djview -verbose Congressional_Record_Volume_81_Part_3-dedup.djvu 
$ 
Rillke added a subscriber: Rillke. Oct 26 2015, 9:30 PM

And we have a new one: File:Толковый словарь. Том 2(1) (Даль 1905).djvu.

For Congressional_Record_Volume_81_Part_3.djvu

char_length(img_metadata)  img_width  img_height
11013684                   3264       4416

Metadata look fine, at least syntactically. Does anyone have an idea how to debug this? "No thumbnail" -- MediaWiki knows that no thumbnail exists or will exist. This differs from most other files whose thumbnails fail, where generation is at least attempted, the generation tool fails, and the image is still linked.

Hinote added a subscriber: Hinote. Edited Oct 26 2015, 10:08 PM

"No thumbnail" -- hehe, we do not really need that fu..n thumbnail ;-) The problem for us at Wikisource is that the pagelist tag produces an error on the Index page for such a file, and the Proofread pages also do not work properly, so we end up with a complete mess at Wikisource for such files... Especially if we try to upload (for local reasons, while working on the books) a new version of a book for which we already have many pages in progress. Please set the priority of this issue a bit higher (a request from the Russian Wikisource).

If you load the image thumbnail in the browser, usually it will return an html page with some sort of error.

Usually this means that either the file takes too much memory/time to render, or the file is invalid.

Tgr added a subscriber: Tgr. Oct 26 2015, 11:01 PM

I wonder if the output of the transform tools should be stored in the metadata somewhere (at least in the case of errors) and made available so that the reason for failure can be identified with less jumping through hoops. (Has a high chance of some kind of path disclosure vulnerability, but at least on WMF servers we don't care much.)

We could store the error in memcached I suppose (nowhere in the db really fits; img_metadata is supposed to be functionally defined by the file). But we might run into issues where the parser cache gets out of sync with the image error.

Vladis13 added a comment. Edited Oct 27 2015, 12:19 AM

Usually this means that either the file takes too much memory/time to render, or the file is invalid.

After uploading this file I waited long enough for it to render and reloaded the page, but it didn't help (screenshot).
This is a normal file to which OCR was added and which was saved normally via FineReader; the file also opens normally in the popular DjVu viewer from djvu.org (in 1st line). So the file itself is fine.

Also, my colleague from ru.wikisource made another version of the file in different software by merging two normal files from Wikimedia, and then uploaded it (see file history). But the problem is the same.

File:Толковый_словарь._Том_2(1)_(Даль_1905).djvu looks like it works now. Is that the file you are referring to? If you mean an old version, could you link directly to which version so I know which file we're talking about?

waldyrious added a comment. Edited Oct 27 2015, 11:44 AM

Yes, the version without the text layer was uploaded over it to allow the page images to be generated. This has been done with other files in the same situation (e.g. File:EB1911 - Volume 25.djvu), to at least provide a workaround until this issue is fixed. Any of the revisions showing "0 × 0" in the resolution exhibit the issue.

By the way, I've been trying to figure out this issue, and came across an interesting example: File:Dictionary of Greek and Roman Geography Volume II.djvu seems to kind of work on commons (nonzero resolution, one thumbnail generated but just for a cover page; no number of pages in the revisions table and no "go to page" dropdown); however, on wikisource it shows the same issue as the others mentioned in this thread, even though it's the exact same file, not a local copy. So I wonder if this is due to code or configuration differences between the setup of commons and that of wikisource. It's possible that the file itself is corrupt -- one would need to run djview on it to identify potential issues -- but that still wouldn't explain the rendering difference between commons and wikisource, so it's probably worth investigating. I tried a few other wikis (enwiki, meta and mediawiki) and they all have the same problems as wikisource.

Some potentially useful info from this discussion on wikisource:

I really don't get what is going on here; the same file as File:Vol118.djvu but with the text-layer completely removed (File:2vol118.djvu) seems to process and render just fine.

Note: both uploaded only locally to Wikisource; Wikisource API metadata request works ok for both; page numbers and thumbnails appear for the second, not the first. (It has been suggested that the issue is derived from excessively large metadata fields, hence this comparison)

And from this other discussion:

the current .php file dealing with DjVu "processing" is DjVuImage. Line 40 jumps out at me for starters.

That is a ~300MB memory limit. The djvutxt tool doesn't use anywhere near that much; 40M is adequate for processing all three versions of EB1911 volume 25 here.
Any chance we could get debugging logs from processing some of these files? There's a bunch of wfDebug calls in that code.
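
One way to check claims like "40M is adequate" outside MediaWiki is to run the tool under an explicit address-space cap and see whether it still succeeds. The following is a rough, Unix-only sketch using Python's resource module; the helper name is made up, and it does not reproduce how MediaWiki applies its own limit.

```python
import resource
import subprocess


def run_with_memory_limit(cmd, limit_bytes):
    """Run cmd with its virtual address space capped at limit_bytes (Unix only)."""

    def apply_limit():
        # Runs in the child just before exec; only the soft limit is lowered.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = limit_bytes if hard == resource.RLIM_INFINITY else min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(cmd, preexec_fn=apply_limit, capture_output=True, text=True)
```

For example, run_with_memory_limit(["djvutxt", "local_copy.djvu"], 40 * 1024 * 1024) followed by a look at returncode and stderr would show whether text extraction fits in roughly 40 MB; the file name is a placeholder for a locally downloaded copy.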

Hinote added a comment. Edited Oct 27 2015, 12:52 PM

Yes, the version without the text layer was uploaded over it to allow the page images to be generated.

Well, the actual history of this particular file is a bit different: it has a long history and was uploaded at a time when we could not bypass the 100 MB limit at Commons (or the uploader did not know how to bypass it). Anyway, the book had previously been divided into parts, and this file was simply the first part of it. In the meantime, we at Wikisource decided to work with a single file containing the whole set of pages, since the chunked upload option allows us to upload larger versions of the file. We also decided to keep the file's history at Commons and the existing Proofread pages created for this part of the book, so we are trying to upload new versions of this file instead of uploading it as a new file. Our new versions were compiled with 1) the whole set of pages and 2) yes, the OCR layer; we need both. Since we currently have problems with these new versions, the file was reverted to one of the working versions. And yes, any version in the history that shows '0 x 0' resolution exhibits the issue.

Vladis13 added a comment. Edited Oct 28 2015, 10:50 PM

I was hoping that today's bugfix for phabricator:T94562 (large PDF/DjVu files >100MB with large metadata >64K) would help, but it did not. I verified this by temporarily restoring both versions of the problematic File:Толковый словарь. Том 2(1) (Даль 1905).djvu.

Could you please raise the priority?

Ok, so for reference the version of the file you are referring to is https://upload.wikimedia.org/wikipedia/commons/archive/e/e9/20151028224055%21%D0%A2%D0%BE%D0%BB%D0%BA%D0%BE%D0%B2%D1%8B%D0%B9_%D1%81%D0%BB%D0%BE%D0%B2%D0%B0%D1%80%D1%8C._%D0%A2%D0%BE%D0%BC_2%281%29_%28%D0%94%D0%B0%D0%BB%D1%8C_1905%29.djvu - ( The version of File:Толковый_словарь._Том_2(1)_(Даль_1905).djvu from 2015-10-28T22:38:30 ).

Could you please raise the priority?

The priority is appropriate to the severity of the bug imo. Every bug is important to someone, but if every bug were "high" then priority would lose all meaning.

Locally I get:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 21692109 bytes) in /var/www/w/git/vendor/wikimedia/utfnormal/src/Validator.php on line 764

Maybe there is just so much ocr data, that the utf-8 normalization runs out of memory.

Maybe there is just so much ocr data, that the utf-8 normalization runs out of memory.

But why is the "hidden text" (or OCR data) even an issue when it comes just to thumbnail generation, dimension detection, page count, etc. on Commons? "Pulling" the hidden text only becomes relevant on Wikisource during initial page creation in the Page: namespace.

Is it because the OCR data is being treated like metadata and being "stored" as if both types of data are one and the same?

Yes. It's because our implementation for OCR data is super-hacky and it's treated in the same breath as per-page dimensions. If extracting OCR data fails, then extracting the width/height fails, which prevents MediaWiki from making thumbnails.

Note: That OCR data is also used by Search, not just wikisource.
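
To make the coupling concrete: if dimensions and OCR text are read in one step, a failure in the text step takes the dimensions down with it. A minimal sketch of the alternative, keeping the two steps separate, might look like this. It is written in Python purely for illustration and is not MediaWiki's code; both callables and the dictionary layout are hypothetical.

```python
def extract_metadata(read_dimensions, read_text):
    """Collect page dimensions first, then try the text layer independently.

    read_dimensions and read_text are hypothetical callables standing in for
    whatever actually parses the file; only the text step is allowed to fail.
    """
    meta = {"pages": read_dimensions(), "text": None, "text_error": None}
    try:
        meta["text"] = read_text()
    except (MemoryError, ValueError) as exc:
        # Record why the text layer is missing instead of discarding everything.
        meta["text_error"] = f"{type(exc).__name__}: {exc}"
    return meta
```

With this shape, a file whose text layer cannot be processed would still report its width, height, and page count, and the recorded error would say why the text is absent.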

The initial message mentioned File:Niva 1906-45.djvu, which is only 4 MB in size. Also, I think there are many other files that are larger than 100 MB and contain UTF-8 text, for example Asian files with hieroglyphs. So perhaps the cause is not the size of the UTF-8 data?

The different files have different root causes. I'm not sure why everyone is insisting on putting them in the same bug. It would be easier to follow if there were different bugs for the different files.

Niva_1906-45.djvu appears to just be totally broken. No other program seems to be able to open it. evince gives the following error:

bawolff@Bawolff-L:~$ evince Niva_1906-45.djvu 
** (evince:12036): WARNING **: DjvuLibre error: [1-15109] Corrupted IFF file (Mangled chunk boundaries).

** (evince:12036): WARNING **: DjvuLibre error: IFFByteStream.cpp:243

** (evince:12036): WARNING **: DjvuLibre error: [1-15114] IFFByteStream not ready for reading chunk.

** (evince:12036): WARNING **: DjvuLibre error: IFFByteStream.cpp:177

** (evince:12036): WARNING **: DjvuLibre error: [1-15109] Corrupted IFF file (Mangled chunk boundaries).

** (evince:12036): WARNING **: DjvuLibre error: IFFByteStream.cpp:243

** (evince:12036): WARNING **: DjvuLibre error: [1-15114] IFFByteStream not ready for reading chunk.

** (evince:12036): WARNING **: DjvuLibre error: IFFByteStream.cpp:177
**
EvinceDocument:ERROR:/build/buildd-evince_2.30.3-2-i386-XSLfOu/evince-2.30.3/./libdocument/ev-document-misc.c:58:ev_document_misc_get_thumbnail_frame: assertion failed: (width_r >= 0 && height_r >= 0)
Aborted
GOIII added a comment. Oct 29 2015, 2:41 AM

Niva_1906-45.djvu appears to just be totally broken. No other program seems to be able to open it. evince gives the following error:

Right. I can "open" that file in windows' Djview but positions 12 & 13 out of the 16 total 'pages' are indeed corrupt beyond recognition and thereby hosing the entire thing (regardless of it having any sort of hidden layer or not).

GOIII added a comment. Edited Oct 29 2015, 2:56 AM

>SNiP<

Yes. It's because our implementation for OCR data is super-hacky and it's treated in the same breath as per-page dimensions. If extracting OCR data fails, then extracting the width/height fails, which prevents MediaWiki from making thumbnails.

I'm guessing that won't change until there is a better approach to storing that 'OCR' data.

Note: That OCR data is also used by Search, not just wikisource.

I guess that is good news if your source doc was 'born digital' and not a 'scan' of something published pre-1923 like most Wikisource hostable works are (never mind if they're not in English to begin with as well).

The different files have different root causes. I'm not sure why everyone is insisting on putting them in the same bug. It would be easier to follow if there were different bugs for the different files.

I've split that issue out into

T117013: File:Толковый_словарь._Том_2(1)_(Даль_1905).djvu from 2015-10-28T22:38:30 fails to extract metadata/render. Popssibly OOM on utf-8 normalization

waldyrious updated the task description. Oct 29 2015, 12:09 PM

I've edited the task description to remove the examples where the files were actually corrupted (as I had previously noted above), and added the files which (to the best of my knowledge, please correct me if I'm wrong) exhibit the issue due to the same cause (large metadata field).

Change 249724 had a related patch set uploaded (by Rillke):
Parse huge XML metadata from DjVu images

https://gerrit.wikimedia.org/r/249724

Please get Gerrit to merge this... the sooner it gets rolled out, the sooner we'll know if all the mentioned cases are resolved or not.

With the deployment of 1.27.0-wmf5 to a few wikis, including mw.org, I tried to load https://www.mediawiki.org/wiki/File:Congressional_Record_Volume_81_Part_3.djvu. The number of pages and dimensions do appear, but the thumbnails aren't generated. The error is

Error creating thumbnail: No path supplied in thumbnail object.

Maybe this requires the file to be uploaded locally? Either way, tomorrow we should find out once wmf5 is deployed to commons.

GOIII added a comment. Nov 3 2015, 9:27 PM

To test the problematic version against the ProofreadPage extension, file also uploaded to testwiki2:
https://test2.wikipedia.org/wiki/File:T107664.djvu

Well the pages, dimensions AND thumbnails came in on the above just fine. So it does seem like tomorrow's roll out to the rest of the wikis should fix this pain in the azz bug once and for all!

GOIII added a comment. Nov 4 2015, 12:44 AM

Possible other Tasks that can be affected by tomorrow's update...

Add 'em if you know of any others.

With the deployment of 1.27.0-wmf5 to a few wikis, including mw.org, I tried to load https://www.mediawiki.org/wiki/File:Congressional_Record_Volume_81_Part_3.djvu. The number of pages and dimensions do appear, but the thumbnails aren't generated.

Yes - when you view a foreign file (not counting InstantCommons, which is different), the rendering of it happens on whichever wiki it's uploaded to.

Possible other Tasks that can be affected by tomorrow's update...

That one should not have been re-opened.

This one is not going to be fixed by tomorrow's deploy

This one is not going to be fixed

This is a dupe, but also not going to be fixed

This bug is probably going to be fixed. (Woo!)

This one is essentially this bug.

GOIII added a comment. Nov 4 2015, 4:01 AM

This one is not going to be fixed by tomorrow's deploy

I dunno - uploaded the Jan. 9 2015 commons version to File:T86611.djvu on test2.wikipedia.org and it seems fine there.

The only difference is the file names.

Hurray! The fix worked :) I've created Category:DjVu files with errors for the files that are themselves corrupted.

waldyrious closed this task as Resolved. Nov 4 2015, 9:27 PM