
Some DjVu files have too much metadata to fit in their database column
Open, Needs Triage, Public

Event Timeline

Yann created this task. · Apr 24 2018, 5:35 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Apr 24 2018, 5:35 AM
Yann updated the task description. (Show Details) · Apr 24 2018, 5:39 AM

Other DjVu files have the same problem. How can we get a complete list?

Restricted Application added a project: Multimedia. · View Herald Transcript · Apr 24 2018, 8:23 AM
Aklapper renamed this task from DjVu files with 0 × 0 pixels to Some DjVu files with a file size >100MB are shown as "0 × 0 pixels". · Apr 24 2018, 11:00 AM
Aklapper removed a project: MediaWiki-Uploading.

@Aklapper, most of the files in the list given by the link above are < 100 MB.

Aklapper renamed this task from Some DjVu files with a file size >100MB are shown as "0 × 0 pixels" to Some DjVu files are shown as "0 × 0 pixels". · Apr 24 2018, 11:33 AM
Yann added a comment. · Apr 24 2018, 1:45 PM

I downloaded a small one, https://commons.wikimedia.org/wiki/File:Pen_And_Pencil_Sketches_-_Volume_1.djvu, for testing.
This one is corrupted after page 17. It is supposed to have 356 pages, so it can't be only 1 MB.

Yann added a comment. · Apr 24 2018, 1:51 PM

Testing with another one: https://commons.wikimedia.org/wiki/File:Niva_1906-44.djvu
This one has pages 8 and 9 corrupted; the rest of the file looks OK.

Yann added a comment (edited). · Apr 24 2018, 7:04 PM

https://commons.wikimedia.org/wiki/File:Nouveau_Larousse_illustr%C3%A9,_1898,_IV.djvu is OK on Commons after being reuploaded. However, it still doesn't work on Wikisource, even after a purge and an edit there, so there are at least 2 different bugs.
I think there are even 3 issues, as according to @Reptilien.19831209BE1, the faulty version on Commons is identical to the working version; only the metadata was changed.

The issue seems to be that the metadata is too large for the MySQL configuration, causing query errors:

SELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,COALESCE( comment_img_description.comment_text, img_description ) AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,img_user,img_user_text,NULL AS `img_actor`,img_metadata  FROM `image` LEFT JOIN `image_comment_temp` `temp_img_description` ON ((temp_img_description.imgcomment_name = img_name)) LEFT JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = temp_img_description.imgcomment_description_id))   WHERE img_name IN ('Blue_pencil.svg','Wikidata-logo.svg','Wikisource-logo.svg','PD-icon.svg','Cyclopaedia,_Chambers_-_Volume_1.djvu','Cyclopaedia,_Chambers_-_Volume_2.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_1.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu');
ERROR 2020 (HY000): Got packet bigger than 'max_allowed_packet' bytes
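
For reference, something along these lines (an untested sketch against the standard MediaWiki schema) would show the configured limit and how large the metadata of the affected rows actually is:

-- Untested sketch: check the server-side packet limit and measure the
-- metadata size of the rows named in the failing query above.
SHOW VARIABLES LIKE 'max_allowed_packet';

SELECT img_name, LENGTH(img_metadata) AS metadata_bytes
FROM image
WHERE img_name IN ('Cyclopaedia,_Chambers_-_Volume_1.djvu',
                   'Cyclopaedia,_Chambers_-_Volume_2.djvu',
                   'Cyclopaedia,_Chambers_-_Supplement,_Volume_1.djvu',
                   'Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu')
ORDER BY metadata_bytes DESC;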

While max_allowed_packet could be increased a bit, the truth is that it is not wise to store more than 33 MB in a single database row, nor to serve that much content. A limit on metadata size should be set up, or the metadata should be stored elsewhere by the application. For me this is an unbreak-now, high-impact bug on the application side, and I can also point to another place where oversized metadata is creating issues with backups. Trimming metadata to, let's say, 10 MB would I think be a reasonable compromise (no data loss, as the metadata would still be available in the original file). Large metadata blobs have created bandwidth issues in the past, so I refuse to increase the buffers any further.

On the user side, I think a "rule" should be set up in which files with metadata larger than, let's say, 10 MB are rejected until there are better technical ways to deal with them.
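
As a rough illustration (an untested sketch, with the 10 MB threshold chosen purely as an example), the rows that such a limit would target could be enumerated like this:

-- Untested sketch: list image rows whose stored metadata exceeds an
-- illustrative 10 MB threshold. Note this is a full table scan, so on a
-- table the size of Commons' image table it would have to be run carefully
-- (offline or on a replica).
SELECT img_name, LENGTH(img_metadata) AS metadata_bytes
FROM image
WHERE LENGTH(img_metadata) > 10 * 1024 * 1024
ORDER BY metadata_bytes DESC;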

TheDJ added a subscriber: TheDJ. · Apr 25 2018, 9:58 AM

33MB ??? I wonder if djvu stores thumbnails in the metadata or something..

TheDJ renamed this task from Some DjVu files are shown as "0 × 0 pixels" to Some DjVu files have too much metadata to fit in their database column. · Apr 25 2018, 10:00 AM
jcrespo added subscribers: tstarling, Anomie.

@tstarling @Anomie I am involving Platform here, after asking my manager, because aside from the specific bug I think we have a general issue with the image table (file database metadata, T192866#4156291) and would like your experience. It is OK to refer me to another team if someone else is better suited to come up with a solution (set up a size limit + trim existing DB records?). We can set up a separate ticket if necessary.

Yann added a comment. · Apr 25 2018, 10:17 AM

Uploading an empty file, and then reverting, fixed the problem on Commons.

TheDJ added a comment (edited). · Apr 25 2018, 11:13 AM

Regarding the 33 MB: that is just some of those files. Many of them are simply corrupt. The ones that do hit that limit, however, do so mostly because we store the entire text of those documents (for later search engine indexing) in the metadata as well. That's not really what that field is for, of course...

Example output for https://commons.wikimedia.org/wiki/File:Nouveau_Larousse_illustré,_1898,_VI.djvu
The primary metadata itself is already 569617 characters (djvudump /Users/djhartman/Downloads/Nouveau_Larousse_illustré\,_1898\,_VI.djvu)

And then add to that the text that we parse out of it, another 18443512 characters...
djvutxt --detail=page /Users/djhartman/Downloads/Nouveau_Larousse_illustré\,_1898\,_VI.djvu

To that we add some XML wrapping, so the stored blob ends up at roughly 19 million characters in total; you can imagine that even with compression, that many characters will simply trigger limits.

Logic: https://github.com/wikimedia/mediawiki/blob/master/includes/media/DjVuImage.php#L246

I think large metadata could be kept as additional content (multi-content revisions, anyone?), while keeping the image table with only minimal metadata (just as an index). But I am not a developer, so I won't suggest the best way to proceed.

> I think large metadata could be kept as additional content (multi-content revisions, anyone?), while keeping the image table with only minimal metadata (just as an index). But I am not a developer, so I won't suggest the best way to proceed.

To quote MCR:

> Derived (possibly virtual) content: OCR text from DeJaVu and PDF files (as derived content). Works best if the upload history is managed using MCR. There are currently no plans to implement this.

Alternative ideas

  • decouple metadata into a separate blobstore (a rough sketch of what this could look like follows below)
  • don't store all the text content, and instead fetch it on demand (but then that API would need to be clearly marked as intended only for incidental use), and it might be slow...
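
To make the first idea concrete, here is a purely hypothetical sketch; the table and column names are invented for illustration and do not correspond to any existing MediaWiki schema:

-- Hypothetical sketch only: move large metadata out of the image table into a
-- side table keyed by file name, leaving image.img_metadata as a small summary.
CREATE TABLE image_metadata_blob (
  imb_name VARBINARY(255) NOT NULL PRIMARY KEY,  -- matches image.img_name
  imb_metadata LONGBLOB NOT NULL                 -- full (possibly compressed) metadata
);

-- The application would then fetch the large blob only when it is actually needed:
SELECT imb_metadata FROM image_metadata_blob WHERE imb_name = 'Example.djvu';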

Images in general have several issues. At the database level we already have T28741: Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys) and T67264: File storage uses (title,timestamp) as a unique key, but this is not unique, then the action API has T89971: ApiQueryImageInfo is crufty, needs rewrite and its many related tasks, and now this too.

For a short-term fix on the MediaWiki code side the Multimedia team probably knows more about the requirements in this area.

Long term, we should move this metadata storage elsewhere if we want to be able to store it. Is there a database-side limit on the size of data that can be put into ExternalStore, i.e. can it handle 33MB of data in one row? Otherwise, our options would seem to be either reparsing the metadata from the file on demand or storing it as a "file" in the file storage backend alongside the file itself.

@Anomie I think, for the short term, either making uploads fail or truncating metadata to, let's say, 10 MB, and documenting such a limitation, should be enough. While I am answering you, I probably really mean to talk to Multimedia, if you say they are the ones in the know.

I am not so worried right now about longer term solutions.

Referring to @MarkTraceur and @matthiasmullie for any insight on this issue.

This is the leading cause of Commons MediaWiki database errors. I think the errors for the existing records should be easy to fix by deleting the large row(s):

{
  "_index": "logstash-2018.04.26",
  "_type": "mediawiki",
  "_id": "AWMBIm8PpesmgM3luPXO",
  "_version": 1,
  "_score": null,
  "_source": {
    "server": "commons.wikimedia.org",
    "db_server": "10.64.48.150",
    "wiki": "commonswiki",
    "channel": "DBQuery",
    "type": "mediawiki",
    "error": "Lost connection to MySQL server during query (10.64.48.150)",
    "http_method": "POST",
    "@version": 1,
    "host": "mw1226",
    "shard": "s4",
    "sql1line": "SELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,COALESCE( comment_img_description.comment_text, img_description ) AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,img_user,img_user_text,NULL AS `img_actor`,img_metadata  FROM `image` LEFT JOIN `image_comment_temp` `temp_img_description` ON ((temp_img_description.imgcomment_name = img_name)) LEFT JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = temp_img_description.imgcomment_description_id))   WHERE img_name IN ('Blue_pencil.svg','Wikidata-logo.svg','Wikisource-logo.svg','PD-icon.svg','Cyclopaedia,_Chambers_-_Volume_1.djvu','Cyclopaedia,_Chambers_-_Volume_2.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_1.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu')   ",
    "fname": "LocalRepo::findFiles",
    "errno": 2013,
    "unique_id": "WuGRdgpAMD0AADmKcTIAAABC",
    "method": "Wikimedia\\Rdbms\\Database::makeQueryException",
    "level": "ERROR",
    "ip": "10.64.48.61",
    "mwversion": "1.32.0-wmf.1",
    "message": "LocalRepo::findFiles\t10.64.48.150\t2013\tLost connection to MySQL server during query (10.64.48.150)\tSELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,COALESCE( comment_img_description.comment_text, img_description ) AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,img_user,img_user_text,NULL AS `img_actor`,img_metadata  FROM `image` LEFT JOIN `image_comment_temp` `temp_img_description` ON ((temp_img_description.imgcomment_name = img_name)) LEFT JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = temp_img_description.imgcomment_description_id))   WHERE img_name IN ('Blue_pencil.svg','Wikidata-logo.svg','Wikisource-logo.svg','PD-icon.svg','Cyclopaedia,_Chambers_-_Volume_1.djvu','Cyclopaedia,_Chambers_-_Volume_2.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_1.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu')   ",
    "normalized_message": "{fname}\t{db_server}\t{errno}\t{error}\t{sql1line}",
    "url": "/w/api.php",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "reqId": "WuGRdgpAMD0AADmKcTIAAABC",
    "referrer": null,
    "@timestamp": "2018-04-26T08:47:03.000Z",
    "db_name": "commonswiki",
    "db_user": "***"
  },
  "fields": {
    "@timestamp": [
      1524732423000
    ]
  },
  "sort": [
    1524732423000
  ]
}

> This is the leading cause of Commons MediaWiki database errors. I think the errors for the existing records should be easy to fix by deleting the large row(s):

Don't delete the whole rows; that would lose the images entirely. Setting img_metadata or oi_metadata for those rows to the empty string should work.

Sure, sorry, that is what I meant. s/rows/metadata fields on those rows/.
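
For the record, the remediation being discussed would boil down to something like this (an untested sketch; the exact rows and threshold would of course be chosen by the DBAs, and since the original files keep their metadata it can be regenerated later, e.g. with the refreshImageMetadata.php maintenance script):

-- Untested sketch: blank the oversized metadata instead of deleting the rows.
UPDATE image
SET img_metadata = ''
WHERE LENGTH(img_metadata) > 10 * 1024 * 1024;  -- illustrative threshold

-- Old file versions have the equivalent field in the oldimage table:
UPDATE oldimage
SET oi_metadata = ''
WHERE LENGTH(oi_metadata) > 10 * 1024 * 1024;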

I am seeing lots of read timeout errors for the API due to this: /w/api.php?format=json&action=query&list=allpages&apmaxsize=0&aplimit=15&apnamespace=6. Unless someone has a better option, I will empty the mentioned field on Monday to prevent further errors.

Anomie added a comment (edited). · Apr 27 2018, 8:43 PM

> I am seeing lots of read timeout errors for the API due to this: /w/api.php?format=json&action=query&list=allpages&apmaxsize=0&aplimit=15&apnamespace=6. Unless someone has a better option, I will empty the mentioned field on Monday to prevent further errors.

That query shouldn't be touching the image table at all. I'd think it is timing out due to the more common issue of scanning through all rows in page with page_namespace = 6 (using the name_title index) and filtering for only those with page_len <= 0, of which there are probably extremely few, if any (enwiki doesn't seem to have any at all, for example).
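
For context, the query that request generates is roughly the following (reconstructed by hand from the API parameters, so treat it as an approximation rather than the exact generated SQL):

-- Approximate shape of the list=allpages query for the URL above.
SELECT page_namespace, page_title, page_id
FROM page
WHERE page_namespace = 6          -- apnamespace=6 (File:)
  AND page_len <= 0               -- apmaxsize=0
ORDER BY page_title               -- walks the name_title (page_namespace, page_title) index
LIMIT 16;                         -- aplimit=15, plus one row to detect continuation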

Ankry added a subscriber: Ankry. · May 12 2018, 9:31 PM
Vvjjkkii renamed this task from Some DjVu files have too much metadata to fit in their database column to xdeaaaaaaa. · Jul 1 2018, 1:14 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Wong128hk renamed this task from xdeaaaaaaa to Some DjVu files have too much metadata to fit in their database column. · Jul 1 2018, 6:09 AM
Wong128hk raised the priority of this task from High to Needs Triage.
Wong128hk updated the task description. (Show Details)
Wong128hk added a subscriber: Aklapper.
Krinkle added a subscriber: Krinkle.

The listed URLs still show 0 × 0 pixels, and the log shows entries similar to the one mentioned here, but I am not sure if it is the same issue.

Expectation (readQueryTime <= 5) by MediaWiki::main not met (actual: 6.7309379577637):
query: SELECT img_name,img_timestamp,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,COALESCE( comment_img_description.comment_text, img_description ) AS `img_description_text`,comment_img_descriptio [TRX#24550b]
#0 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/TransactionProfiler.php(228): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated()
#1 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1258): Wikimedia\Rdbms\TransactionProfiler->recordQueryCompletion()
#2 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1155): Wikimedia\Rdbms\Database->doProfiledQuery()
#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1655): Wikimedia\Rdbms\Database->query()
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/pager/IndexPager.php(368): Wikimedia\Rdbms\Database->select()
#5 /srv/mediawiki/php-1.32.0-wmf.20/includes/pager/IndexPager.php(225): IndexPager->reallyDoQuery()
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/pager/IndexPager.php(422): IndexPager->doQuery()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/specials/SpecialNewimages.php(108): IndexPager->getBody()
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPage.php(569): SpecialNewFiles->execute()
#9 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPageFactory.php(581): SpecialPage->run()
#10 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(288): MediaWiki\Special\SpecialPageFactory->executePath()
#11 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(868): MediaWiki->performRequest()
#12 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(525): MediaWiki->main()
#13 /srv/mediawiki/php-1.32.0-wmf.20/index.php(42): MediaWiki->run()

and

A connection error occurred. 
Query: SELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,COALESCE( comment_img_description.comment_text, img_description ) AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,img_user,img_user_text,NULL AS `img_actor`,img_metadata  FROM `image` LEFT JOIN `image_comment_temp` `temp_img_description` ON ((temp_img_description.imgcomment_name = img_name)) LEFT JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = temp_img_description.imgcomment_description_id))   WHERE img_name IN ('###','Cyclopaedia,_Chambers_-_Volume_1.djvu','Cyclopaedia,_Chambers_-_Volume_2.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_1.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu')   
Function: LocalRepo::findFiles
Error: 2013 Lost connection to MySQL server during query (10.192.32.167)

#2 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1655): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/filerepo/LocalRepo.php(316): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/filerepo/RepoGroup.php(209): LocalRepo->findFiles(array, integer)
#5 /srv/mediawiki/php-1.32.0-wmf.20/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php(106): RepoGroup->findFiles(array)
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(1587): ApiParsoidBatch->execute()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(531): ApiMain->executeAction()
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#9 /srv/mediawiki/php-1.32.0-wmf.20/api.php(87): ApiMain->execute()