
Some DjVu files have too much metadata to fit in their database column
Open, Needs Triage, Public

Event Timeline

Yann created this task. · Apr 24 2018, 5:35 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Apr 24 2018, 5:35 AM
Yann updated the task description. (Show Details) · Apr 24 2018, 5:39 AM

Other DjVu files have the same problem. How can we get a complete list?

Restricted Application added a project: Multimedia. · View Herald Transcript · Apr 24 2018, 8:23 AM
Aklapper renamed this task from DjVu files with 0 × 0 pixels to Some DjVu files with a file size >100MB are shown as "0 × 0 pixels". · Apr 24 2018, 11:00 AM
Aklapper removed a project: MediaWiki-Uploading.

@Aklapper, most of the files in the list given by the link above are < 100 MB.

Aklapper renamed this task from Some DjVu files with a file size >100MB are shown as "0 × 0 pixels" to Some DjVu files are shown as "0 × 0 pixels". · Apr 24 2018, 11:33 AM
Yann added a comment. · Apr 24 2018, 1:45 PM

I downloaded a small one, https://commons.wikimedia.org/wiki/File:Pen_And_Pencil_Sketches_-_Volume_1.djvu, for testing.
This one is corrupted after page 17. It is supposed to have 356 pages, so it can't be only 1 MB.

Yann added a comment. · Apr 24 2018, 1:51 PM

Testing with another one: https://commons.wikimedia.org/wiki/File:Niva_1906-44.djvu
This one has pages 8 and 9 corrupted; the rest of the file looks OK.

Yann added a comment (edited). · Apr 24 2018, 7:04 PM

https://commons.wikimedia.org/wiki/File:Nouveau_Larousse_illustr%C3%A9,_1898,_IV.djvu is OK on Commons after being reuploaded. However, it still doesn't work on Wikisource, even after a purge and an edit there, so there are at least 2 different bugs.
I think there are even 3 issues, as according to @Reptilien.19831209BE1, the faulty version on Commons is identical to the working version; only the metadata was changed.

The issue seems to be that the metadata is too large for the MySQL configuration, causing query errors:

SELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,COALESCE( comment_img_description.comment_text, img_description ) AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,img_user,img_user_text,NULL AS `img_actor`,img_metadata  FROM `image` LEFT JOIN `image_comment_temp` `temp_img_description` ON ((temp_img_description.imgcomment_name = img_name)) LEFT JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = temp_img_description.imgcomment_description_id))   WHERE img_name IN ('Blue_pencil.svg','Wikidata-logo.svg','Wikisource-logo.svg','PD-icon.svg','Cyclopaedia,_Chambers_-_Volume_1.djvu','Cyclopaedia,_Chambers_-_Volume_2.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_1.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu');
ERROR 2020 (HY000): Got packet bigger than 'max_allowed_packet' bytes
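
For reference, something along these lines (an untested sketch against the standard MediaWiki schema) would show the configured limit and how large the metadata of the affected rows actually is:

-- Untested sketch: check the server-side packet limit and measure the
-- metadata size of the rows named in the failing query above.
SHOW VARIABLES LIKE 'max_allowed_packet';

SELECT img_name, LENGTH(img_metadata) AS metadata_bytes
FROM image
WHERE img_name IN ('Cyclopaedia,_Chambers_-_Volume_1.djvu',
                   'Cyclopaedia,_Chambers_-_Volume_2.djvu',
                   'Cyclopaedia,_Chambers_-_Supplement,_Volume_1.djvu',
                   'Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu')
ORDER BY metadata_bytes DESC;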

While max_allowed_packet could be increased a bit, the truth is that it is not wise to store more than 33 MB in a single database row, nor to serve that much content. A limit on metadata size should be set up, or the metadata should be stored elsewhere by the application. For me this is an unbreak-now, high-impact bug on the application side, and I can also point to another place where oversized metadata is creating issues with backups. Trimming metadata to, let's say, 10 MB would I think be a reasonable compromise (no data loss, as the metadata would still be available in the original file). Large metadata blobs have created bandwidth issues in the past, so I refuse to increase the buffers any further.

On the user side, I think a "rule" should be set up in which files with metadata larger than, let's say, 10 MB are rejected until there are better technical ways to deal with them.
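
As a rough illustration (an untested sketch, with the 10 MB threshold chosen purely as an example), the rows that such a limit would target could be enumerated like this:

-- Untested sketch: list image rows whose stored metadata exceeds an
-- illustrative 10 MB threshold. Note this is a full table scan, so on a
-- table the size of Commons' image table it would have to be run carefully
-- (offline or on a replica).
SELECT img_name, LENGTH(img_metadata) AS metadata_bytes
FROM image
WHERE LENGTH(img_metadata) > 10 * 1024 * 1024
ORDER BY metadata_bytes DESC;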

TheDJ added a subscriber: TheDJ. · Apr 25 2018, 9:58 AM

33MB ??? I wonder if djvu stores thumbnails in the metadata or something..

TheDJ renamed this task from Some DjVu files are shown as "0 × 0 pixels" to Some DjVu files have too much metadata to fit in their database column. · Apr 25 2018, 10:00 AM
jcrespo added subscribers: tstarling, Anomie.

@tstarling @Anomie I am involving Platform here, after asking my manager, because aside from the specific bug I think we have a general issue with the image table (file database metadata, T192866#4156291) and would like your experience. It is OK to refer me to another team if someone else is better suited to come up with a solution (set up a size limit + trim existing DB records?). We can set up a separate ticket if necessary.

Yann added a comment. · Apr 25 2018, 10:17 AM

Uploading an empty file, and then reverting, fixed the problem on Commons.

TheDJ added a comment (edited). · Apr 25 2018, 11:13 AM

Regarding the 33 MB: that is just some of those files. Many of them are simply corrupt. The ones that do hit that limit, however, do so mostly because we store the entire text of those documents (for later search engine indexing) in the metadata as well. That's not really what that field is for, of course...

Example output for https://commons.wikimedia.org/wiki/File:Nouveau_Larousse_illustré,_1898,_VI.djvu
The primary metadata itself is already 569617 characters (djvudump /Users/djhartman/Downloads/Nouveau_Larousse_illustré\,_1898\,_VI.djvu)

And then add to that the text that we parse out of it, another 18443512 characters...
djvutxt --detail=page /Users/djhartman/Downloads/Nouveau_Larousse_illustré\,_1898\,_VI.djvu

To that we add some XML wrapping, so the stored blob ends up at roughly 19 million characters in total; you can imagine that even with compression, that many characters will simply trigger limits.

Logic: https://github.com/wikimedia/mediawiki/blob/master/includes/media/DjVuImage.php#L246

I think large metadata could be kept as additional content (multi-content revisions, anyone?), while keeping the image table with only minimal metadata (just as an index). But I am not a developer, so I won't suggest the best way to proceed.

> I think large metadata could be kept as additional content (multi-content revisions, anyone?), while keeping the image table with only minimal metadata (just as an index). But I am not a developer, so I won't suggest the best way to proceed.

To quote MCR:

> Derived (possibly virtual) content: OCR text from DeJaVu and PDF files (as derived content). Works best if the upload history is managed using MCR. There are currently no plans to implement this.

Alternative ideas

  • decouple metadata into a separate blobstore (a rough sketch of what this could look like follows below)
  • don't store all the text content, and instead fetch it on demand (but then that API would need to be clearly marked as intended only for incidental use), and it might be slow...
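
To make the first idea concrete, here is a purely hypothetical sketch; the table and column names are invented for illustration and do not correspond to any existing MediaWiki schema:

-- Hypothetical sketch only: move large metadata out of the image table into a
-- side table keyed by file name, leaving image.img_metadata as a small summary.
CREATE TABLE image_metadata_blob (
  imb_name VARBINARY(255) NOT NULL PRIMARY KEY,  -- matches image.img_name
  imb_metadata LONGBLOB NOT NULL                 -- full (possibly compressed) metadata
);

-- The application would then fetch the large blob only when it is actually needed:
SELECT imb_metadata FROM image_metadata_blob WHERE imb_name = 'Example.djvu';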

Images in general have several issues. At the database level we already have T28741: Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys) and T67264: File storage uses (title,timestamp) as a unique key, but this is not unique, then the action API has T89971: ApiQueryImageInfo is crufty, needs rewrite and its many related tasks, and now this too.

For a short-term fix on the MediaWiki code side the Multimedia team probably knows more about the requirements in this area.

Long term, we should move this metadata storage elsewhere if we want to be able to store it. Is there a database-side limit on the size of data that can be put into ExternalStore, i.e. can it handle 33MB of data in one row? Otherwise, our options would seem to be either reparsing the metadata from the file on demand or storing it as a "file" in the file storage backend alongside the file itself.

@Anomie I think, for the short term, either making uploads fail or truncating metadata to, let's say, 10 MB, and documenting such a limitation, should be enough. While I am answering you, I probably really mean to talk to Multimedia, if you say they are the ones in the know.

I am not so worried right now about longer term solutions.

Referring to @MarkTraceur and @matthiasmullie for any insight on this issue.

This is the leading cause of Commons MediaWiki database errors. I think the errors for the existing records should be easy to fix by deleting the large row(s):

{
  "_index": "logstash-2018.04.26",
  "_type": "mediawiki",
  "_id": "AWMBIm8PpesmgM3luPXO",
  "_version": 1,
  "_score": null,
  "_source": {
    "server": "commons.wikimedia.org",
    "db_server": "10.64.48.150",
    "wiki": "commonswiki",
    "channel": "DBQuery",
    "type": "mediawiki",
    "error": "Lost connection to MySQL server during query (10.64.48.150)",
    "http_method": "POST",
    "@version": 1,
    "host": "mw1226",
    "shard": "s4",
    "sql1line": "SELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,COALESCE( comment_img_description.comment_text, img_description ) AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,img_user,img_user_text,NULL AS `img_actor`,img_metadata  FROM `image` LEFT JOIN `image_comment_temp` `temp_img_description` ON ((temp_img_description.imgcomment_name = img_name)) LEFT JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = temp_img_description.imgcomment_description_id))   WHERE img_name IN ('Blue_pencil.svg','Wikidata-logo.svg','Wikisource-logo.svg','PD-icon.svg','Cyclopaedia,_Chambers_-_Volume_1.djvu','Cyclopaedia,_Chambers_-_Volume_2.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_1.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu')   ",
    "fname": "LocalRepo::findFiles",
    "errno": 2013,
    "unique_id": "WuGRdgpAMD0AADmKcTIAAABC",
    "method": "Wikimedia\\Rdbms\\Database::makeQueryException",
    "level": "ERROR",
    "ip": "10.64.48.61",
    "mwversion": "1.32.0-wmf.1",
    "message": "LocalRepo::findFiles\t10.64.48.150\t2013\tLost connection to MySQL server during query (10.64.48.150)\tSELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,COALESCE( comment_img_description.comment_text, img_description ) AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,img_user,img_user_text,NULL AS `img_actor`,img_metadata  FROM `image` LEFT JOIN `image_comment_temp` `temp_img_description` ON ((temp_img_description.imgcomment_name = img_name)) LEFT JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = temp_img_description.imgcomment_description_id))   WHERE img_name IN ('Blue_pencil.svg','Wikidata-logo.svg','Wikisource-logo.svg','PD-icon.svg','Cyclopaedia,_Chambers_-_Volume_1.djvu','Cyclopaedia,_Chambers_-_Volume_2.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_1.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu')   ",
    "normalized_message": "{fname}\t{db_server}\t{errno}\t{error}\t{sql1line}",
    "url": "/w/api.php",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "reqId": "WuGRdgpAMD0AADmKcTIAAABC",
    "referrer": null,
    "@timestamp": "2018-04-26T08:47:03.000Z",
    "db_name": "commonswiki",
    "db_user": "***"
  },
  "fields": {
    "@timestamp": [
      1524732423000
    ]
  },
  "sort": [
    1524732423000
  ]
}

> This is the leading cause of Commons MediaWiki database errors. I think the errors for the existing records should be easy to fix by deleting the large row(s):

Don't delete the whole rows; that would lose the images entirely. Setting img_metadata or oi_metadata for those rows to the empty string should work.

Sure, sorry, that is what I meant. s/rows/metadata fields on those rows/.
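
For the record, the remediation being discussed would boil down to something like this (an untested sketch; the exact rows and threshold would of course be chosen by the DBAs, and since the original files keep their metadata it can be regenerated later, e.g. with the refreshImageMetadata.php maintenance script):

-- Untested sketch: blank the oversized metadata instead of deleting the rows.
UPDATE image
SET img_metadata = ''
WHERE LENGTH(img_metadata) > 10 * 1024 * 1024;  -- illustrative threshold

-- Old file versions have the equivalent field in the oldimage table:
UPDATE oldimage
SET oi_metadata = ''
WHERE LENGTH(oi_metadata) > 10 * 1024 * 1024;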

I am seeing lots of read timeout errors for the API due to this: /w/api.php?format=json&action=query&list=allpages&apmaxsize=0&aplimit=15&apnamespace=6. Unless someone has a better option, I will empty the mentioned field on Monday to prevent further errors.

Anomie added a comment (edited). · Apr 27 2018, 8:43 PM

> I am seeing lots of read timeout errors for the API due to this: /w/api.php?format=json&action=query&list=allpages&apmaxsize=0&aplimit=15&apnamespace=6. Unless someone has a better option, I will empty the mentioned field on Monday to prevent further errors.

That query shouldn't be touching the image table at all. I'd think it is timing out due to the more common issue of scanning through all rows in page with page_namespace = 6 (using the name_title index) and filtering for only those with page_len <= 0, of which there are probably extremely few, if any (enwiki doesn't seem to have any at all, for example).
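
For context, the query that request generates is roughly the following (reconstructed by hand from the API parameters, so treat it as an approximation rather than the exact generated SQL):

-- Approximate shape of the list=allpages query for the URL above.
SELECT page_namespace, page_title, page_id
FROM page
WHERE page_namespace = 6          -- apnamespace=6 (File:)
  AND page_len <= 0               -- apmaxsize=0
ORDER BY page_title               -- walks the name_title (page_namespace, page_title) index
LIMIT 16;                         -- aplimit=15, plus one row to detect continuation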

Ankry added a subscriber: Ankry. · May 12 2018, 9:31 PM
Vvjjkkii renamed this task from Some DjVu files have too much metadata to fit in their database column to xdeaaaaaaa. · Jul 1 2018, 1:14 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Wong128hk renamed this task from xdeaaaaaaa to Some DjVu files have too much metadata to fit in their database column. · Jul 1 2018, 6:09 AM
Wong128hk raised the priority of this task from High to Needs Triage.
Wong128hk updated the task description. (Show Details)
Wong128hk added a subscriber: Aklapper.
Krinkle added a subscriber: Krinkle.

The listed URLs still show 0 × 0 pixels, and the log shows entries similar to the one mentioned here, but I am not sure if it is the same issue.

Expectation (readQueryTime <= 5) by MediaWiki::main not met (actual: 6.7309379577637):
query: SELECT img_name,img_timestamp,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,COALESCE( comment_img_description.comment_text, img_description ) AS `img_description_text`,comment_img_descriptio [TRX#24550b]
#0 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/TransactionProfiler.php(228): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated()
#1 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1258): Wikimedia\Rdbms\TransactionProfiler->recordQueryCompletion()
#2 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1155): Wikimedia\Rdbms\Database->doProfiledQuery()
#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1655): Wikimedia\Rdbms\Database->query()
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/pager/IndexPager.php(368): Wikimedia\Rdbms\Database->select()
#5 /srv/mediawiki/php-1.32.0-wmf.20/includes/pager/IndexPager.php(225): IndexPager->reallyDoQuery()
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/pager/IndexPager.php(422): IndexPager->doQuery()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/specials/SpecialNewimages.php(108): IndexPager->getBody()
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPage.php(569): SpecialNewFiles->execute()
#9 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPageFactory.php(581): SpecialPage->run()
#10 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(288): MediaWiki\Special\SpecialPageFactory->executePath()
#11 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(868): MediaWiki->performRequest()
#12 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(525): MediaWiki->main()
#13 /srv/mediawiki/php-1.32.0-wmf.20/index.php(42): MediaWiki->run()

and

A connection error occurred. 
Query: SELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,COALESCE( comment_img_description.comment_text, img_description ) AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,img_user,img_user_text,NULL AS `img_actor`,img_metadata  FROM `image` LEFT JOIN `image_comment_temp` `temp_img_description` ON ((temp_img_description.imgcomment_name = img_name)) LEFT JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = temp_img_description.imgcomment_description_id))   WHERE img_name IN ('###','Cyclopaedia,_Chambers_-_Volume_1.djvu','Cyclopaedia,_Chambers_-_Volume_2.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_1.djvu','Cyclopaedia,_Chambers_-_Supplement,_Volume_2.djvu')   
Function: LocalRepo::findFiles
Error: 2013 Lost connection to MySQL server during query (10.192.32.167)

#2 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1655): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/filerepo/LocalRepo.php(316): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/filerepo/RepoGroup.php(209): LocalRepo->findFiles(array, integer)
#5 /srv/mediawiki/php-1.32.0-wmf.20/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php(106): RepoGroup->findFiles(array)
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(1587): ApiParsoidBatch->execute()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(531): ApiMain->executeAction()
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#9 /srv/mediawiki/php-1.32.0-wmf.20/api.php(87): ApiMain->execute()