
Store DjVu, PDF extracted text in a structured table instead of img_metadata
Open, Normal, Public

Description

When DjVu files contain text layers, we currently extract them and store them in the file's metadata blob, so that the text is available to extensions like ProofreadPage which can use it.

Unfortunately this *massively* increases the size of the file object -- which contains the uncompressed serialized metadata blob in memory -- leading to errors like T32751, running out of memory when loading a bunch of file objects at once in an API request.

In addition, it's a bit awkward to access the text from other places; things like search indexing (T8421) would benefit from having a more standardish place to get at extracted text, and this could also be used for other file formats.
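
To make the idea concrete, here is a minimal sketch of what a separate per-page text table could look like. It uses Python with SQLite purely for illustration; the table name `file_text`, its `ft_*` columns, and the helper functions are hypothetical and not part of any existing MediaWiki schema or API.

```python
import sqlite3

# Hypothetical schema: one row per (file, page), kept apart from img_metadata
# so that loading a file object never pulls in the extracted text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_text (
    ft_file_name TEXT    NOT NULL,
    ft_page      INTEGER NOT NULL,
    ft_text      TEXT    NOT NULL,
    PRIMARY KEY (ft_file_name, ft_page)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the text store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_pages(conn, file_name, pages):
    """Replace all stored text for file_name; pages is a list of strings, page 1 first."""
    with conn:  # one transaction: delete old rows, insert new ones
        conn.execute("DELETE FROM file_text WHERE ft_file_name = ?", (file_name,))
        conn.executemany(
            "INSERT INTO file_text (ft_file_name, ft_page, ft_text) VALUES (?, ?, ?)",
            [(file_name, n, text) for n, text in enumerate(pages, start=1)],
        )


def get_page_text(conn, file_name, page):
    """Return the text of a single page, or None if nothing is stored for it."""
    row = conn.execute(
        "SELECT ft_text FROM file_text WHERE ft_file_name = ? AND ft_page = ?",
        (file_name, page),
    ).fetchone()
    return row[0] if row else None
```

With a layout along these lines, a consumer such as ProofreadPage or a search indexer would fetch only the pages it needs, and the metadata blob would stay small regardless of how much text a file contains.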


Version: 1.20.x
Severity: normal

Details

Reference
bz30906

Event Timeline

bzimport raised the priority of this task to Normal. Nov 21 2014, 11:57 PM
bzimport set Reference to bz30906.
bzimport added a subscriber: Unknown Object (MLST).
brion created this task. Sep 14 2011, 10:34 PM

Changing deps from bug 6421 (DjVu-only) to bug 21062 (also notes PDF etc.), so we cover a wider space.

Perhaps (as an interim solution) we shouldn't be loading file metadata unless a method is called that specifically needs it. I imagine most of the time you don't need the metadata (OTOH, maybe you need it more nowadays, now that we check whether JPEGs need to be rotated).
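
As a rough illustration of that interim idea, the sketch below defers the metadata load until something actually asks for it. It is written in Python for brevity; the `LazyFile` class and the `load_metadata` callback are hypothetical stand-ins, not the real MediaWiki file classes.

```python
class LazyFile:
    """File object that only fetches its metadata blob on first access."""

    def __init__(self, name, load_metadata):
        self.name = name
        # load_metadata is an assumed callable: file name -> metadata dict.
        self._load_metadata = load_metadata
        self._metadata = None
        self._loaded = False

    @property
    def metadata(self):
        # Fetch once, then reuse; callers that never touch .metadata
        # never pay the memory cost of the blob.
        if not self._loaded:
            self._metadata = self._load_metadata(self.name)
            self._loaded = True
        return self._metadata


if __name__ == "__main__":
    def fake_loader(name):
        print(f"loading metadata for {name}")
        return {"pages": 2}

    files = [LazyFile(f"Scan{i}.djvu", fake_loader) for i in range(3)]
    # Listing names triggers no loads; only the explicit access below does.
    print([f.name for f in files])
    print(files[0].metadata)
```

This would not shrink the blob itself, so it only helps code paths that never need the metadata, such as API requests that list many files by name.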

GOIII added a subscriber: GOIII. Feb 22 2015, 5:47 PM
AuFCL added a subscriber: AuFCL. May 24 2015, 3:11 AM
Tgr updated the task description. Jul 17 2015, 9:18 AM
Tgr set Security to None.
Restricted Application added subscribers: Matanya, Aklapper. Jul 17 2015, 9:18 AM
Yann added a subscriber: Yann. Jul 17 2015, 8:46 PM
Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 6:34 PM
Restricted Application added a subscriber: Steinsplitter. Sep 4 2015, 6:34 PM
brion renamed this task from Store DjVu extracted text in a structured table instead of img_metadata to Store DjVu, PDF extracted text in a structured table instead of img_metadata. Oct 7 2016, 10:30 PM
Restricted Application added a project: Commons. Oct 7 2016, 10:30 PM
AuFCL removed a subscriber: AuFCL. Nov 21 2016, 7:36 PM
Restricted Application added a subscriber: Poyekhali. Nov 21 2016, 7:36 PM
Xover added a subscriber: Xover. Wed, Oct 23, 6:10 AM
Xover added a comment. Wed, Oct 23, 6:19 AM

Hmm. While DjVu and PDF (and, I think, TIFF) have explicit text layers, any kind of image can in principle contain text and could benefit from a structured way to store the OCR output as actual text. We have oodles of images-of-text in JPEG, PNG, etc. formats in addition to the "book" formats (DjVu, PDF). Even book scans are not infrequently uploaded as a couple hundred JPEGs gathered in a category (if we're lucky).

This would make all those images content-searchable, which, combined with structured data for the description page, would be an extremely powerful tool!