Store DjVu extracted text in a structured table instead of img_metadata


When DjVu files contain text layers, we currently extract the text and store it in the file's metadata blob, so that it's available to extensions like ProofreadPage.

Unfortunately this *massively* increases the size of the file object -- which holds the uncompressed serialized metadata blob in memory -- leading to errors like T32751, where an API request runs out of memory when loading a bunch of file objects at once.

In addition, it's a bit awkward to access the text from other places; things like search indexing (T8421) would benefit from having a more standard place to get at extracted text, and the same storage could also be used for other file formats.
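
As a rough illustration of the idea, here is a minimal sketch of a per-page text table kept separate from the metadata blob. It uses Python with an in-memory SQLite database purely for demonstration; the table name `file_text`, its columns, and the helpers `store_text` and `get_page_text` are all hypothetical and not part of any existing schema.

```python
import sqlite3

# Hypothetical schema: one row per page of extracted text, keyed by file name.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_text (
    ft_name TEXT NOT NULL,     -- file name (hypothetical column names)
    ft_page INTEGER NOT NULL,  -- 1-based page number
    ft_text TEXT NOT NULL,     -- text layer extracted from that page
    PRIMARY KEY (ft_name, ft_page)
)
"""


def store_text(conn, name, pages):
    """Replace the stored text for a file with one row per page."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM file_text WHERE ft_name = ?", (name,))
        conn.executemany(
            "INSERT INTO file_text (ft_name, ft_page, ft_text) VALUES (?, ?, ?)",
            [(name, i, text) for i, text in enumerate(pages, start=1)],
        )


def get_page_text(conn, name, page):
    """Fetch one page of text without loading any metadata blob."""
    row = conn.execute(
        "SELECT ft_text FROM file_text WHERE ft_name = ? AND ft_page = ?",
        (name, page),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_text(conn, "Example.djvu", ["First page text", "Second page text"])
print(get_page_text(conn, "Example.djvu", 2))
```

With a layout along these lines, a consumer such as a search indexer could query text by file and page, and file objects would no longer need to carry the text in memory.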

Version: 1.20.x
Severity: normal

bzimport set Reference to bz30906.
brion created this task. Via Legacy, Sep 14 2011, 10:34 PM
brion added a comment. Via Conduit, Sep 14 2011, 10:36 PM

Changing deps from bug 6421 (DjVu-only) to bug 21062 (which also mentions PDF etc.), so we cover a wider space.

Bawolff added a comment. Via Conduit, Sep 15 2011, 5:08 AM

Perhaps (as an interim solution) we shouldn't be loading file metadata unless a method is called that specifically needs it. I imagine most of the time you don't need the metadata (OTOH, maybe you need it more nowadays, since we check whether JPEGs need to be rotated).
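
For illustration only, the lazy-loading idea could look roughly like the sketch below. It is written in Python rather than MediaWiki's PHP, and the class `LazyFile`, the `load_metadata` callback, and `fake_loader` are hypothetical names invented for this example.

```python
class LazyFile:
    """Hypothetical file object that defers loading its metadata blob."""

    def __init__(self, name, load_metadata):
        self.name = name
        self._load_metadata = load_metadata  # callable: name -> metadata
        self._metadata = None
        self._loaded = False

    @property
    def metadata(self):
        # Load the (potentially large) blob only on first access.
        if not self._loaded:
            self._metadata = self._load_metadata(self.name)
            self._loaded = True
        return self._metadata


loads = []  # records which files actually had their metadata loaded


def fake_loader(name):
    """Stand-in for an expensive metadata fetch."""
    loads.append(name)
    return {"pages": 2}


files = [LazyFile(f"File{i}.djvu", fake_loader) for i in range(3)]
names = [f.name for f in files]     # touches no metadata
pages = files[0].metadata["pages"]  # loads metadata for one file only
```

In this sketch, listing many files costs nothing in metadata memory; only callers that actually read `metadata` pay for the load.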

Gilles added a project: Multimedia. Via Web, Nov 24 2014, 3:38 PM
Aklapper added a project: Wikisource. Via Web, Mar 10 2015, 4:15 PM
aaron added a subscriber: aaron. Via Web, May 15 2015, 5:42 PM

Related to T96360

Jdforrester-WMF moved this task to Backlog on the Multimedia workboard. Via Web, Sep 4 2015, 6:34 PM