Add uploaded file text and metadata to the fulltext search index


We're starting to integrate text extraction for djvu and pdf files -- currently used by the ProofreadPage extension -- but it's not yet exposed to search indexing.

This is also frequently requested for text document types like .doc and .odf. Some extensions do this already, but there's no clean interface to plug them in to that can be supported across all search backends.

Note that supporting the Lucene search, which updates separately, might require some additional attention.

Related bugs:

  • bug 6421 - search djvu file text
  • bug 6422 - search pdf file text
  • bug 13370 - search file metadata

Another interesting idea:

  • bug 18045 - search text of linked files (but if these are remote, that's much harder to handle!)

Things we need:

  • a clear interface on File for things that need to be fetched (exif metadata, page text)
  • a clear interface on the SearchEngine class for plugging additional info in to updates
  • a way to expose additional searchable info to the Lucene search's updaters (maybe a plugin to the OAI interface to toss in extra data fields?)
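
As a rough illustration of the first two items, here is a minimal, language-neutral sketch in Python. It is not MediaWiki's actual File or SearchEngine API: every name in it (SearchableFile, StubFile, SearchUpdater, register_field, build_update, file_text, file_metadata) is hypothetical and only shows the shape such an interface might take, with the file exposing what can be fetched and the search side accepting pluggable extra fields.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


class SearchableFile(Protocol):
    """Hypothetical File-side interface: what a search updater may fetch."""

    def get_metadata(self) -> Dict[str, str]: ...
    def get_page_texts(self) -> List[str]: ...


@dataclass
class StubFile:
    """Stand-in for an uploaded file; a real one would call a text extractor."""
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    pages: List[str] = field(default_factory=list)

    def get_metadata(self) -> Dict[str, str]:
        return dict(self.metadata)

    def get_page_texts(self) -> List[str]:
        return list(self.pages)


# A field provider maps a file to the text stored under one index field.
FieldProvider = Callable[[SearchableFile], str]


class SearchUpdater:
    """Hypothetical SearchEngine-side hook: backends only see a dict of fields."""

    def __init__(self) -> None:
        self._providers: Dict[str, FieldProvider] = {}

    def register_field(self, field_name: str, provider: FieldProvider) -> None:
        self._providers[field_name] = provider

    def build_update(self, page_text: str, file: SearchableFile) -> Dict[str, str]:
        # Start with the normal page text, then add every registered extra field.
        doc = {"text": page_text}
        for field_name, provider in self._providers.items():
            doc[field_name] = provider(file)
        return doc


def file_text(f: SearchableFile) -> str:
    """Join the extracted text of all pages into one searchable string."""
    return "\n".join(f.get_page_texts())


def file_metadata(f: SearchableFile) -> str:
    """Flatten metadata into 'key: value' lines, sorted by key."""
    return "\n".join(f"{k}: {v}" for k, v in sorted(f.get_metadata().items()))
```

Under these assumptions, an extension would call register_field once per extra field, and any backend would index whatever build_update returns. The same dict of fields is the kind of thing an external updater (such as the Lucene one) could be handed as extra data fields, though how that would be transported is left open here.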

Version: unspecified
Severity: enhancement

bzimport added a project: MediaWiki-Search. Via Conduit, Nov 21 2014, 10:51 PM
bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz21061.
brion created this task. Via Legacy, Oct 8 2009, 5:52 PM
bzimport added a comment. Via Conduit, Dec 15 2009, 1:04 AM

test5555 wrote:


  • bug 21795 - "camera categories" (proposal c would allow searching metadata through categories they generate)

bzimport added a comment. Via Conduit, Dec 29 2013, 10:00 PM

dr.trigon wrote:

bug 6421 could finally be closed - thanks to everybody involved there!

Aklapper added a project: Wikisource. Via Web, Tue, Mar 10, 4:16 PM
