Add uploaded file text and metadata from files to fulltext search set
Closed, ResolvedPublic

Assigned To

Authored By

	• brooke
	Oct 8 2009, 5:52 PM

Description

We're starting to integrate text extraction for DjVu and PDF files (currently used by the ProofreadPage extension), but the extracted text is not yet exposed to search indexing.

This is also frequently requested for text document types such as .doc and .odf. Some extensions do this, but there is no clean interface to plug them into that all search backends can support.

Note that supporting Lucene search, whose index is updated separately, might require some additional attention.

Related bugs:

bug 6421 - search djvu file text
bug 6422 - search pdf file text
bug 13370 - search file metadata

Another interesting idea:

bug 18045 - search text of linked files (but if these are remote, that's much harder to handle!)

Things we need:

a clear interface on File for things that need to be fetched (EXIF metadata, page text)
a clear interface on the SearchEngine class for plugging additional info into updates
a way to expose additional searchable info to the Lucene search's updaters (maybe a plugin to the OAI interface to add extra data fields?)
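
As a rough illustration of the first two items, here is a minimal, language-neutral sketch in Python. It is not MediaWiki code: every name in it (`ExtractableFile`, `FakeDjvuFile`, `SearchUpdate`, the provider functions, and the `file_text` / `meta_*` field names) is hypothetical and only shows the shape such interfaces could take. The file exposes its extractable data through two methods, and the search update collects extra fields from registered providers without needing to know the file type.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


class ExtractableFile(Protocol):
    """Hypothetical stand-in for the 'clear interface on File'."""

    name: str

    def page_text(self) -> List[str]: ...

    def metadata(self) -> Dict[str, str]: ...


@dataclass
class FakeDjvuFile:
    """Toy file object; a real one would run an extractor on the upload."""

    name: str
    pages: List[str]
    exif: Dict[str, str] = field(default_factory=dict)

    def page_text(self) -> List[str]:
        return list(self.pages)

    def metadata(self) -> Dict[str, str]:
        return dict(self.exif)


# A provider maps a file to extra searchable fields.
FieldProvider = Callable[[ExtractableFile], Dict[str, str]]


class SearchUpdate:
    """Hypothetical stand-in for the plug-in point on the search update."""

    def __init__(self) -> None:
        self._providers: List[FieldProvider] = []

    def register(self, provider: FieldProvider) -> None:
        self._providers.append(provider)

    def build_document(
        self, title: str, wikitext: str, file: ExtractableFile
    ) -> Dict[str, str]:
        # Start from the usual page fields, then merge in provider fields.
        doc = {"title": title, "text": wikitext}
        for provider in self._providers:
            doc.update(provider(file))
        return doc


def file_text_provider(file: ExtractableFile) -> Dict[str, str]:
    # Join per-page text into one searchable field.
    return {"file_text": "\n".join(file.page_text())}


def file_metadata_provider(file: ExtractableFile) -> Dict[str, str]:
    # Prefix metadata keys so they cannot clash with core fields.
    return {f"meta_{key.lower()}": value for key, value in file.metadata().items()}
```

A backend that updates separately, as the Lucene search does, could consume the same merged document through whatever feed it already reads; how that feed would carry the extra fields is left open here, as it is in the list above.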

Version: unspecified
Severity: enhancement

Details

Reference: bz21061

Related Objects

Status	Subtype	Assigned	Task
Resolved	Feature	AnneT	T10738 Improve media (image) search display
Duplicate		None	T15370 Search media (images, videos, sounds, etc) by relevant metadata
Invalid		None	T43037 [DO NOT USE] PDF related bugs and enhancements (tracking)
Resolved		None	T8421 Extract embedded text from DjVu and PDF documents for search
Resolved		None	T8422 Extract embedded text from PDF documents for search
Resolved		• Deskana	T23061 Add uploaded file text and metadata from files to fulltext search set
Resolved		None	T23062 Interface to add more data/text fields for Lucene search engine (eg uploaded file text and metadata)
Resolved		Ladsgroup	T32906 Store DjVu, PDF extracted text in a structured table instead of img_metadata

Event Timeline

• bzimport raised the priority of this task to Low. Nov 21 2014, 10:51 PM

• bzimport added a project: MediaWiki-Search.

• bzimport set Reference to bz21061.

• bzimport added a subscriber: Unknown Object (MLST).

• brooke created this task. Oct 8 2009, 5:52 PM

test5555 wrote:

*Bug 21795 "camera categories" (proposal c would allow searching metadata through categories they generate)

dr.trigon wrote:

bug 6421 could finally be closed - thanks to everybody involved there!

Aklapper added a project: All-and-every-Wikisource. Mar 10 2015, 4:16 PM

Liuxinyu970226 removed a parent task: T37925: [DO NOT USE. Please use the Wikisource project] Wikisource related bugs and enhancements (tracking). Dec 23 2016, 12:10 PM

Liuxinyu970226 added a project: Community-Wishlist-Survey-2016. Jan 1 2017, 2:59 AM

Liuxinyu970226 removed a subscriber: • wikibugs-l-list.

Restricted Application added projects: Discovery-ARCHIVED, Discovery-Search. Jan 1 2017, 2:59 AM

Liuxinyu970226 removed a project: Community-Wishlist-Survey-2016. Jan 2 2017, 12:59 AM

Updates to [search file metadata] could possibly be done with the Structured Data on Commons work. We already have some, but more could probably be done. We might need more information first, though, to get a real sense of what is wanted here. (It's an old ticket.)

This task is about indexing the content of DjVu and PDF files, which, according to @EBernhardson, is now done for those file types. Accordingly, this is resolved. For more complex metadata, waiting for Structured Data on Commons, as @debt suggests, is the best course of action.
