Extract embedded text from PDF documents for search
Closed, ResolvedPublic
Assigned To

None

Authored By

	• brion
	Jun 24 2006, 8:50 AM

Description

PDF files may contain a machine-readable form of the text contained
in the represented document. It could be useful to extract this
text and include it in the search index for the file's description
page.

I'm pretty sure there are open-source tools for extracting text
data from PDFs out and about, but haven't looked into it.

Version: 1.7.x
Severity: enhancement

Details

Reference: bz6422

Related Objects

Status	Assigned	Task
Invalid	None	T43037 [DO NOT USE] PDF related bugs and enhancements (tracking)
Resolved	None	T8421 Extract embedded text from DjVu and PDF documents for search
Resolved	None	T8422 Extract embedded text from PDF documents for search
Resolved	• Deskana	T23061 Add uploaded file text and metadata from files to fulltext search set
Resolved	None	T23062 Interface to add more data/text fields for Lucene search engine (eg uploaded file text and metadata)
Resolved	Ladsgroup	T32906 Store DjVu, PDF extracted text in a structured table instead of img_metadata

Event Timeline

• bzimport raised the priority of this task to Low. Nov 21 2014, 9:16 PM

• bzimport added a project: MediaWiki-Search.

• bzimport set Reference to bz6422.

• bzimport added a subscriber: Unknown Object (MLST).

• brion created this task. Jun 24 2006, 8:50 AM

The PdfHandler extension extracts text using the 'pdftotext' utility if $wgPdftoText is set.

Currently the extracted text is stored in the metadata blob and isn't available for search, but it may be used by Extension:ProofreadPage.
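
For anyone who wants to try the extraction step outside MediaWiki, here is a minimal sketch in Python, not PdfHandler's actual code. It assumes the 'pdftotext' binary from Poppler or Xpdf is on the PATH; the function names are hypothetical. It relies on two documented behaviours of pdftotext: passing "-" as the output file writes the text to stdout, and each page ends with a form feed character.

```python
import subprocess


def build_pdftotext_command(pdf_path, binary="pdftotext"):
    """Build the argv list; '-' as the output file sends the text to stdout."""
    return [binary, str(pdf_path), "-"]


def split_pages(raw_text):
    """Split pdftotext output into per-page strings.

    pdftotext ends each page with a form feed, so the final split
    element is empty and is dropped.
    """
    pages = raw_text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [page.strip() for page in pages]


def extract_pages(pdf_path, binary="pdftotext"):
    """Run pdftotext and return a list with one string per page.

    Raises subprocess.CalledProcessError if pdftotext exits non-zero.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path, binary),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return split_pages(result.stdout)
```

Calling extract_pages("example.pdf") on a PDF with embedded text should give one string per page; a scanned PDF without a text layer would give empty strings, since pdftotext does no OCR.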

dchandler wrote:

@Brian: Thanks so much for posting this. I have desperately been trying to add the capability of searching within PDFs. I'm definitely a non-expert though and can generally only install extensions or make modifications that are well-documented.

Have you already implemented this on a wiki, or do you know anyone who has? I've seen it suggested that FileIndexer (http://www.mediawiki.org/wiki/Extension_talk:FileIndexer) may be another approach. Do you have any advice on which approach is easier to implement for a non-expert? Do you think that the Extension:ProofreadPage method might be easier or more stable than using the other extension?

Do you know of any step-by-step guides to doing this with pdftotext and ProofreadPage?

Thanks so much in advance for any suggestion or guidance you have.

dr.trigon wrote:

As mentioned in bug 6421 (comment #3), DrTrigonBot could do text extraction and store the text in a dedicated wiki page so that it is accessible to search. But since PdfHandler does text extraction as well, this should not be needed.

As I see it, we have everything needed:
1.) text extraction (PdfHandler or DrTrigonBot)
2.) indexing for search (see bug 6421)
...so as I understand it, we should be able to finish this and close the ticket/bug, or am I wrong? Could somebody comment on this?

Thanks and Greetings

I don't think there's anything left to do here; we index PDF/DjVu data in the new search.

• Deskana closed subtask T23061: Add uploaded file text and metadata from files to fulltext search set as Resolved. May 4 2017, 5:19 PM

Restricted Application added projects: Discovery-ARCHIVED, Discovery-Search. May 4 2017, 5:19 PM
