
CirrusSearch / ElasticSearch doesn't index PDFs
Closed, Invalid · Public

Description

Not sure if this is a bug or misconfiguration, but we have a small wiki running CirrusSearch 0.2 (c23ae6a) with ElasticSearch 2.4.4, and we can't get it to index PDFs.

We have tried removing the CirrusSearch/ElasticSearch database and reindexing but haven't had any luck.

Is this a bug, or a misconfiguration on our side?

Any help would be greatly appreciated.

Event Timeline

Restricted Application added projects: Discovery, Discovery-Search. Apr 25 2017, 10:29 AM
Restricted Application added a subscriber: Aklapper.

Have you installed the PdfHandler extension?
CirrusSearch does not directly support PDF.

We haven't, I will try that now. Are similar modules required for other formats (e.g. .doc, .docx)? Is any other configuration required?

Thanks for your help.

dcausse added a comment. Edited Apr 25 2017, 10:40 AM

This has been requested multiple times, but as far as I know there are no extensions that support Microsoft document formats in such a way that their content ends up being indexed by CirrusSearch (which is exactly what PdfHandler does for PDFs).

OK, thank you so much for the information and the very fast responses!

debt closed this task as Declined. Apr 27 2017, 5:04 PM
debt added a subscriber: debt.

This looks to be a good workaround for MediaWiki that has been discussed in this thread. Closing this ticket as declined.

demon changed the task status from Declined to Invalid. Apr 27 2017, 6:04 PM
demon added a subscriber: demon.

> This looks to be a good workaround for MediaWiki that has been discussed in this thread. Closing this ticket as declined.

It's not really a workaround, it's an actual dependency. MediaWiki core doesn't support PDF rendering -- so you need PdfHandler.

Cirrus can't index document types MediaWiki doesn't support :)

(Swapping declined for invalid, seems a little more correct)
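
For anyone landing on this task with the same problem, the dependency described in this thread might look roughly like the following LocalSettings.php fragment. This is only a sketch, not a confirmed configuration for the reporter's wiki: it assumes a MediaWiki version where PdfHandler can be loaded with wfLoadExtension (older releases used a require_once line instead), that PDF uploads are wanted, and that the pdftotext binary is on the web server's PATH.

```php
<?php
// Sketch only: settings are assumptions, adjust to your own installation.

// Load PdfHandler so that MediaWiki itself understands PDF files.
// CirrusSearch can only index text that MediaWiki extracts from a file.
wfLoadExtension( 'PdfHandler' );

// Allow PDF uploads (assumes uploads are already enabled on the wiki).
$wgFileExtensions[] = 'pdf';

// PdfHandler uses pdftotext to extract the text layer from PDFs.
// The bare command name assumes the binary is on PATH; use a full path otherwise.
$wgPdftoText = 'pdftotext';
```

PDFs uploaded before the extension was installed would probably need their metadata refreshed and the search index rebuilt (for example with CirrusSearch's forceSearchIndex.php maintenance script) before their text becomes searchable.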