Tables not being indexed
Closed, Resolved (Public)

Description

Author: jlemley

Description:
CirrusSearch does not appear to be indexing tables, not even into "auxiliary_text". In fact, the "auxiliary_text" field in my index appears to be empty for all entries, and pages whose content consists of just tables are not indexed at all; even the title is not there. 977e3f9 branch with Elasticsearch 1.3.2.


Version: master
Severity: normal

Details

Reference
bz71233

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:52 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz71233.
bzimport created this task. Sep 24 2014, 1:36 PM

It's certainly _supposed_ to index them in the auxiliary text. What is in the index for pages with just tables? Can you run ?action=cirrusdump on one? Can you null edit a page and see if anything interesting is logged when the index is updated?
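
For anyone repeating this check, here is a minimal Python sketch of it. It only builds the ?action=cirrusdump URL and inspects an already-parsed document body; the wiki address, the helper names, and the sample document are hypothetical, and the exact JSON wrapper that cirrusdump returns is not shown in this task, so the sketch assumes you have already extracted the field dictionary yourself.

```python
from urllib.parse import urlencode


def cirrusdump_url(index_php, title):
    """Build the URL that asks CirrusSearch to dump the indexed document for a page.

    index_php is the wiki's index.php address (hypothetical in the example below).
    """
    return index_php + "?" + urlencode({"title": title, "action": "cirrusdump"})


def empty_fields(source, fields=("auxiliary_text", "source_text")):
    """Return the names of fields that are missing or empty in a dumped document."""
    return [name for name in fields if not source.get(name)]


# Hypothetical document body: an empty auxiliary_text list and some wikitext
# in source_text, mirroring the symptom described in this task.
sample = {"auxiliary_text": [], "source_text": "{| ... |}"}

url = cirrusdump_url("http://wiki.example/index.php", "My Page")
missing = empty_fields(sample)
```

With the sample above, `missing` lists only `auxiliary_text`, which is the pattern to look for when tables are not reaching the auxiliary field.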

jlemley wrote:

Performing action=cirrusdump shows "auxiliary_text":[]

The actual table is indexed in the "source_text" field.

Also, it appears that the page *did* get indexed, I just couldn't find it in Sense by title search.

Nothing gets logged by Elasticsearch when the index runs (logging set to "DEBUG"). I also did a regular edit, then checked ?action=cirrusdump and confirmed that the new text is there (not in a table, of course).

I see that it's working on Wikipedia, so could it just be that my index is bad?

If the JSON comes back with "auxiliary_text":[] that means Cirrus is sending the table empty. Do you have any auxiliary text in the index at all? Maybe it needs tidy or something. What version of MediaWiki and PHP are you using?

jlemley wrote:

As far as I can tell auxiliary_text is completely empty in the index.

I'm on MW 1.23.3, PHP 5.3.10.

OK. It won't work properly right now with MW versions older than 1.24wmf10. That version introduced a change that lets the HtmlFormatter return the text it filtered out, which is how auxiliary text works for us. Let me see if I can work around that.

Change 162653 had a related patch set uploaded by Manybubbles:
Don't remove auxiliary text if mw is too old

https://gerrit.wikimedia.org/r/162653

I've uploaded a patch to Cirrus that should leave the table text in the "text" field if MediaWiki doesn't yet support what we need to build the auxiliary text properly.
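
The fallback described above can be illustrated with a short Python sketch. This is not the code from the Gerrit change (Cirrus itself is PHP); the function name, its parameters, and the boolean capability flag are all hypothetical, and the sketch only shows the general idea of keeping table text in "text" when the auxiliary split is unavailable.

```python
def build_fields(main_text, table_text, formatter_returns_removed):
    """Decide where table text goes in the indexed document.

    formatter_returns_removed is a hypothetical flag standing in for
    "this MediaWiki can hand back the text that the formatter filtered out".
    """
    if formatter_returns_removed:
        # Newer MediaWiki: table text can be split out into auxiliary_text.
        auxiliary = [table_text] if table_text else []
        return {"text": main_text, "auxiliary_text": auxiliary}
    # Older MediaWiki: leave the table text in "text" so it stays searchable.
    combined = " ".join(part for part in (main_text, table_text) if part)
    return {"text": combined, "auxiliary_text": []}


old_mw = build_fields("Intro", "cell A cell B", False)
new_mw = build_fields("Intro", "cell A cell B", True)
```

In the old-MediaWiki case the table contents end up in "text" and "auxiliary_text" stays empty, which matches the behavior reported after the patch.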

jlemley wrote:

Thanks for the quick turnaround! I confirmed that editing a page with tables now causes the table contents to be indexed into the "text" field. I will rebuild the index to take care of the rest.

Thanks again!

Change 162653 merged by jenkins-bot:
Don't remove auxiliary text if mw is too old

https://gerrit.wikimedia.org/r/162653