
Full text search index is corrupt (index rebuild ignores content model)
Closed, Resolved (Public)

Description

Searching for items whose names contain non-English letters, such as "Kimi Räikkönen", returns no results, although the letter "ü" is recognized. Searching for just "Kimi" does find Kimi Räikkönen, but the matched string is shown in escaped form: "Kimi R\u00e4ikk\u00f6nen".

A possible reason could be that the unserialization fails for the indexed document.
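
The symptom would be consistent with the indexer seeing the raw JSON serialization of the item instead of its text. A minimal Python sketch of that effect follows. It assumes a simple word tokenizer; `index_tokens` is a hypothetical stand-in, not the actual lsearch analyzer, and the one-field document is made up for illustration.

```python
import json
import re

def index_tokens(text):
    """Hypothetical stand-in for an analyzer: lowercase runs of word characters."""
    return [token.lower() for token in re.findall(r"\w+", text)]

# ASCII-only JSON output escapes non-ASCII letters as \uXXXX sequences.
raw_json = json.dumps({"label": "Kimi Räikkönen"})

print(raw_json)                        # {"label": "Kimi R\u00e4ikk\u00f6nen"}
print(index_tokens(raw_json))          # ['label', 'kimi', 'r', 'u00e4ikk', 'u00f6nen']
print(index_tokens("Kimi Räikkönen"))  # ['kimi', 'räikkönen']
```

Under this assumption a query for "Räikkönen" has no matching token in the index, while "Kimi" still matches, which fits the behaviour described here.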


Version: unspecified
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=45860

Details

Reference
bz42234

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:14 AM
bzimport set Reference to bz42234.

This is caused (or rather, fixed) by bug 41532, which is closed because the fix is in master now. Closing.

If the problem is still there in about two weeks (when the new version should have gone live), please re-open.

I'm not sure whether this problem is related, but if I search for "täsmennyssivu" (Finnish for "disambiguation page") I don't get any results. The search engine then asks "did you mean: u00e4smennyssivu", and searching for "u00e4smennyssivu" does give results for täsmennyssivu.

If I search for "Räikkönen", the search engine (http://www.wikidata.org/w/index.php?search=r%C3%A4ikk%C3%B6nen&title=Special%3ASearch) gives only one result: Ville Räikkönen. It does not find, e.g., Kimi Räikkönen.

And Kimi Räikkönen can be found if I search Kimi R\u00e4ikk\u00f6nen, but not if I search Kimi Räikkönen.

JSON is now being indexed again. Are we using an old version of the OAI extension?

The OAI extension is providing a flat text version for indexing: https://www.wikidata.org/w/index.php/Special:OAIRepository?verb=ListRecords&metadataPrefix=lsearch&from=2013-01-10T20:30:00Z

Was LuceneSearch changed to no longer use this?

The small municipality Höör (Q765434) in Sweden was updated, and it cannot be found in the search (http://www.wikidata.org/w/index.php?search=H%C3%B6%C3%B6r&title=Special%3ASearch). That could mean the indexes are not being updated, but it could also mean the search itself is broken.

A new item is "The Man Who Shook the Hand of Vicente Fernandez" (http://www.wikidata.org/w/index.php?title=Special%3ASearch&profile=default&search=The+Man+Who+Shook+the+Hand+of+Vicente+Fernandez&fulltext=Search) and that too can't be found.

The city of "Ålesund" (http://www.wikidata.org/w/index.php?search=%C3%85lesund&title=Special%3ASearch) can be found; that is an old item.

The city of "Göteborg" (http://www.wikidata.org/w/index.php?search=g%C3%B6teborg&title=Special%3ASearch) can also be found, and this too is an old item.

Seems to me that the index is broken.

Tim, could you take a look at this?

(In reply to comment #5)

The small municipality Höör (Q765434) in Sweden got updated and it is not possible to find it in the search (http://www.wikidata.org/w/index.php?search=H%C3%B6%C3%B6r&title=Special%3ASearch) that could mean the indexes are not updated, but note that it could also mean the search is broken also.

This still doesn't work.

A new item is "The Man Who Shook the Hand of Vicente Fernandez" (http://www.wikidata.org/w/index.php?title=Special%3ASearch&profile=default&search=The+Man+Who+Shook+the+Hand+of+Vicente+Fernandez&fulltext=Search) and that too can't be found.

This now works.

Another example reported on it.wiki is "Iván Moro". You have to use the search gadget to find it.

Tim: Did you have a chance to take a look at this?

The search index for wikidatawiki probably needs to be rebuilt.

Bash history and file modification timestamps on searchidx2 and searchidx1001 seem to indicate that the wikidatawiki index hasn't been rebuilt since November 14.

Thanks for investigating, Tim. Any chance you can fix this? Anything I can tell the community (who's rather unhappy about the search)?

ram wrote:

Looks like Tim fixed it -- timestamp on searchidx1001 for wikidatawiki is today:

cat ../status/wikidatawiki
#Last incremental update timestamp
#Fri Mar 01 03:42:21 UTC 2013
timestamp=2013-03-01T03\:41\:07Z

Many of the index files have a timestamp of yesterday or today.
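
For anyone who wants to check index staleness from that status file programmatically, here is a rough Python sketch. It assumes the file uses Java-properties-style escaping (the backslash-escaped colons suggest this, but that is an inference from the output above), and `last_update` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def last_update(status_text):
    """Return the 'timestamp' entry as a UTC datetime, or None if it is absent."""
    for line in status_text.splitlines():
        if line.startswith("#") or "=" not in line:
            continue  # skip comment lines and anything that is not key=value
        key, value = line.split("=", 1)
        if key.strip() == "timestamp":
            # Colons are backslash-escaped in the file; undo that before parsing.
            value = value.strip().replace("\\:", ":")
            parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    return None

status = "\n".join([
    "#Last incremental update timestamp",
    "#Fri Mar 01 03:42:21 UTC 2013",
    "timestamp=2013-03-01T03\\:41\\:07Z",
])
print(last_update(status).isoformat())  # 2013-03-01T03:41:07+00:00
```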

danny.leinad wrote:

(In reply to comment #13)

Thanks for investigating, Tim. Any chance you can fix this? Anything I can tell the community (who's rather unhappy about the search)?

Hi,
I would like to suggest postponing the deployment of Wikidata on projects like plwiki until this bug is fixed. This is a really important issue, and in my opinion it will create a negative impression of the new tool. On plwiki we still have trouble convincing the community of the advantages of Wikidata, and bugs like this won't help us.

Names like "Łódź" are impossible to search: http://www.wikidata.org/w/index.php?search=%C5%81%C3%B3d%C5%BA&title=Special%3ASearch
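
To rule out the request side, the test URL can be reproduced with a small Python sketch. `search_url` is a hypothetical helper; the base URL and parameters are copied from the links in this report. The output matches the URL above byte for byte, which suggests the query itself is correctly UTF-8 encoded and the problem lies in the index.

```python
from urllib.parse import urlencode

def search_url(term, base="http://www.wikidata.org/w/index.php"):
    """Build a Special:Search URL for the given term (hypothetical helper)."""
    return base + "?" + urlencode({"search": term, "title": "Special:Search"})

print(search_url("Łódź"))
# http://www.wikidata.org/w/index.php?search=%C5%81%C3%B3d%C5%BA&title=Special%3ASearch
```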

(In reply to comment #15)

Names like "Łódź" are impossible to search: http://www.wikidata.org/w/index.php?search=%C5%81%C3%B3d%C5%BA&title=Special%3ASearch

On it.wiki users were just told not to use Special:Search at all, because it's completely useless, and to rely on the search gadget (enabled by default on Vector) which is activated by clicking the arrow next to the search bar. You should probably do the same and forget the standard search: this helped a lot on it.wiki.

Make that last comment RT #4625

<notpeter> I have rebuilt the index from a fresh dump of wikidatawiki. this should hopefully fix the problem. if the problem persists, please re-open this ticket.

(In reply to comment #15)

Names like "Łódź" are impossible to search: http://www.wikidata.org/w/index.php?search=%C5%81%C3%B3d%C5%BA&title=Special%3ASearch

Still getting no result as of now. The other examples here seem to work.

Confirming that Łódź is still a problem for wikidata.org.
Reopening as per comment 19, though not sure if this is the same problem.

(In reply to comment #19)

<notpeter> I have rebuilt the index from a fresh dump of wikidatawiki. this
should hopefully fix the problem. if the problem persists, please re-open
this ticket.

Oh... how does rebuilding the index from a dump work? Which code does it use? Can it handle non-wikitext content at all? If not, it will index the JSON...

For the live updates, I have implemented the required support in the OAI extension, so OAI's lsearch output is not JSON but (generated) plain text. The same needs to be done when re-indexing based on dumps, I suppose. So far, I assumed that the rebuild would be using the same interface to access the data. If that is not the case, rebuilding the index might actually cause *more* breakage.
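
To illustrate the idea of generating plain text from the stored JSON (this is not the OAI extension's actual code, which is PHP), here is a minimal Python sketch. The field names `label`, `description` and `aliases` and the sample values are assumptions for illustration, and `entity_to_index_text` is a hypothetical name.

```python
import json

def entity_to_index_text(serialized):
    """Flatten a (hypothetical) serialized entity into plain text for indexing."""
    entity = json.loads(serialized)  # decoding resolves the \uXXXX escapes
    parts = []
    for field in ("label", "description"):
        parts.extend(entity.get(field, {}).values())
    for alias_list in entity.get("aliases", {}).values():
        parts.extend(alias_list)
    return "\n".join(parts)

# Made-up sample entity; only the label comes from this report.
serialized = json.dumps({
    "label": {"fi": "Kimi Räikkönen"},
    "description": {"en": "example description"},
    "aliases": {"en": ["example alias"]},
})
print(entity_to_index_text(serialized))
```

A dump-based rebuild would need an equivalent step before the text reaches the indexer; otherwise it indexes the escaped JSON.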

ram wrote:

Not sure exactly how notpeter did it but one way is to use the import-file()
function in puppet/files/lucene/lucene.jobs.sh. There is also an import-db()
function that dumps the DB to a file and runs the former function on that file.

It uses the Java class org.wikimedia.lsearch.importer.BuildAll. I don't yet know
this part of the code well enough to answer the other questions.

(In reply to comment #23)

Not sure exactly how notpeter did it but one way is to use the import-file()
function in puppet/files/lucene/lucene.jobs.sh. There is also an import-db()
function that dumps the DB to a file and runs the former function on that
file.

It uses the Java class org.wikimedia.lsearch.importer.BuildAll. I don't yet know
this part of the code well enough to answer the other questions.

We don't have any handling of non-wikitext content in Java, and I don't see how it could be added... we'd either have to create specialized dumps, or implement the entire content handler infrastructure in Java (including java versions of content handlers supplied by extensions), or not use dumps and always call the API.

None of the options sounds good :\

*** Bug 45860 has been marked as a duplicate of this bug. ***

A brief discussion on wikitech-l suggests using a special XML dump for this purpose, see http://www.gossamer-threads.com/lists/wiki/wikitech/340638

I filed that as bug 45983.

RT comment is "Nothing else for ops to do right now."
Tentatively assigning to Ram, though this needs more time (see comment 23).

Currently the index rewrite mechanism is being reworked. Until then, there's not much we will do here.

Also note that searching for entities by their label does work for the given examples. It is only the full text search that does not return appropriate results.

Is this still a problem now that we're using Cirrus on wikidatawiki?

The examples I could find all work. I'm closing it. If there are still issues please reopen.