
Inconsistent search results with language links
Closed, Declined (Public)

Description

When searching for "fiets" (Dutch for bicycle) on English Wikipedia, [[Bicycle]] is second in the list of results. But the displayed text is not always the same. Most of the time it shows the beginning of the article (which does not include that word or any similar words):

Bicycle
Bicycle (disambiguation) File:Marin bike. jpg | A mountain bike , a popular multi-use bicycle. A bicycle, also known as a bike, pushbike or ...
55 KB (7,627 words) - 02:20, 26 August 2011

But sometimes it displays the language links, which contain the word "fiets":

Bicycle
[[af:Fiets]] [[nl:Fiets]]
55 KB (8,105 words) - 02:20, 26 August 2011

The sample text and the word count are different.

Searching for translations of bicycle in other languages ("cykel", "fietse", "sykkel") does not usually find [[Bicycle]], unless that word is also used somewhere other than the language links, like German "Fahrrad" in the name of an image. So it looks like language links should not usually be included in the search.


Version: unspecified
Severity: normal
URL: http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=fiets

Details

Reference
bz30595

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 11:55 PM
bzimport set Reference to bz30595.
bzimport added a subscriber: Unknown Object (MLST).

ForoaW wrote:

Interlanguage links seem to be found in article space, not in category space. On Commons, the interlanguage links in categories are essential.
On en:wiki, searching for fiets returns article:Bicycle but not category:Bicycles.

Foroa

I suspect this should rather be moved to the Wikimedia>Lucene component?
Also of relevance, what will happen when inter(language)wikilinks are moved to Wikidata (in a few days)?

(In reply to comment #2)
I assumed the search indexing used the wikicode of the article. So when language links were moved from wikicode to Wikidata they would not be found by searching. But a search for "fiets" still finds [[Bicycle]]. This is still inconsistent with not finding [[Bicycle]] when searching for other language links ("cykel", "fietse", "sykkel").

Is it possible that the search database (lucene?) contains incorrect data that somehow connects "fiets" with article [[Bicycle]]?
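
To make the assumption above concrete, here is a toy sketch of an index built from raw wikicode compared with one built after the language links are stripped. It does not reflect how the real Lucene-based search works. The `LANGLINK` pattern, the `build_index` helper and the sample page text (including the image name `Fahrrad.jpg`) are all made up for illustration.

```python
import re

# Hypothetical pattern for interlanguage links such as [[nl:Fiets]].
# Real MediaWiki link parsing is more involved than this.
LANGLINK = re.compile(r"\[\[[a-z]{2,3}(?:-[a-z]+)*:[^\]\|]+\]\]")


def tokens(text):
    """Split text into lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def build_index(pages, strip_langlinks):
    """Map each token to the set of page titles whose text contains it."""
    index = {}
    for title, wikitext in pages.items():
        if strip_langlinks:
            # Drop language links before tokenizing.
            wikitext = LANGLINK.sub(" ", wikitext)
        for tok in tokens(wikitext):
            index.setdefault(tok, set()).add(title)
    return index


# Made-up sample text, loosely modelled on the report.
pages = {
    "Bicycle": (
        "A bicycle, also known as a bike. "
        "[[File:Fahrrad.jpg]] [[af:Fiets]] [[nl:Fiets]]"
    ),
}

raw_index = build_index(pages, strip_langlinks=False)
clean_index = build_index(pages, strip_langlinks=True)

print(raw_index.get("fiets"))      # page found via its language links
print(clean_index.get("fiets"))    # no hit once the links are stripped
print(clean_index.get("fahrrad"))  # still found via the image name
```

If the real index behaved like `clean_index`, a hit for "fiets" would have to come from somewhere other than the language links.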

(In reply to comment #1)
That is not what I have seen. Searching for interlanguage links usually finds neither articles nor pages in other namespaces (unless the word is also used in some other way on that page). Dutch "fiets" finding article [[Bicycle]] seems to be an exception.

(In reply to comment #3)

(In reply to comment #2)
I assumed the search indexing used the wikicode of the article. So when language links were moved from wikicode to Wikidata they would not be found by searching.

So this bug is no longer a problem, strictly speaking: results are no longer inconsistent because you can never get the interwiki links as the search snippet.

But a search for "fiets" still finds [[Bicycle]]. This is still inconsistent with not finding [[Bicycle]] when searching for other language links ("cykel", "fietse", "sykkel").

Is it possible that the search database (lucene?) contains incorrect data that somehow connects "fiets" with article [[Bicycle]].

From this particular article they were removed 3 days ago, so the index should be up to date; however, it's possible that "fiets" is the label of some link to the article: there's no reason to believe it's a mistake, on the contrary it's consistent with your previous observations about other languages.
If you find actual errors in search results, please file a separate bug.

(In reply to comment #5)

however, it's possible that "fiets" is the label of some link to the article: there's no reason to believe it's a mistake, on the contrary it's consistent with your previous observations about other languages.

The langlinks were the only instance of "fiets" in the raw text of the article. They were removed at 06:39, 20 February 2013. It has been just over 3 days since then, so the index may be slow to catch up.

ForoaW wrote:

My main complaint was that the interlanguage links in the categories are not found by the search engine. It still finds the associated article Bicycle (where fiets only appeared in the interlanguage links), God knows why, but not the associated category. I've seen that documented/discussed somewhere, but I can't find it again.

I've seen that all interlanguage links have been removed. As I've documented many thousands of categories with such links on Commons, I would like to know how they are set up now and what the search engine does with them.