
Search works incorrectly when the query contains words used as namespace names and a colon (:)
Closed, Resolved · Public

Description

Occurs on Russian Wikipedia, doesn't occur on English Wikipedia.

If the search query contains a word that happens to be a namespace name or alias, followed anywhere later by a colon (:), the search is performed in that namespace.

For instance,
Query: У холмов есть глаза: Начало (the Russian name of the graphic novel) (https://ru.wikipedia.org/w/index.php?title=Служебная:Поиск&profile=default&fulltext=Search&search=У+холмов+есть+глаза:+Начало&searchToken=7v64zbxyubf5yyw9dan118cma)
Expected result: Search is performed in the main namespace, with the possible outcome of the article about the novel being found.
Actual result: Search is performed in the "Участник" (User) namespace.

The obvious reason for this is that the word "у", which is a preposition in Russian, is an alias for the "Участник" namespace. Other examples of common words used as namespace aliases are "и" (and) and "к" (to). It is as if a search for "A Nightmare on Elm Street 2: Freddy's Revenge" on English Wikipedia were run in some other namespace because "A" was a namespace alias.

The position of the namespace name in the query is irrelevant: it can be in the middle as well as at the beginning.

This bug can be quite painful, as it effectively prevents using several frequent words in search queries, without the user having any clue as to what is going on. It is also a huge headache when trying to find something with the prefix: keyword, since that introduces a colon, and namespace names such as "Википедия" (Wikipedia) come up frequently in project discussions (for example, Википедия prefix:Шаблон: searches in the "Википедия" namespace instead of "Шаблон" (Template)).

The same is true for the suggestions bar: if one types "У холмов есть глаза:", one gets pages in the "Участник" namespace.

Event Timeline

Restricted Application added a subscriber: Aklapper.

This is very strange...
It does not seem to be limited to prefix: any query with a namespace name before a colon is run on that namespace.
E.g. This bug Участник is very Strange:search => performs a search for "search" in the Участник namespace, ignoring everything else.

Yes, you are totally right. I will change the task title.

Jack_who_built_the_house renamed this task from "prefix:" works incorrectly when parts of the query contain words used as namespace names to Search works incorrectly when the query contain words used as namespace names and a colon (:). Oct 17 2016, 12:23 AM
Jack_who_built_the_house renamed this task from Search works incorrectly when the query contain words used as namespace names and a colon (:) to Search works incorrectly when the query contains words used as namespace names and a colon (:).
Jack_who_built_the_house triaged this task as Medium priority.
Jack_who_built_the_house updated the task description.

The namespace mapping seems wrong on ruwiki:

"namespace": {
  "_timestamp": {},
  "properties": {
    "name": {
      "type": "string"
    }
  }
}

But it should be:

"namespace": {
  "dynamic": "false",
  "_all": {
    "enabled": false
  },
  "_timestamp": {},
  "properties": {
    "name": {
      "type": "string",
      "norms": {
        "enabled": false
      },
      "index_options": "docs",
      "analyzer": "near_match_asciifolding",
      "ignore_above": 5000
    }
  }
}

Basically, this causes the namespace text to be tokenized, allowing partial strings to match. near_match_asciifolding disables tokenization and forces a full match.
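The difference between the two behaviours can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CirrusSearch or Elasticsearch code: the namespace table is a small hypothetical subset, the function names are invented for the example, and the full-match variant only lowercases (ASCII folding is left out).

```python
import re

# Hypothetical subset of ruwiki namespace names and aliases (illustration only).
NAMESPACE_NAMES = {
    "Участник": 2,
    "У": 2,  # short alias for "Участник"
    "Википедия": 4,
    "Шаблон": 10,
}


def tokens(text):
    """Lowercase and split on non-word characters, roughly like a tokenizing analyzer."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def lookup_tokenized(prefix):
    """Tokenized field: a single shared token is enough to match a namespace."""
    wanted = set(tokens(prefix))
    for name, ns_id in NAMESPACE_NAMES.items():
        if wanted & set(tokens(name)):
            return ns_id
    return None


def lookup_full_match(prefix):
    """Untokenized field: the whole text before the colon must equal a namespace name."""
    normalized = prefix.strip().lower()
    for name, ns_id in NAMESPACE_NAMES.items():
        if normalized == name.lower():
            return ns_id
    return None


query = "У холмов есть глаза: Начало"
prefix = query.split(":", 1)[0]  # text before the first colon

print(lookup_tokenized(prefix))   # 2 (the token "у" matches the alias)
print(lookup_full_match(prefix))  # None (falls back to the default namespaces)
```

With the tokenized lookup, any query whose text before the colon contains "у", "Участник" or "Википедия" as a separate word is routed to that namespace, which matches the symptoms described in the task.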

I think this explains the bug, but I don't understand why ruwiki does not have the same mapping...
ruwiki has:

analysis version: 0.9
mapping version: 1.7

but enwiki has:

analysis version: 0.10
mapping version: 1.9

It's probable that during the last full reindex an error occurred on this index and no one noticed...
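A silent drift like this could in principle be caught by comparing the live mapping against the expected one after each reindex. The following is a minimal sketch, assuming both mappings are available as plain Python dicts (here copied from the two fragments quoted above); the helper name `missing_settings` is hypothetical and not part of any existing tool.

```python
def missing_settings(expected, actual, path=""):
    """Return dotted paths of keys that are absent from, or differ in, `actual`."""
    problems = []
    for key, value in expected.items():
        here = f"{path}.{key}" if path else key
        if key not in actual:
            problems.append(here)
        elif isinstance(value, dict) and isinstance(actual[key], dict):
            # Recurse into nested mapping sections.
            problems.extend(missing_settings(value, actual[key], here))
        elif value != actual[key]:
            problems.append(here)
    return problems


# Expected "namespace" mapping, as quoted in the comment above.
expected = {
    "dynamic": "false",
    "_all": {"enabled": False},
    "_timestamp": {},
    "properties": {
        "name": {
            "type": "string",
            "norms": {"enabled": False},
            "index_options": "docs",
            "analyzer": "near_match_asciifolding",
            "ignore_above": 5000,
        }
    },
}

# Mapping actually found on ruwiki, as quoted in the comment above.
actual = {
    "_timestamp": {},
    "properties": {"name": {"type": "string"}},
}

for setting in missing_settings(expected, actual):
    print(setting)
```

For the two mappings above this lists `dynamic`, `_all` and the four missing settings under `properties.name`, including the `analyzer` that is responsible for the bug.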

In short:
The fix for this bug will happen after the BM25 reindex. The new ruwiki index is ready on codfw and shows the correct mapping. I hope to be able to switch traffic to codfw today; I think the bug will be fixed by then.

dcausse raised the priority of this task from Medium to High. Oct 17 2016, 7:55 AM
dcausse moved this task from needs triage to Current work on the Discovery-Search board.

I encountered this bug some time ago (several months, maybe), but attributed it to some minor "prefix:" issues. Is that consistent with your explanation?

Thank you for your involvement.

@Jack_who_built_the_house yes, if my hypothesis is right this bug has been around for a very long time (maybe more than 1 year) and affects all queries with a (:) and a namespace before that colon.

Just a small update to say that we planned to deploy a config change to route the traffic to the new index in codfw (our backup DC), where the Russian Wikipedia index is supposed to be fixed.
Unfortunately, due to various problems, this could not happen this week; we hope to do it early next week.

Traffic has been sent to codfw (with a new ruwiki index) since yesterday 19:00 UTC; the bug seems to be fixed.
@Jack_who_built_the_house can you confirm?

@Jack_who_built_the_house thanks and sorry about that. It seems the reindex failed for ruwiki on the production cluster (we switched traffic back to production cluster last week).
Running a reindex to fix the issue.

Our reindex process is definitely unreliable (tracked by T149333).

Mentioned in SAL (#wikimedia-operations) [2016-11-23T14:37:54Z] <dcausse> elastic@eqiad: reindexing ruwiki from terbium, logs in ~dcausse/bm25_reindex/cirrus_log (T148344)

Mentioned in SAL (#wikimedia-operations) [2016-11-23T15:50:48Z] <dcausse> elastic@eqiad: ruwiki reindex done (T148344)

The issue is fixed, @Jack_who_built_the_house thanks for your vigilance.