EPIC: CirrusSearch: various undesirables detected in Russian Wikipedia
Closed, Invalid (Public)

Description

There is currently a large discussion on Russian Wikipedia about the search engine working worse than the previous one.

The core of the problem is that we have a gadget that suggests wiki links for a selected word based on the internal search results.
I am also posting a link to the gadget itself here for information, but the issue lies not with the gadget but with the search itself, whose logic we still can't understand.

We gathered some examples of the search results:

  1. searching for the following article - https://ru.wikipedia.org/wiki/Гагарин,_Юрий_Алексеевич:
  2. searching for - https://ru.wikipedia.org/wiki/Феофан_Затворник
  3. searching for - https://ru.wikipedia.org/wiki/Иван_Грозный
  4. searching for - https://ru.wikipedia.org/wiki/Болотов,_Василий_Васильевич

See also: T68969: intitle search doesn't match stop words

Event Timeline

Болотов В* В* looks to be caused by the generated query not including the phrase rescore that is usually included.

Иван Гр* is the same thing.

And here is the same problem in English: b* b* king.
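For readers unfamiliar with the term, a "phrase rescore" can be sketched with Elasticsearch's rescore request section, which re-scores the top hits of a main query with a second query. The sketch below is only an illustration: the helper name `with_phrase_rescore`, the `title` field, and the weights are assumptions, not the query CirrusSearch actually generates.

```python
def with_phrase_rescore(main_query, field, phrase, window_size=512, slop=1):
    """Wrap a main query with a phrase rescore over the top hits.

    Hypothetical helper: field name, window size, slop and weights
    are illustrative assumptions, not CirrusSearch's real settings.
    """
    return {
        "query": main_query,
        "rescore": {
            # Only the top `window_size` hits per shard are re-scored.
            "window_size": window_size,
            "query": {
                # Hits containing the terms as a phrase get a boost.
                "rescore_query": {
                    "match_phrase": {field: {"query": phrase, "slop": slop}}
                },
                "query_weight": 1.0,
                "rescore_query_weight": 10.0,
            },
        },
    }


body = with_phrase_rescore({"match": {"title": "b b king"}}, "title", "b b king")
```

Without the `rescore` section, hits that merely contain the terms in any order are ranked by the main query alone, which is consistent with the sorting trouble described above.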

The tests hit an English wiki, so if possible I try to rewrite issues into English. It's also easier for me to read.

Now, the "Феофан Затворни*" issue I don't yet understand.

intitle:Болотов intitle:В* intitle:В* also doesn't get the phrase rescore, so it's going to have trouble sorting. OTOH you mentioned that Болотов,_Валерий_Дмитриевич is the incorrect result. If you only want results in that form then intitle:"Болотов В* В*" is likely to do a better job for you.

Getting intitle:Болотов intitle:В* intitle:В* to do the right phrase search is going to be more difficult to fix. I don't know when I'll have the time for that given the other projects I'm on.

Still need to figure out "Феофан Затворни*".

Scratch that intitle: theory - it was wrong.

Are there any ideas? Can we get the old search engine back for a period until CirrusSearch is fixed?
We can organize a local discussion if needed.

The old lsearchd infrastructure has been shut down and decommissioned, so switching back to it from CirrusSearch is not possible for any wiki.

Anything is *possible*. But in this case it'd be really time-consuming and upset lots of people. I think we're much better off spending time fixing this in Cirrus than rolling back.

I'd love to work on this but I don't see when I'll have time. I dunno if this is something we can delegate to Chad with his new role or not. I'd feel much better if we had some kind of commitment to handle this soon. As is I just feel upset that I can't help ruwiki because I'm spoken for. It gives me that upset stomach feeling and a headache.

Now that I think about it, @Jdouglas asked if he could help with the WDQ stuff. Maybe he can help with this. Some degree of cross-training would be useful. He also knows Java, so if it comes to hacking on Elasticsearch or one of our plugins he's a good candidate anyway.

@Manybubbles <hat type="PM">Getting more folks comfortable with the search stack is a great idea</hat>

@Jdouglas Are you interested? @chad how about you if James can't pick this up?

@bd808 just having another person who could do second line support in case something goes sideways would be sweet. I _think_ a couple of these issues are quick to fix. I believe some of them are more time consuming but can't be sure.

Just talked to @Jdouglas and it looks like he's better suited to WDQ stuff than this. He has some past experience with knowledge graphs and stuff. So, @bd808, maybe find another person? I really think we should get this started as soon as we can.

Here are some more examples that are related to (2) above:

target: https://en.wikipedia.org/wiki/Functional_programming

| query | results |
| --- | --- |
| `intitle:functional intitle:programming` | second result |
| `intitle:functional* intitle:programming*` | no results |
| `intitle:functional* intitle:programmin*` | no results |
| `intitle:functional intitle:programmin*` | no results |
| `intitle:functional intitle:programming*` | no results |
| `intitle:functional intitle:p*` | no results |
| `"functional programming"` | second result |
| `"functional programming*"` | second result |
| `"functional programmin*"` | no results |

How is "foo*" supposed to work? The case of asterisks within quotes is not documented on Help:CirrusSearch, so it might be somewhat ambiguous to users.

So what ruwiki needs is for the phrase rescore to still work for searches
like `foo* bar*`. That _currently_ relies on Elasticsearch's behavior
for "foo* bar*". By "work" I mean that it should be a phrase search of
prefix searches.
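To make "a phrase search of prefix searches" concrete, here is a minimal pure-Python sketch of the intended matching semantics: every query term must match consecutive title tokens in order, and a trailing `*` turns a term into a prefix match. The function names and the simple lowercase word tokenizer are assumptions for illustration; real analysis in Elasticsearch is more involved.

```python
import re


def tokenize(title):
    # Lowercased word tokens; \w is Unicode-aware in Python 3,
    # so Cyrillic titles are split the same way as Latin ones.
    return re.findall(r"\w+", title.lower())


def term_matches(term, token):
    # A trailing "*" means prefix match; otherwise exact match.
    if term.endswith("*"):
        return token.startswith(term[:-1])
    return token == term


def matches_prefix_phrase(title, query):
    """True if the query terms match consecutive title tokens in order."""
    terms = query.lower().split()
    tokens = tokenize(title)
    if not terms:
        return False
    # Slide a window of len(terms) tokens across the title.
    for start in range(len(tokens) - len(terms) + 1):
        window = tokens[start:start + len(terms)]
        if all(term_matches(t, tok) for t, tok in zip(terms, window)):
            return True
    return False


print(matches_prefix_phrase("Болотов, Василий Васильевич", "Болотов В* В*"))
print(matches_prefix_phrase("Болотов, Валерий Дмитриевич", "Болотов В* В*"))
```

Under these semantics the first title matches and the second does not, since "Дмитриевич" does not start with "В".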

Does ES currently support this? It seems like it's not yet in Lucene: https://issues.apache.org/jira/browse/LUCENE-1486

Deep down it probably does. My guess is it doesn't recognize the syntax and no one noticed until ruwiki.
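One way Elasticsearch can express an ordered phrase of prefix terms is the span query family: `span_near` over `span_multi` clauses that wrap `prefix` queries. The sketch below only builds such a request body as a Python dict. The helper name `prefix_phrase_query`, the `title` field, and lowercasing the terms are assumptions, and nothing here is confirmed to be what CirrusSearch uses or would use.

```python
def prefix_phrase_query(field, query, slop=0):
    """Build a span_near query whose clauses are prefix or exact terms.

    Hypothetical helper: assumes the field's analyzer lowercases tokens,
    since span and prefix queries are not analyzed.
    """
    clauses = []
    for term in query.lower().split():
        if term.endswith("*"):
            # span_multi lets a multi-term query (here: prefix)
            # take part in a span query.
            clauses.append(
                {"span_multi": {"match": {"prefix": {field: {"value": term[:-1]}}}}}
            )
        else:
            clauses.append({"span_term": {field: term}})
    # in_order plus slop=0 requires the terms to be adjacent and ordered.
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": True}}


query = prefix_phrase_query("title", "functional programmin*")
```

Prefix clauses are expanded to all matching terms, so very short prefixes such as `p*` can be expensive on a large index.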

Some updates (and links) for the above searches:

Note: the content was removed from this comment and moved into the task description.

Jdouglas updated the task description.
Jdouglas updated the task description.
Jdouglas renamed this task from inefficient work of CirrusSearch in Russian Wikipedia to [Epic] CirrusSearch: various undesirables detected in Russian Wikipedia. Apr 2 2015, 8:16 PM
Jdouglas renamed this task from [Epic] CirrusSearch: various undesirables detected in Russian Wikipedia to [epic] CirrusSearch: various undesirables detected in Russian Wikipedia.
Jdouglas renamed this task from [epic] CirrusSearch: various undesirables detected in Russian Wikipedia to [EPIC] CirrusSearch: various undesirables detected in Russian Wikipedia.
Jdouglas renamed this task from [EPIC] CirrusSearch: various undesirables detected in Russian Wikipedia to [Epic] CirrusSearch: various undesirables detected in Russian Wikipedia.
Jdouglas renamed this task from [Epic] CirrusSearch: various undesirables detected in Russian Wikipedia to [epic] CirrusSearch: various undesirables detected in Russian Wikipedia.

Heh, sorry about the noise. I'm done renaming this now.

Everything in the task description has been captured in break-out issues, and I have removed the poor-man's status-tracking strikethrough text.

I think the current practice is adding Epic as a project, not in the title.

Aklapper renamed this task from [epic] CirrusSearch: various undesirables detected in Russian Wikipedia to CirrusSearch: various undesirables detected in Russian Wikipedia. Apr 3 2015, 11:57 AM
Aklapper added a project: Epic.

(Setting the Epic tag instead of prefixes in the summary.)

Wow Rubin16, thanks for the detailed report!

Note, I misremembered that Lucene had a lot of custom analysis for Russian, but I find little (not even a stopwords list). Maybe it was another language.

Deskana renamed this task from CirrusSearch: various undesirables detected in Russian Wikipedia to EPIC: CirrusSearch: various undesirables detected in Russian Wikipedia. Dec 3 2015, 5:54 PM
Deskana lowered the priority of this task from High to Low.
Deskana subscribed.

Since we have not touched this task for months, I am lowering its priority to reflect reality.

This isn't really a task; it is just a collection of semi-related issues. Each individual issue in here has its own subtask, so there's nothing to do here. Closing as invalid.