intitle search doesn't match stop words
Closed, Resolved (Public)

Assigned To

Authored By

	Bennylin
	Jun 23 2014, 7:09 AM

Description

I tried to search for articles with "intitle:dari Spanyol" (from Spain) in the title, but it gave 0 results. The same happened when I searched "intitle:dari" (from), but it gave the expected results when I searched "intitle:Spanyol" (Spain).

Expecting some kind of error message other than "There were no results matching the query."

Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=54875

Details

Reference: bz66969

	Subject	Repo	Branch	Lines +/-
	Match stop words with intitle keyword	mediawiki/extensions/CirrusSearch	master	+511 -177


Related Objects

Mentioned In: T56875: Automatic stopwords for the 200+ languages without their own analyzer available
T88724: EPIC: CirrusSearch: various undesirables detected in Russian Wikipedia

Event Timeline

• bzimport raised the priority of this task from to High. Nov 22 2014, 3:31 AM

• bzimport added a project: CirrusSearch.

• bzimport set Reference to bz66969.

Bennylin created this task. Jun 23 2014, 7:09 AM

Something is certainly weird here. Temporary workaround:
https://id.wikipedia.org/w/index.php?title=Istimewa%3APencarian&profile=default&search=intitle%3A%22dari+spanyol%22&fulltext=Search

I suspect it is some kind of language-based stop words, in this case Indonesian language, because of three reasons:

other Indonesian stop words also didn't show up ("intitle:di" - in, "intitle:ke" - to)
those words ("intitle:di", "intitle:ke", "intitle:dari") are found in other projects
based on my experience, id.wp's CirrusSearch employs some kind of Indonesian-language stemmer

If that is true, is it possible to disable the stop words?

Further investigation:

Searching "intitle:di" in Italian Wikipedia also failed https://it.wikipedia.org/w/index.php?title=Speciale%3ARicerca&profile=advanced&search=intitle%3Adi&fulltext=Search&ns0=1&profile=advanced

But searching "intitle:from" in English Wikipedia and "intitle:von" in German Wikipedia yields the expected results.

(btw, my search context was noble titles, e.g. "ABC from XYZ", which translates to "ABC dari XYZ" in id.wp and "ABC di XYZ" in it.wp, and so on)

Further investigation: searching in similar projects

id.wp and ms.wp are similar, so I compared them; I also compared it.wp with scn.wp, and en.wp with simple.wp:

"intitle:dari"
1 id.wp - failed
2 ms.wp - success

"intitle:di"
3 it.wp - failed
4 scn.wp - failed

"intitle:of"
5 en.wp - error
6 simple.wp - success

links:
1 https://id.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&fulltext=Search&search=intitle%3Adari
2 https://ms.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&fulltext=Search&search=intitle%3Adari
3 https://it.wikipedia.org/w/index.php?title=Speciale%3ARicerca&profile=advanced&fulltext=Search&search=intitle%3Adi
4 https://scn.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&fulltext=Search&search=intitle%3Adi
5 https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&fulltext=Search&search=intitle%3Aof
6 https://simple.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&fulltext=Search&search=intitle%3Aof
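For anyone who wants to repeat the comparison on more wikis, here is a minimal sketch that rebuilds the URLs above. The helper name `intitle_search_url` is made up for illustration; the query parameters are copied from the links in this comment.

```python
from urllib.parse import urlencode


def intitle_search_url(host: str, term: str) -> str:
    """Build a Special:Search URL for an intitle: query on the given wiki host."""
    params = {
        "title": "Special:Search",
        "profile": "advanced",
        "fulltext": "Search",
        "search": f"intitle:{term}",
    }
    # urlencode percent-encodes the colons, matching the links listed above.
    return f"https://{host}/w/index.php?{urlencode(params)}"


if __name__ == "__main__":
    pairs = [
        ("id.wikipedia.org", "dari"),
        ("ms.wikipedia.org", "dari"),
        ("en.wikipedia.org", "of"),
        ("simple.wikipedia.org", "of"),
    ]
    for host, term in pairs:
        print(intitle_search_url(host, term))
```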

Probably related

[[bugzilla:54875]] Automatic stopwords for the 200+ languages without their own analyzer available
[[bugzilla:60362]] CirrusSearch: Stopwords are not optional and are worth as much as exact matches
https://www.mail-archive.com/mediawiki-commits@lists.wikimedia.org/msg169298.html

So, where can I look at the Indonesian stopwords list, and/or stemmer?

Looks like this is the stemmer:
https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/id/IndonesianStemmer.java
These are the stopwords:
https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt

Those bugs are related. The reason we haven't fixed them is that it's a pretty large effort and we're still concentrating on performance. It's on the list, but it isn't as high as I'd like it to be.

Nemo_bis mentioned this in T88724: EPIC: CirrusSearch: various undesirables detected in Russian Wikipedia. Apr 21 2015, 9:50 AM

• demon unsubscribed. Aug 19 2015, 3:44 PM

Restricted Application added a project: Discovery-ARCHIVED. Aug 19 2015, 3:44 PM

Restricted Application added a subscriber: Aklapper.

What about at least giving a better error than "There were no results matching the query."?

like "Matching one or more stopwords. Search aborted."

or better, if it gives no results,

if search terms contain other words than stopwords, search again without the stopwords, or
search without the intitle (it works: https://id.wikipedia.org/w/index.php?title=Istimewa%3APencarian&profile=default&search=dari+spanyol&fulltext=Search)

Side question: why are stopwords enabled in intitle search anyway?

• Deskana renamed this task from intitle search doesn't work to intitle search doesn't work in some cases on the Indonesian Wikipedia. Dec 29 2015, 10:57 PM

• Deskana removed • Manybubbles as the assignee of this task.

• Deskana lowered the priority of this task from High to Lowest.

• Deskana set Security to None.

• Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.

TJones renamed this task from intitle search doesn't work in some cases on the Indonesian Wikipedia to intitle search doesn't match stop words. Jul 17 2018, 6:53 PM

Restricted Application added a project: Discovery-Search. Jul 17 2018, 6:53 PM

I stumbled across this ticket today and I thought I'd add some information.

I changed the title to reflect that this is not just a problem on Indonesian-language wikis, and that the problem is really about stop words.

If you really need to match stop words in titles, there is another way to do it. intitle recently got an upgrade to support regular expression searches. It is case sensitive unless you specify the i flag. So, intitle:/dari/i would work, though it will get extra results because you can't specify word boundaries (so it matches "Darius" and "Darin"). Regex searches will often time out and give incomplete results unless you specify additional search terms, like intitle:/dari/i spanyol or intitle:/dari/i intitle:spanyol. Ranking on plain regex searches is pretty random, too.
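To make the over-matching concrete, here is a small sketch using Python's `re` module as a stand-in. CirrusSearch's regex engine is not Python's, and the sample titles are made up, so this only illustrates the idea of a case-insensitive substring match without word boundaries.

```python
import re

# Hypothetical sample titles, not real search results.
titles = ["Isabel dari Spanyol", "Darius I", "Darin", "Spanyol"]

# Rough equivalent of intitle:/dari/i: case-insensitive, no word boundaries.
pattern = re.compile(r"dari", re.IGNORECASE)
matches = [t for t in titles if pattern.search(t)]

# Adding a second required term narrows the results, as suggested above.
narrowed = [t for t in matches if "spanyol" in t.lower()]

if __name__ == "__main__":
    print(matches)   # includes "Darius I" and "Darin" as extra hits
    print(narrowed)
```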

As for why it is this way, language analysis on titles is probably a good thing, since it allows stemmed matches and intitle:cats matches the article "Cat" on enwiki. One downside is that stop words get dropped. Eliminating stop word processing from titles is possible, but would require us to unpack every language analyzer and have parallel configurations for titles and text (though maybe we could be smarter about that—but it would definitely be much more complex). The alternative would be to not have language analysis on titles. At first glance, language analysis on titles seems good, but I haven't investigated it carefully.
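A toy model of why the stop word never matches, assuming a drastically simplified analyzer: it lowercases and drops stop words, with no stemming. The stop word set is a tiny subset taken from this thread, the sample title is made up, and none of this is CirrusSearch's actual code.

```python
# Tiny illustrative subset of Indonesian stop words mentioned in this task.
STOP_WORDS = {"dari", "di", "ke"}


def analyze(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def build_index(titles):
    """Map each surviving token to the set of titles that contain it."""
    index = {}
    for title in titles:
        for tok in analyze(title):
            index.setdefault(tok, set()).add(title)
    return index


def intitle(index, term):
    """Look up a single term after running it through the same analyzer."""
    tokens = analyze(term)
    if not tokens:
        # The stop word was removed, so there is nothing left to look up.
        return set()
    return index.get(tokens[0], set())


if __name__ == "__main__":
    idx = build_index(["Isabel dari Spanyol"])  # hypothetical title
    print(intitle(idx, "spanyol"))
    print(intitle(idx, "dari"))
```

In this model "dari" is never written to the index at all, so no query against the analyzed field can find it.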

As for why we can't have better error messages, there aren't really hooks into the stemmer to report what happened to each token and why. Punctuation also gets dropped, for example, so intitle:, doesn't find anything either.

@dcausse thinks this might be an easy fix—enabling the title plain field—so I've moved it to our backlog.

In T68969#4431595, @TJones wrote:

As for why we can't have better error messages, there aren't really hooks into the stemmer to report what happened to each token and why.

We do have an explain analyze api in elasticsearch that will emit the tokens after each stage of transformation. I'm not sure that could be turned into any sane UI that would make sense to an end user though.
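For reference, a sketch of what such a request could look like. It only builds the request and does not send it. It assumes Elasticsearch's built-in `indonesian` analyzer, which is not necessarily the analysis chain the wikis actually use, and the helper name is made up.

```python
import json


def explain_analyze_request(text, analyzer="indonesian"):
    """Return (method, path, body) for an Elasticsearch _analyze call with explain on."""
    body = {
        "analyzer": analyzer,  # assumption: built-in language analyzer
        "text": text,
        "explain": True,       # ask for the token stream after each analysis stage
    }
    return "GET", "/_analyze", json.dumps(body)


if __name__ == "__main__":
    method, path, body = explain_analyze_request("dari Spanyol")
    print(method, path)
    print(body)
```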

Change 447943 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Match stop words with intitle keyword

https://gerrit.wikimedia.org/r/447943

gerritbot added a project: Patch-For-Review. Jul 26 2018, 12:31 AM

intitle:/dari/i intitle:spanyol would not work, as the second intitle would be ignored and it would search "spanyol" on the text body. But intitle:/dari spanyol/i works like my original intention. Thanks.

In T68969#4431595, @TJones wrote:

If you really need to match stop words in titles, there is another way to do it. intitle recently got an upgrade to support regular expression searches. It is case sensitive unless you specify the i flag. So, intitle:/dari/i would work, though it will get extra results because you can't specify word boundaries (so it matches "Darius" and "Darin"). Regex searches will often time out and give incomplete results unless you specify additional search terms, like intitle:/dari/i spanyol or intitle:/dari/i intitle:spanyol. Ranking on plain regex searches is pretty random, too.

The second intitle definitely does something, because searching with it gives 35 hits, while searching without it gives 331 hits. @dcausse has been improving the query parser, and it may be much smarter about two occurrences of intitle than it was when you opened this ticket.

It's always hard to figure out the intent of a search! It's a recurring problem. I wasn't sure if you were originally looking for the phrase dari spanyol in the title (which it seems you were), or dari in the title and spanyol in the title, or dari in the title and spanyol anywhere. (And that's why query parsing is hard!)

Anyway, even if intitle:/dari spanyol/i does what you want, it wouldn't hurt to add intitle:spanyol if you want to run a lot of queries like that, because the non-regex query will run first, which means the regex only has to look at 230 titles, instead of 430K (i.e., all of idwiki).
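The idea of letting the cheap term filter run before the regex can be sketched like this. It is a simplified illustration with made-up titles and a hypothetical `search` helper, using Python's `re` rather than the real search backend.

```python
import re


def search(titles, required_word, pattern):
    """Filter by a plain word first, then run the regex only on the survivors."""
    regex = re.compile(pattern, re.IGNORECASE)
    # Cheap step: keep only titles containing the required word as a token.
    candidates = [t for t in titles if required_word in t.lower().split()]
    # Expensive step: the regex scans the reduced candidate list only.
    hits = [t for t in candidates if regex.search(t)]
    return hits, len(candidates)


if __name__ == "__main__":
    sample = ["Isabel dari Spanyol", "Spanyol", "Darius I", "Sejarah Spanyol"]
    hits, scanned = search(sample, "spanyol", r"dari spanyol")
    print(hits, scanned)
```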

The good news is that intitle:dari should be working within a week or two of patch 447943 (above) being merged. It turned out to be easier than I'd expected. The unanalyzed index (which doesn't strip stop words) was already there, it just wasn't being searched in this case (it's used for regexes, of course!) so changing up the underlying query was reasonably straightforward.
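Roughly, the change amounts to also consulting the unanalyzed field. The sketch below is a toy model of that idea and not the logic of the actual patch: the stop word set, the sample title, and the choice to union the two lookups are all assumptions made for illustration.

```python
# Tiny illustrative subset of stop words from this task.
STOP_WORDS = {"dari", "di", "ke"}


def index_titles(titles):
    """Build an analyzed index (stop words dropped) and a plain index (all tokens kept)."""
    analyzed, plain = {}, {}
    for title in titles:
        for tok in title.lower().split():
            plain.setdefault(tok, set()).add(title)
            if tok not in STOP_WORDS:
                analyzed.setdefault(tok, set()).add(title)
    return analyzed, plain


def intitle(term, analyzed, plain):
    """Match a term against either field, so stop words can still be found."""
    tok = term.lower()
    return analyzed.get(tok, set()) | plain.get(tok, set())


if __name__ == "__main__":
    analyzed, plain = index_titles(["Isabel dari Spanyol"])  # hypothetical title
    print(intitle("dari", analyzed, plain))
```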

dcausse moved this task from Incoming to Needs review on the Discovery-Search (Current work) board. Jul 30 2018, 2:31 PM

Change 447943 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Match stop words with intitle keyword

https://gerrit.wikimedia.org/r/447943

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Jul 30 2018, 10:54 PM

EBernhardson claimed this task. Jul 31 2018, 5:26 PM

debt closed this task as Resolved. Jul 31 2018, 5:57 PM

Liuxinyu970226 mentioned this in T56875: Automatic stopwords for the 200+ languages without their own analyzer available. Sep 24 2018, 7:21 AM