
intitle search doesn't match stop words
Closed, Resolved · Public

Description

I tried to search for articles with "dari Spanyol" (from Spain) in the title using "intitle:dari Spanyol", but it returned 0 results. The same happened when I searched "intitle:dari" (from), but searching "intitle:Spanyol" (Spain) gave the expected results.

  1. https://id.wikipedia.org/w/index.php?title=Istimewa%3APencarian&profile=default&search=intitle%3Adari+spanyol&fulltext=Search&uselang=en
  2. https://id.wikipedia.org/w/index.php?title=Istimewa%3APencarian&profile=default&search=intitle%3Adari&fulltext=Search&uselang=en
  3. https://id.wikipedia.org/w/index.php?title=Istimewa%3APencarian&profile=default&search=intitle%3Aspanyol&fulltext=Search&uselang=en

Expecting some kind of error message other than "There were no results matching the query."


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=54875

Details

Reference
bz66969
Related Gerrit Patches:
mediawiki/extensions/CirrusSearch : master : Match stop words with intitle keyword

Event Timeline

bzimport raised the priority of this task from to High. Nov 22 2014, 3:31 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz66969.
Bennylin created this task. Jun 23 2014, 7:09 AM

I suspect this is caused by some kind of language-based stop word list, in this case for Indonesian, for three reasons:

  1. other Indonesian stop words also return no results ("intitle:di" - in, "intitle:ke" - to)
  2. searches for those same words ("intitle:di", "intitle:ke", "intitle:dari") do return results on other projects
  3. based on my experience, id.wp's CirrusSearch employs some kind of Indonesian-language stemmer

If that is true, is it possible to disable the stop words?

Further investigation:

Searching "intitle:di" in Italian Wikipedia also failed: https://it.wikipedia.org/w/index.php?title=Speciale%3ARicerca&profile=advanced&search=intitle%3Adi&fulltext=Search&ns0=1&profile=advanced

But searching "intitle:from" in English Wikipedia and "intitle:von" in German Wikipedia yields the expected results.

(btw, my search context was noble titles, e.g. "ABC from XYZ", which translates to "ABC dari XYZ" in id.wp and "ABC di XYZ" in it.wp, and so on)

Probably related

So, where can I look at the Indonesian stopwords list, and/or stemmer?

Looks like this is the stemmer:
https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/id/IndonesianStemmer.java
These are the stopwords:
https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt

Those bugs are related. The reason we haven't fixed them is that it's a pretty large effort and we're still concentrating on performance. It's on the list, but it isn't as high as I'd like it to be.

demon removed a subscriber: demon. Aug 19 2015, 3:44 PM
Restricted Application added a project: Discovery. Aug 19 2015, 3:44 PM
Restricted Application added a subscriber: Aklapper.

What about at least giving a better error message than "There were no results matching the query."?

Something like "Matching one or more stopwords. Search aborted."

Or better, if the search gives no results:

  1. if the search terms contain words other than stopwords, search again without the stopwords, or
  2. search without intitle (it works: https://id.wikipedia.org/w/index.php?title=Istimewa%3APencarian&profile=default&search=dari+spanyol&fulltext=Search)

Side question: why are stopwords enabled in intitle search anyway?

Deskana renamed this task from "intitle search doesn't work" to "intitle search doesn't work in some cases on the Indonesian Wikipedia". Dec 29 2015, 10:57 PM
Deskana removed Manybubbles as the assignee of this task.
Deskana lowered the priority of this task from High to Lowest.
Deskana set Security to None.
Deskana moved this task from Needs triage to Search on the Discovery board.
TJones renamed this task from "intitle search doesn't work in some cases on the Indonesian Wikipedia" to "intitle search doesn't match stop words". Jul 17 2018, 6:53 PM
Restricted Application added a project: Discovery-Search. Jul 17 2018, 6:53 PM
TJones added a subscriber: TJones. Jul 17 2018, 7:17 PM

I stumbled across this ticket today and I thought I'd add some information.

I changed the title to reflect that this is not just a problem on Indonesian-language wikis, and that the problem is really about stop words.

If you really need to match stop words in titles, there is another way to do it. intitle recently got an upgrade to support regular expression searches. It is case sensitive unless you specify the i flag. So, intitle:/dari/i would work, though it will get extra results because you can't specify word boundaries (so it matches "Darius" and "Darin"). Regex searches will often time out and give incomplete results unless you specify additional search terms, like intitle:/dari/i spanyol or intitle:/dari/i intitle:spanyol. Ranking on plain regex searches is pretty random, too.
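
To illustrate the caveat about word boundaries, here is a minimal Python sketch of substring-style regex matching over titles. It is not how CirrusSearch implements regex search; the title list and the `regex_title_search` helper are hypothetical and exist only to show why a pattern like /dari/i also picks up "Darius" and "Darin".

```python
import re

# Hypothetical titles, chosen only to illustrate the matching behaviour.
titles = ["Ratu dari Spanyol", "Darius I", "Darin", "Spanyol"]


def regex_title_search(pattern, titles, ignore_case=False):
    """Return the titles that contain a match for `pattern` anywhere.

    The pattern is not anchored to word boundaries, so a short pattern
    also matches inside longer words.
    """
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    return [title for title in titles if compiled.search(title)]


# Case sensitive by default: only the lowercase word "dari" matches.
case_sensitive = regex_title_search("dari", titles)

# With the equivalent of the i flag, "Darius I" and "Darin" match as well.
case_insensitive = regex_title_search("dari", titles, ignore_case=True)

print(case_sensitive)
print(case_insensitive)
```

The case-insensitive run is the one that returns the extra titles, which is the over-matching described above.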

As for why it is this way, language analysis on titles is probably a good thing, since it allows stemmed matches and intitle:cats matches the article "Cat" on enwiki. One downside is that stop words get dropped. Eliminating stop word processing from titles is possible, but would require us to unpack every language analyzer and have parallel configurations for titles and text (though maybe we could be smarter about that—but it would definitely be much more complex). The alternative would be to not have language analysis on titles. At first glance, language analysis on titles seems good, but I haven't investigated it carefully.
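
The trade-off can be shown with a toy analyzer. This is only a sketch under stated assumptions: the three-word stop list is an illustrative subset, `toy_stem` is a made-up stemmer that strips a trailing "s", and treating an empty analyzed query as "no match" is my assumption about the observed behaviour, not a description of the real Lucene analysis chain.

```python
import re

# Illustrative subset only; the real lists ship with the Lucene analyzers.
STOP_WORDS = {"dari", "di", "ke"}


def toy_stem(token):
    """Hypothetical stemmer: strips a trailing 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def analyze(text):
    """Lowercase, split on non-word characters, drop stop words, then stem."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]


def intitle(query, title):
    """Match when every analyzed query token appears in the analyzed title.

    Assumption: a query that analyzes to nothing matches nothing.
    """
    query_tokens = analyze(query)
    title_tokens = set(analyze(title))
    return bool(query_tokens) and all(t in title_tokens for t in query_tokens)


print(intitle("cats", "Cat"))                # stemming helps here
print(intitle("dari", "Ratu dari Spanyol"))  # stop word is dropped
print(intitle(",", "Ratu dari Spanyol"))     # punctuation is dropped too
```

In this toy version the stemmed query matches, while the stop word and punctuation queries analyze to an empty token list and so find nothing.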

As for why we can't have better error messages, there aren't really hooks into the stemmer to report what happened to each token and why. Punctuation also gets dropped, for example, so intitle:, doesn't find anything either.

TJones added a subscriber: dcausse.

@dcausse thinks this might be an easy fix—enabling the title plain field—so I've moved it to our backlog.

As for why we can't have better error messages, there aren't really hooks into the stemmer to report what happened to each token and why.

We do have an explain analyze api in elasticsearch that will emit the tokens after each stage of transformation. I'm not sure that could be turned into any sane UI that would make sense to an end user though.
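
For anyone who wants to look at that output, a request to the Elasticsearch `_analyze` endpoint with `explain` set to true might be built as in the sketch below. The host URL is a placeholder, and using the built-in `indonesian` analyzer is an assumption; the analysis chain actually configured on the production wikis may differ. The sketch only builds the request and does not send it.

```python
import json


def build_analyze_request(text, analyzer="indonesian",
                          host="http://localhost:9200"):
    """Build the URL and JSON body for an _analyze call with explain enabled.

    `host` is a placeholder and `analyzer` is an assumed built-in analyzer
    name; neither is taken from the production configuration.
    """
    url = f"{host}/_analyze"
    body = {
        "analyzer": analyzer,
        "text": text,
        # Ask Elasticsearch to report the tokens after each analysis stage.
        "explain": True,
    }
    return url, json.dumps(body)


url, payload = build_analyze_request("dari Spanyol")
print(url)
print(payload)
```

The payload can then be sent as the body of a GET or POST request with any HTTP client pointed at a cluster you have access to.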

Change 447943 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Match stop words with intitle keyword

https://gerrit.wikimedia.org/r/447943

intitle:/dari/i intitle:spanyol would not work, as the second intitle would be ignored and it would search "spanyol" on the text body. But intitle:/dari spanyol/i works like my original intention. Thanks.

If you really need to match stop words in titles, there is another way to do it. intitle recently got an upgrade to support regular expression searches. It is case sensitive unless you specify the i flag. So, intitle:/dari/i would work, though it will get extra results because you can't specify word boundaries (so it matches "Darius" and "Darin"). Regex searches will often time out and give incomplete results unless you specify additional search terms, like intitle:/dari/i spanyol or intitle:/dari/i intitle:spanyol. Ranking on plain regex searches is pretty random, too.

The second intitle definitely does something, because searching with it gives 35 hits, while searching without it gives 331 hits. @dcausse has been improving the query parser, and it may be much smarter about two occurrences of intitle than it was when you opened this ticket.

It's always hard to figure out the intent of a search! It's a recurring problem. I wasn't sure if you were originally looking for the phrase dari spanyol in the title (which it seems you were), or dari in the title and spanyol in the title, or dari in the title and spanyol anywhere. (And that's why query parsing is hard!)

Anyway, even if intitle:/dari spanyol/i does what you want, it wouldn't hurt to add intitle:spanyol if you want to run a lot of queries like that, because the non-regex query will run first, which means the regex only has to look at 230 titles, instead of 430K (i.e., all of idwiki).
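
The idea of letting the cheap term filter run before the regex can be sketched like this. The titles and the `search_with_prefilter` helper are hypothetical, and the real engine does this inside the index rather than over a Python list; the point is only that the regex runs on the narrowed candidate set.

```python
import re

# Hypothetical titles standing in for a whole wiki.
titles = ["Ratu dari Spanyol", "Darius I", "Sejarah Spanyol", "Darin"]


def search_with_prefilter(titles, required_word, pattern):
    """Filter by a plain word first, then run the regex on the survivors.

    Returns both the candidate list and the final hits so the reduction
    in regex work is visible.
    """
    word = required_word.lower()
    candidates = [t for t in titles if word in t.lower().split()]
    regex = re.compile(pattern, re.IGNORECASE)
    hits = [t for t in candidates if regex.search(t)]
    return candidates, hits


candidates, hits = search_with_prefilter(titles, "spanyol", "dari spanyol")
print(len(titles), len(candidates), hits)
```

Here the regex is evaluated against two candidates instead of all four titles, which is the same effect described above at a much smaller scale.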

The good news is that intitle:dari should be working within a week or two of patch 447943 (above) being merged. It turned out to be easier than I'd expected. The unanalyzed index (which doesn't strip stop words) was already there, it just wasn't being searched in this case (it's used for regexes, of course!) so changing up the underlying query was reasonably straightforward.
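
As a rough picture of the change, and not the actual code in patch 447943, the sketch below matches a query against both an analyzed token list (stop words stripped) and a plain one (stop words kept). The stop list is an illustrative subset, and combining the two fields with a simple "or" is an assumption made for the example.

```python
import re

# Illustrative subset only.
STOP_WORDS = {"dari", "di", "ke"}


def plain_tokens(text):
    """Unanalyzed view: lowercase word tokens, stop words kept."""
    return re.findall(r"\w+", text.lower())


def analyzed_tokens(text):
    """Analyzed view: the same tokens with stop words removed."""
    return [t for t in plain_tokens(text) if t not in STOP_WORDS]


def intitle_with_plain(query, title):
    """Match if the query matches either the analyzed or the plain view."""
    query_plain = plain_tokens(query)
    if not query_plain:
        return False
    query_analyzed = analyzed_tokens(query)
    in_analyzed = bool(query_analyzed) and (
        set(query_analyzed) <= set(analyzed_tokens(title))
    )
    in_plain = set(query_plain) <= set(plain_tokens(title))
    return in_analyzed or in_plain


print(intitle_with_plain("dari", "Ratu dari Spanyol"))
print(intitle_with_plain("dari", "Darius I"))
```

Because the plain view compares whole tokens, the stop word now matches the first title without also matching "Darius I".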

Change 447943 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Match stop words with intitle keyword

https://gerrit.wikimedia.org/r/447943

debt closed this task as Resolved.Jul 31 2018, 5:57 PM