Search does not return exact title match
Closed, Resolved | Public | 2 Estimated Story Points

Event Timeline

dcausse triaged this task as Medium priority. (Jul 14 2020, 1:19 PM)
dcausse moved this task from needs triage to elastic / cirrus on the Discovery-Search board.
dcausse added subscribers: TJones, dcausse.

Pinging @TJones
I believe the problem is similar to T245642, and I'd suggest increasing the near_match weight for zh wiki.

I checked and there doesn't seem to be anything weird going on with the query or page title name for 黄以云. I was worried that there could be a hidden traditional-to-simplified conversion that was causing title and query to get segmented differently, or something similar; but everything seems to be straightforward.

Should we consider increasing the near_match weight everywhere? Would that potentially solve T152442, too?

I'm fine with changing this for all other wikis. As for T152442, I'm not entirely sure. I think it may fix it for eswiki only, because LTR (learning to rank) is not enabled there; on wikis where LTR is enabled, the model will take precedence over the near_match weight.
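To make the reasoning above concrete, here is a toy sketch of how an additive near-match boost can change ranking. It is not CirrusSearch's actual scoring: the `Candidate` class, the `rank` function, the base scores, and the weight values are all hypothetical, chosen only to show the mechanism. On a wiki where an LTR model re-ranks the results, a boost like this would not necessarily decide the final order.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    base_score: float  # hypothetical stand-in for text relevance + popularity


def rank(query, candidates, near_match_weight):
    """Order titles by base score plus a boost for an exact title match.

    Toy model only: the real near_match clause is part of a larger query
    and is not a simple additive constant like this.
    """
    def total(candidate):
        boost = near_match_weight if candidate.title == query else 0.0
        return candidate.base_score + boost

    return [c.title for c in sorted(candidates, key=total, reverse=True)]


# Hypothetical numbers: a well-developed related article vs. a short exact-title page.
candidates = [
    Candidate("Well-developed related article", 9.0),
    Candidate("黄以云", 6.0),
]

low_weight_order = rank("黄以云", candidates, near_match_weight=2.0)
high_weight_order = rank("黄以云", candidates, near_match_weight=5.0)
```

With the smaller weight the related article stays on top (9.0 vs. 8.0); with the larger weight the exact title match moves to first place (11.0 vs. 9.0).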

CBogen set the point value for this task to 2. (Jul 27 2020, 5:23 PM)

Change 618974 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[operations/mediawiki-config@master] Bump the weight of near match for search

https://gerrit.wikimedia.org/r/618974

Change 618974 merged by jenkins-bot:
[operations/mediawiki-config@master] Bump the weight of near match for search

https://gerrit.wikimedia.org/r/618974

Mentioned in SAL (#wikimedia-operations) [2020-08-10T18:04:47Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump the weight of near match for search (T257922) (duration: 00m 59s)

Exact matches should not necessarily be the first result. An example in English is palin: the first result is politician Sarah Palin, the second is comedian Michael Palin, and the third is the exact match Palin. Since popularity and other factors go into the ranking, it makes sense that the two most famous Palins could come before the generic Palin article. Other examples: Schwarzschild (exact match is 4th) and bowel (2nd).

Also, in the English results, before the first result, there is text that says There is a page named "Palin" on Wikipedia with a direct link, and similarly in Chinese, 在维基百科上已有名为“邓亦武”的页面 (automatic translation: There is already a page named " Deng Yiwu " on Wikipedia).

It is also easy in both cases to navigate directly to the page using the search box in the upper right corner, since exact matches are always the first suggestion there.

Chinese presents extra difficulties because breaking a string into words is harder than in, for example, English. In this case, each character in 邓亦武 is treated as a separate word. As a result, other articles score better based on how common those individual "words" are overall and how common they are in each article (along with article popularity and other factors). At least some Chinese searchers know this and artificially break their queries into words to improve the parsing. You can also use quotes, as in "邓亦武", in which case the exact title match article is the second result. This seems appropriate, since the 邓亦武 article is a stub, while the first result is the article for the company he recently became general manager of, and it is much better developed.
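As a rough illustration of the two behaviours described above, the sketch below splits a query into single-character "words" and builds a quoted-phrase full-text query. The per-character split is a simplification of what the comment describes for this one query, not the real Chinese analysis chain, and the helper names are hypothetical. The URL uses the standard MediaWiki `action=query&list=search` endpoint with the `srsearch` parameter.

```python
from urllib.parse import urlencode


def per_character_tokens(text):
    """Treat every non-space character as its own 'word' (simplified)."""
    return [ch for ch in text if not ch.isspace()]


def quoted_phrase(query):
    """Wrap the query in double quotes so it is searched as a phrase."""
    return f'"{query}"'


def fulltext_search_url(wiki_host, query):
    """Build a full-text search URL; no request is sent here."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json",
    }
    return f"https://{wiki_host}/w/api.php?{urlencode(params)}"


tokens = per_character_tokens("邓亦武")   # three separate single-character tokens
phrase = quoted_phrase("邓亦武")          # the same characters as one quoted phrase
url = fulltext_search_url("zh.wikipedia.org", phrase)
```

Searching for the unquoted string lets each of the three tokens match independently, which is why unrelated but better-developed articles can outrank the exact title.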

There will always be specific search results that people disagree on, and even results that no one thinks are ideal. It is generally bad practice in search (and also generally ineffective) to try to optimize for specific individual queries, so we have to accept a few specific examples that aren't perfect.

If you think there is a general systematic failure to bring unambiguously good exact matches to the first page of results on Chinese Wikipedia, and that Chinese Wikipedia users don't generally use the upper right search box to navigate quickly to an exact-match page, and don't use quotes to improve ranking, then please open another ticket to that effect.

Unfortunately, such a problem would require a lot more investigation. First, we would need to find out whether the problem is as bad as it may seem, by looking at exact title searches in the upper right corner search box, the ranking of exact title matches in full-text search results, the use and effectiveness of quotes in exact title matches, and so on, to get a sense of its impact. Then we would need to uncover the underlying cause of any ranking failures, and finally devise an effective and hopefully general solution that doesn't break other searches on Chinese wikis specifically or other wikis in general. Such a task would obviously require a lot more effort than what we did for this ticket, and would have to be prioritized separately.

Hope that all provides some insight into how we go about trying to improve search. The fix for this ticket was something of an easy win; in general, specific ranking problems can be a lot harder!

The reason I opened this task is that Mix'n'Match uses such search results.

I'm not familiar with Mix'n'Match, but I just looked at its page. If it is trying to match titles, it should not be using the full-text search. It should use the OpenSearch API (which is what powers the completion suggester search box in the upper right corner). It only matches against page titles, and exact matches are always at the top of the list.

Example searches for 黄以云 and 邓亦武 on zhwiki, and Palin, Schwarzschild, and bowel on enwiki—all have exact matches as their first result.
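For a tool that needs title matching, a minimal sketch of calling the OpenSearch API might look like the following. It assumes the standard `action=opensearch` endpoint and its JSON response shape of `[query, [titles], [descriptions], [urls]]`. The function names and the User-Agent string are hypothetical, and error handling is omitted.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def opensearch_url(wiki_host, query, limit=5):
    """Build an OpenSearch (title completion) request URL."""
    params = {
        "action": "opensearch",
        "search": query,
        "limit": limit,
        "namespace": 0,  # main (article) namespace only
        "format": "json",
    }
    return f"https://{wiki_host}/w/api.php?{urlencode(params)}"


def titles_from_response(payload):
    """Extract suggested titles from [query, [titles], [descriptions], [urls]]."""
    return list(payload[1])


def fetch_titles(wiki_host, query, limit=5):
    """Send the request and return the suggested titles (needs network access)."""
    request = Request(
        opensearch_url(wiki_host, query, limit),
        # Hypothetical User-Agent; a real tool should identify itself properly.
        headers={"User-Agent": "title-match-example/0.1"},
    )
    with urlopen(request, timeout=10) as response:
        return titles_from_response(json.load(response))


# Usage (not executed here because it needs network access):
#   fetch_titles("zh.wikipedia.org", "黄以云")
#   fetch_titles("en.wikipedia.org", "Palin")
```

Because this endpoint matches only against page titles, a tool can take the first returned title as its candidate instead of scanning full-text results.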