Search does not return exact title match
Closed, Resolved · Public · 2 Estimated Story Points

Assigned To

Authored By

	Bugreporter
	Jul 14 2020, 1:11 PM

Description

Reproduce: https://zh.wikipedia.org/w/index.php?title=Special:%E6%90%9C%E7%B4%A2&limit=500&offset=0&profile=default&search=%E9%BB%84%E4%BB%A5%E4%BA%91&advancedSearch-current={}&ns0=1

There is an article with the exact name, but it is not in the first 500 results.

See also T152442: Exact title search for page in extra content namespace does not return that page in the first 500 results. This case is different, as the page is in the main namespace.

Details

	Subject	Repo	Branch	Lines +/-
	Bump the weight of near match for search	operations/mediawiki-config	master	+1 -2


Related Objects

Mentioned In: T260644: Handle conversion between traditional and simplified Chinese
T152442: Exact title search for page in extra content namespace does not return that page in the first 500 results
Mentioned Here: T245642: Increase the near match weight
T152442: Exact title search for page in extra content namespace does not return that page in the first 500 results

Event Timeline

Bugreporter created this task. Jul 14 2020, 1:11 PM

Restricted Application added subscribers: Stang, Aklapper. Jul 14 2020, 1:11 PM

Pinging @TJones
I believe the problem is similar to T245642, and I'd suggest increasing the near_match weight for zhwiki.

I checked and there doesn't seem to be anything weird going on with the query or page title name for 黄以云. I was worried that there could be a hidden traditional-to-simplified conversion that was causing title and query to get segmented differently, or something similar; but everything seems to be straightforward.

Should we consider increasing the near_match weight everywhere? Would that potentially solve T152442, too?

I'm fine with changing this for all other wikis. As for T152442, I'm not entirely sure. I think it may fix it for eswiki only, because LTR (learning to rank) is not enabled there; on wikis where LTR is enabled, the model will take precedence over the near_match weight.

dcausse edited projects, added Discovery-Search (Current work); removed Discovery-Search. Jul 15 2020, 3:52 PM

dcausse moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

CBogen set the point value for this task to 2. Jul 27 2020, 5:23 PM

Zbyszko claimed this task. Aug 5 2020, 6:22 AM

Zbyszko moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Change 618974 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[operations/mediawiki-config@master] Bump the weight of near match for search

https://gerrit.wikimedia.org/r/618974

gerritbot added a project: Patch-For-Review. Aug 7 2020, 12:39 PM

Zbyszko moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Aug 7 2020, 12:56 PM

Zbyszko moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. Aug 10 2020, 4:01 PM

Change 618974 merged by jenkins-bot:
[operations/mediawiki-config@master] Bump the weight of near match for search

https://gerrit.wikimedia.org/r/618974

Mentioned in SAL (#wikimedia-operations) [2020-08-10T18:04:47Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump the weight of near match for search (T257922) (duration: 00m 59s)

Maintenance_bot removed a project: Patch-For-Review. Aug 10 2020, 6:10 PM

Zbyszko moved this task from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board. Aug 11 2020, 6:13 AM

Gehel closed this task as Resolved. Aug 17 2020, 12:33 PM

Gehel mentioned this in T152442: Exact title search for page in extra content namespace does not return that page in the first 500 results.

In https://zh.wikipedia.org/w/index.php?search=%E9%82%93%E4%BA%A6%E6%AD%A6&title=Special%3A%E6%90%9C%E7%B4%A2&fulltext=1&ns0=1, the exact-match page is the 17th result, but it should be the first.

Exact matches should not necessarily be the first result. An example in English is "palin": the first result is politician Sarah Palin, the second is comedian Michael Palin, and the third is the exact match, Palin. Since popularity and other factors go into the ranking, it makes sense that the two most famous Palins could come before the generic Palin article. Other examples: Schwarzschild (4th); bowel (2nd).

Also, in the English results, a line above the first result says There is a page named "Palin" on Wikipedia, with a direct link. The Chinese results similarly show 在维基百科上已有名为“邓亦武”的页面 (automatic translation: There is already a page named "Deng Yiwu" on Wikipedia).

It is also easy in both cases to navigate directly to the page using the search box in the upper right corner, since exact matches are always the first suggestion there.

Chinese presents extra difficulties because breaking a string into words is harder than it is in English, for example. In this case, each character in 邓亦武 is treated as a separate word. As a result, other articles score better based on how common those individual "words" are overall and how often they occur in each article (along with article popularity and other factors). At least some Chinese searchers know this and artificially break their queries into words to improve the parsing. You can also use quotes, as in "邓亦武", in which case the exact title match article is the second result. This seems appropriate, since the 邓亦武 article is a stub, while the first result is the article for the company he recently became general manager of, and it is much better developed.
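To make the point about per-character tokens concrete, here is a toy sketch. It is not how CirrusSearch actually ranks results; the scoring function and the two placeholder documents are made up for illustration. It only shows why a longer article that mentions each character separately can outscore a stub with the exact title, and why a quoted (phrase) query behaves differently.

```python
from collections import Counter


def char_tokens(text):
    """Split text into single-character tokens, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def bag_of_chars_score(query, doc):
    """Toy score: how many times the query's characters occur in doc.

    Hypothetical stand-in for a real ranking function; it ignores
    character order, popularity, and everything else.
    """
    counts = Counter(char_tokens(doc))
    return sum(counts[ch] for ch in char_tokens(query))


def phrase_match(query, doc):
    """True only if the query characters appear contiguously and in order."""
    return query in doc


query = "邓亦武"
stub = "邓亦武"  # placeholder for a short article with the exact title
long_article = "邓某 亦然 武术 邓某 武术 亦然"  # placeholder text, characters scattered

print(bag_of_chars_score(query, stub))          # 3
print(bag_of_chars_score(query, long_article))  # 6
print(phrase_match(query, stub))                # True
print(phrase_match(query, long_article))        # False
```

With unordered single-character tokens the scattered article wins, while the phrase check only accepts the stub, which mirrors the effect of putting the query in quotes.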

There will always be specific search results that people disagree on, and even results that no one thinks are ideal. It is generally bad practice in search (and also generally ineffective) to try to optimize for specific individual queries, so we have to accept a few specific examples that aren't perfect.

If you think there is a general systematic failure to bring unambiguously good exact matches to the first page of results on Chinese Wikipedia, and that Chinese Wikipedia users don't generally use the upper right search box to navigate quickly to an exact-match page, and don't use quotes to improve ranking, then please open another ticket to that effect.

Unfortunately, such a problem would require a lot more investigation. We would first need to find out whether it is as bad as it seems, by looking at exact title searches in the upper right corner search box, the ranking of exact title matches in full-text search results, the use and effectiveness of quotes in exact title matches, and so on, to get a sense of the impact. Then we would need to uncover the underlying cause of any ranking failures, and finally devise an effective and hopefully general solution that doesn't break other searches on Chinese wikis specifically or on other wikis in general. Such a task would obviously require a lot more effort than what we did for this ticket, and would have to be prioritized separately.

Hope that all provides some insight into how we go about trying to improve search. The fix for this ticket was something of an easy win; in general, specific ranking problems can be a lot harder!

The reason I opened this task is that Mix'n'Match uses such search results.

I'm not familiar with Mix'n'Match, but I just looked at its page. If it is trying to match titles, it should not be using the full-text search. It should use the OpenSearch API (which is what powers the completion suggester in the upper right corner search box). It only matches against page titles, and exact matches are always at the top of the list.

Example searches for 黄以云 and 邓亦武 on zhwiki, and for Palin, Schwarzschild, and bowel on enwiki, all have the exact match as their first result.
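For anyone wiring this into a tool, a minimal sketch of a title lookup through the MediaWiki action=opensearch endpoint might look like the following. The helper names are hypothetical, the sample response body is hand-written for illustration rather than captured from the live API, and nothing here reflects how Mix'n'Match is actually implemented. The response shape assumed is the usual OpenSearch JSON array: [query, [titles], [descriptions], [urls]].

```python
import json
from urllib.parse import urlencode


def opensearch_url(host, query, limit=10, namespace=0):
    """Build an action=opensearch request URL for the given wiki host."""
    params = {
        "action": "opensearch",
        "search": query,
        "limit": limit,
        "namespace": namespace,
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


def parse_titles(body):
    """Return the list of suggested titles from an OpenSearch JSON body."""
    data = json.loads(body)
    return data[1]


def has_exact_title(body, wanted):
    """True if the wanted title is among the returned suggestions."""
    return wanted in parse_titles(body)


# Hand-written sample body (illustrative only, not real API output).
sample_body = json.dumps([
    "Palin",
    ["Palin", "Palindrome"],
    ["", ""],
    ["https://en.wikipedia.org/wiki/Palin",
     "https://en.wikipedia.org/wiki/Palindrome"],
])

print(opensearch_url("en.wikipedia.org", "Palin"))
print(parse_titles(sample_body))
print(has_exact_title(sample_body, "Palin"))
```

The URL can then be fetched with any HTTP client. Keeping the URL builder and the parser separate from the network call makes them easy to check offline.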

TJones mentioned this in T260644: Handle conversion between traditional and simplified Chinese. Aug 20 2020, 10:26 PM

Stang unsubscribed. Nov 7 2021, 5:14 PM
