
Inlabel search results don't show the best matching alias in match data
Closed, Resolved | Public | 3 Estimated Story Points | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • create an item with multiple aliases that share at least one word, e.g. "The Potato", "Sweet Potato" (Q16176689)
  • search for the second alias "Sweet Potato"

What happens?:

  • it responds with a search result saying that the alias "The Potato" was matched

What should have happened instead?:

  • it responds with a search result saying that the alias "Sweet Potato" was matched

This problem only occurs when part of the search term matches multiple aliases, i.e. the correct alias would be returned as the match if the word "potato" didn't appear in both of them.

It seems that this is treated as a match across multiple values, and the match (highlight) that is returned is the first alias containing a word from the search term, even if another alias is a better match. In the example, when searching for "sweet potato", the Elasticsearch response contains only the following value in the highlight field: 8:12-18:18|the potato.
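To make the difference concrete, here is a minimal Python sketch of the two selection strategies. It is only a toy model of the behaviour described above, not the actual highlighter logic: the function names and the word-overlap scoring are assumptions made for illustration.

```python
import re
from typing import List, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercase word tokens; a crude stand-in for a real analyzer."""
    return set(re.findall(r"\w+", text.lower()))


def first_matching_alias(query: str, aliases: List[str]) -> Optional[str]:
    """Observed behaviour (simplified): first alias sharing any query word."""
    q = tokens(query)
    for alias in aliases:
        if q & tokens(alias):
            return alias
    return None


def best_matching_alias(query: str, aliases: List[str]) -> Optional[str]:
    """Expected behaviour (simplified): alias sharing the most query words."""
    q = tokens(query)
    candidates = [a for a in aliases if q & tokens(a)]
    if not candidates:
        return None
    # max() keeps the first alias on ties, so index order is the tie-breaker.
    return max(candidates, key=lambda a: len(q & tokens(a)))


aliases = ["The Potato", "Sweet Potato"]
print(first_matching_alias("sweet potato", aliases))
print(best_matching_alias("sweet potato", aliases))
```

With the aliases from Q16176689, the first strategy picks "The Potato" and the second picks "Sweet Potato".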

Production example:
At the time of writing (2025-08-07), the behavior outlined above can be observed in the following request: https://www.wikidata.org/w/rest.php/wikibase/v0/search/items?language=en&q=sweet%20potato. It contains an entry for Q16176689 with "match": { "type": "alias", "language": "en", "text": "The Potato" } despite the item having another alias "Sweet Potato" which would be a better match.

Event Timeline

Jakob_WMDE renamed this task from Simple search results show the wrong alias in match data to Simple search results don't show the best matching alias in match data. Jun 5 2025, 4:20 PM
Jakob_WMDE updated the task description.
pfischer set the point value for this task to 3. Jun 23 2025, 3:48 PM

The problem here is that we are primarily highlighting against the text field, which is a wide variety of data stuffed together into a single string. The highlighter doesn't know it should treat this as many different strings and pick between them. It highlights it as if it were highlighting paragraphs of content. We are doing some post-processing on the PHP side to turn that highlighted text field into something more presentable, but it's always going to be hacky trying to solve this problem there.

At a general level, I suspect the solution is to change our text content from:

"text": "감자
Potatoes
Terpomoj
김동인의 단편소설
Korean-language short story by Kim Dong-in
korelingva novelo de Kim Dong-in
The Potato
Sweet Potato
감자 (소설)
감자"

into the following, which will allow the highlighter to score the lines individually and then return the highest ranked ones (the same way we index redirects, categories, etc):

"text": [
    "감자",
    "Potatoes",
    "Terpomoj",
    "김동인의 단편소설",
    "Korean-language short story by Kim Dong-in",
    "korelingva novelo de Kim Dong-in",
    "The Potato",
    "Sweet Potato",
    "감자 (소설)",
    "감자"
]

It's not clear yet what the best way to implement this is. A few options:

  • We could adjust the wikibase content handler to return an array of strings, instead of a single large string with everything concatenated together. I suspect threading this through will be a little tedious, but haven't looked closely yet.
  • We could add a feature flag that splits text content on \n when indexing and turn that on for wikibase instances
  • We could skip the feature flag and make splitting text content on \n the new default behaviour
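
The second and third options can be sketched roughly as follows. This is Python for illustration only (the real change would live in the PHP indexing code); the function name and the split_on_newline flag are hypothetical, and dropping empty lines is an assumption rather than something decided above.

```python
from typing import List, Union


def prepare_text_field(text: str, split_on_newline: bool) -> Union[str, List[str]]:
    """Return the value to index for the text field.

    With the (hypothetical) flag off, the concatenated string is kept as-is.
    With it on, each line becomes its own array entry so a highlighter can
    score the entries individually.
    """
    if not split_on_newline:
        return text
    # Assumption: blank lines carry no information and are dropped.
    return [line for line in text.split("\n") if line.strip()]


doc_text = "Potatoes\nKorean-language short story by Kim Dong-in\nThe Potato\nSweet Potato"
print(prepare_text_field(doc_text, split_on_newline=True))
```

Making the split the default (the third option) amounts to removing the flag and always taking the second branch.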

The effect of splitting text content on \n will include:

  • highlights will no longer match across \n. This is probably desirable; I suspect users are surprised when they search for a quoted phrase and it matches the last word in one paragraph and the first word of the next. The highlights might also be more coherent if they don't cross paragraphs.
  • phrase queries will no longer match across \n. This is also probably desirable, for the same reason regarding end of one paragraph and the beginning of the next.
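
The phrase-query effect can be illustrated with a toy positional index in Python. It assumes the field keeps Elasticsearch's default position_increment_gap of 100 between array entries; whether the text field mapping actually uses that default has not been checked here, and the helper names are made up for the example.

```python
import re
from typing import Dict, List

# Elasticsearch's default position_increment_gap for text fields.
POSITION_GAP = 100


def index_positions(values: List[str]) -> Dict[int, str]:
    """Map token position -> token, leaving a gap between array entries."""
    positions: Dict[int, str] = {}
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += POSITION_GAP
        for tok in re.findall(r"\w+", value.lower()):
            positions[pos] = tok
            pos += 1
    return positions


def phrase_matches(phrase: str, values: List[str]) -> bool:
    """True if the phrase tokens occur at consecutive positions."""
    wanted = re.findall(r"\w+", phrase.lower())
    if not wanted:
        return False
    by_pos = index_positions(values)
    return any(
        all(by_pos.get(start + k) == w for k, w in enumerate(wanted))
        for start in by_pos
    )


# One concatenated string: "potato sweet" matches across the line break.
print(phrase_matches("potato sweet", ["The Potato\nSweet Potato"]))
# Array of lines: the gap prevents the cross-line match.
print(phrase_matches("potato sweet", ["The Potato", "Sweet Potato"]))
```

In this model a phrase that lies entirely within one line, such as "sweet potato", still matches in both layouts.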

I'm leaning towards making this the default everywhere, but will have to poke around some more and get a few people's opinions.

@Jakob_WMDE can you clarify whether this problem relates to Special:Search or the new inlabel search? If the latter, could you update the description to add more precise info on how to reproduce this?

Jakob_WMDE renamed this task from Simple search results don't show the best matching alias in match data to Inlabel search results don't show the best matching alias in match data. Aug 7 2025, 1:02 PM
Jakob_WMDE updated the task description.

@Jakob_WMDE can you clarify whether this problem relates to Special:Search or the new inlabel search? If the latter, could you update the description to add more precise info on how to reproduce this?

Done, hope this clarifies it!

The InLabelSearch relies on EntityElasticTermResult, which uses the return_snippets_and_offsets option; this is useful for determining whether the match is a label or an alias.
EntityElasticTermResult was initially designed with completion in mind, where matches start at the beginning of the string and no match is really better than another, except that you might prefer matches in the user's languages.
For tokenized fields this is different: we might want to tune the highlighter to score and re-order the snippets using "order": "score".
Unfortunately, while this might solve this particular issue, it won't choose the best match across all languages, only within the first language that has a match. Say you search in French and get a poor match on a French alias: if there's a better match in English, it won't be attempted because of the skip_if_last_matched option.
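
The limitation can be sketched with a small Python model. It is a simplification under stated assumptions, not the highlighter's real algorithm: pick_snippet and the word-overlap score are invented for illustration, the early return stands in for the effect attributed to skip_if_last_matched above, and the French alias is hypothetical test data.

```python
import re
from typing import Dict, List, Optional, Tuple


def overlap(query: str, text: str) -> int:
    """Number of distinct query words that also appear in the text."""
    q = set(re.findall(r"\w+", query.lower()))
    return len(q & set(re.findall(r"\w+", text.lower())))


def pick_snippet(
    query: str,
    terms_by_lang: Dict[str, List[str]],
    lang_order: List[str],
) -> Optional[Tuple[str, str]]:
    """Best-scoring term within the first language that has any match."""
    for lang in lang_order:
        scored = [(overlap(query, t), t) for t in terms_by_lang.get(lang, [])]
        matches = [m for m in scored if m[0] > 0]
        if matches:
            # Analogue of "order": "score": best snippet within this language.
            best = max(matches, key=lambda m: m[0])
            # Analogue of skip_if_last_matched: later languages are not tried.
            return lang, best[1]
    return None


terms = {
    "fr": ["Potato (nouvelle)"],  # hypothetical French alias
    "en": ["The Potato", "Sweet Potato"],
}
print(pick_snippet("sweet potato", terms, ["fr", "en"]))
print(pick_snippet("sweet potato", terms, ["en", "fr"]))
```

When French is tried first, the partial French match is returned even though the English alias "Sweet Potato" matches both query words; ordering by score only helps within a single language.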

Change #1178005 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/WikibaseCirrusSearch@master] Order snippets by score when highlighting labels on .plain

https://gerrit.wikimedia.org/r/1178005

Change #1178005 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Order snippets by score when highlighting labels on .plain

https://gerrit.wikimedia.org/r/1178005