
U+002C comma is not being excluded by default in simple search input box for CirrusSearch
Open, Medium, Public

Description

(Lydia asked that I write this up, just in case)

I thought that the comma (",") was already among the characters dropped by the Elasticsearch standard tokenizer, and would therefore be excluded from simple search.
But it seems there is some overriding decision to configure the default differently on Wikidata. Perhaps the word_delimiter filter is being used, and used incorrectly?

The Elasticsearch documentation for the word_delimiter filter says:

Avoid using the word_delimiter filter with tokenizers that remove punctuation, such as the standard tokenizer. This could prevent the word_delimiter filter from splitting tokens correctly. It can also interfere with the filter’s configurable parameters, such as catenate_all or preserve_original. We recommend using the keyword or whitespace tokenizer instead.
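
To make the difference in the passage above concrete, here is a minimal sketch in plain Python. It does not call Elasticsearch or CirrusSearch. The function names are hypothetical, the regex only approximates a punctuation-dropping tokenizer for simple ASCII input, and the lowercasing step is an assumption about the analysis chain.

```python
import re


def standard_like_tokens(text):
    """Rough stand-in for a tokenizer that drops punctuation.

    Assumption: the real standard tokenizer follows Unicode word-break
    rules; this regex only mimics it for simple ASCII input.
    """
    return [t.lower() for t in re.findall(r"\w+", text)]


def whitespace_tokens(text):
    """Stand-in for a whitespace tokenizer: punctuation stays attached."""
    return [t.lower() for t in text.split()]


def all_terms_match(query_tokens, label_tokens):
    """True if every query token appears verbatim among the label tokens."""
    return all(t in label_tokens for t in query_tokens)


label = "Foot Locker, Inc."
query = "foot locker inc"

# Punctuation dropped on both sides: the comma no longer matters.
dropped = all_terms_match(standard_like_tokens(query), standard_like_tokens(label))

# Punctuation kept: the label yields "locker," and "inc.", so the
# comma-free query tokens "locker" and "inc" find no exact match.
kept = all_terms_match(whitespace_tokens(query), whitespace_tokens(label))

print(dropped, kept)
```

If the label is analyzed with punctuation still attached and no later filter removes it, the behaviour reported in this task (a match only when the comma is typed) is what one would expect to see.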

As my screenshot below shows, I was looking for entities that contain all three words, but if I did NOT include the comma, the entity was not found.
It was displayed only when I did include the comma.

search_dropdown_screenshot.png (436×807 px, 22 KB)

I noticed that the string "foot locker inc" does not show the entity in the dropdown; only "foot locker, inc.", which includes the comma, does.
Exact matching should happen by default only if the user wraps the query in double quotes, such as

"Foot Locker, Inc."

whereas in my example screenshot I have to include the comma to find the entity. My expectation was that any U+002C comma in the search string would be dropped from the search query.
(I have since added the full legal name to that entity's alias field to improve searchability, but I would still like to know why U+002C comma is not being excluded.)
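
The expectation described above can be sketched as a comparison between a raw prefix match and a comma-insensitive one. This is only an illustration: it assumes the dropdown performs a prefix match against labels, and `normalize` and `prefix_match` are hypothetical helpers, not CirrusSearch or Wikibase functions.

```python
def normalize(text):
    """Replace U+002C with a space, collapse whitespace, lowercase."""
    return " ".join(text.replace("\u002c", " ").lower().split())


def prefix_match(query, label, ignore_commas):
    """Check whether the query is a prefix of the label.

    With ignore_commas=True both sides are normalized first, which is
    the behaviour the reporter expected by default.
    """
    if ignore_commas:
        return normalize(label).startswith(normalize(query))
    return label.lower().startswith(query.lower())


label = "Foot Locker, Inc."

# Observed behaviour: only the query that includes the comma matches.
observed_without_comma = prefix_match("foot locker inc", label, ignore_commas=False)
observed_with_comma = prefix_match("foot locker, inc.", label, ignore_commas=False)

# Expected behaviour: the comma-free query matches as well.
expected_without_comma = prefix_match("foot locker inc", label, ignore_commas=True)
```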

Why was it decided that U+002C comma should be included in simple search?
Must users use the Advanced Search on Wikidata or the API if they want to do simple searches that are not exact-match phrases? Having to do something advanced in order to do something simple seems counter-intuitive and the reverse of most users' expectations.

Event Timeline

Restricted Application added a subscriber: Aklapper.

This should be evaluated alongside T237645, I think, as both tickets involve the same kind of modifications to the analysis chains.

MPhamWMF triaged this task as Medium priority. Aug 23 2021, 3:55 PM
MPhamWMF moved this task from needs triage to Wikibase Search on the Discovery-Search board.