Autocomplete search results do not list exact match (disambiguation page) first
Closed, Declined | Public | BUG REPORT

Description

Search for the non-existent term "ebc" (lowercase).

What happens?:
The first hit is "EBCDIC".

What should have happened instead?:
The first search hit should have been the existing page "EBC" (uppercase).

Software version (skip for WMF-hosted wikis like Wikipedia):
English Wikipedia

Other information (browser name/version, screenshots, etc.):

Screenshot_20230101-182902.png (2×1 px, 308 KB)

Event Timeline

matmarex subscribed.

I think they usually are, for example "prc" gives "PRC" as the first result and not "PRC1" and other pages with that prefix.

I wonder if "EBC" is getting intentionally deprioritized in the results because it's a disambiguation page?

This has nothing to do with case sensitivity: both EBCDIC and EBC consist of capital letters only.

Aklapper renamed this task from "search results should be case insensitive" to "Automcomplete search results do not list exact match (disambiguation page) first". Jan 2 2023, 11:14 AM
Gehel subscribed.

Search is always about heuristics and trade-offs. In this case, the heuristic does not match the exact expectations. This probably happens only on short queries. As such, we will not invest time into optimizing for this specific case. This could also be fixed by redirects (as is already done in a lot of cases).

Feel free to disagree and re-open.

I agree with @Gehel, but wanted to add some details, answer some questions, and provide a specific workaround.

BTW, @matmarex, we don't deprioritize disambiguation pages in autocomplete results. Part of the reason that prc gives "China" as the first result is that someone added a redirect for Prc (there is some magic to capitalize the first letter of a query since titles always (or nearly always) have a capitalized first letter). If you use "random" capitalization (like, pRc) you don't get China/Prc/PRC as the first result—you get "PRC1". Typing PrC gives "PRC (disambiguation)", because of an exact case-matching redirect.
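To make the behavior above concrete, here is a minimal sketch of first-letter capitalization followed by an exact title/redirect lookup, with a popularity-ranked prefix match as the fallback. It is an illustration only, not the actual CirrusSearch code: the function names, the toy data, and the popularity scores are all made up, and only the redirect targets come from the examples in this comment.

```python
# Hypothetical toy data. Redirect targets follow the examples in the comment;
# the popularity scores are invented purely for illustration.
REDIRECTS = {"Prc": "China", "PrC": "PRC (disambiguation)"}
POPULARITY = {"China": 1000, "PRC1": 50, "PRC (disambiguation)": 10}


def ucfirst(query):
    """Capitalize only the first character, leaving the rest untouched."""
    return query[:1].upper() + query[1:]


def top_suggestion(query):
    """Return the first suggestion for a query (assumed logic, not the real one)."""
    title = ucfirst(query)
    # 1. An exact, case-sensitive redirect match wins the top spot.
    if title in REDIRECTS:
        return REDIRECTS[title]
    # 2. So does an exact, case-sensitive title match.
    if title in POPULARITY:
        return title
    # 3. Otherwise fall back to a case-insensitive prefix match ranked by popularity.
    prefix = query.casefold()
    candidates = [t for t in POPULARITY if t.casefold().startswith(prefix)]
    return max(candidates, key=POPULARITY.get, default=None)


print(top_suggestion("prc"))  # exact redirect "Prc"
print(top_suggestion("PrC"))  # exact redirect "PrC"
print(top_suggestion("pRc"))  # no exact match, falls back to prefix ranking
```

Under these assumptions, "pRc" becomes "PRc", which matches no title or redirect exactly, so the more popular prefix match "PRC1" takes the top spot.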

Allowing normalized queries to match for the purposes of "exact matching" for the top spot in autocomplete results would lead to ambiguities, especially in shorter queries. A, Å, Ä, Á, À, Â, Ã, etc. would all be competing for the top spot in a search for any of them, too. There are plenty of titles that differ only by accent (especially in other wikis, like Wiktionary), but also ones that differ only by case, e.g., Mcg and McG, ConFLiCT/Conflict, TWiT/Twit, etc. in English Wikipedia. Generally, doing partial matching and ranking by popularity works well, but nothing is 100%, and the shorter the query, the more ambiguity.
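As a rough illustration of that ambiguity, the sketch below folds case and strips accents from the titles mentioned above, then groups the titles that collapse to the same key. The `normalize` function is an assumed stand-in for whatever normalization would be applied. It is not the normalization the search backend actually uses.

```python
import unicodedata
from collections import defaultdict

# Titles taken from the examples in this comment.
TITLES = ["A", "Å", "Ä", "Á", "À", "Â", "Ã",
          "Mcg", "McG", "ConFLiCT", "Conflict", "TWiT", "Twit"]


def normalize(title):
    """Assumed normalization: strip combining accents, then fold case."""
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def collisions(titles):
    """Group titles by normalized key and keep only keys shared by 2+ titles."""
    groups = defaultdict(list)
    for title in titles:
        groups[normalize(title)].append(title)
    return {key: group for key, group in groups.items() if len(group) > 1}


for key, group in collisions(TITLES).items():
    print(key, "->", group)
```

Every group printed here is a set of distinct titles that would all claim to be the "exact match" for the same normalized query, so some further tie-breaking rule would still be needed.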

In general, handling case-only differences differently would likely be too expensive for autocomplete, which has to be super fast. (It is all done with a pre-built data structure held in memory; we can't make that structure vastly bigger—there isn't room in memory to hold it—or require accessing another data source—that would be too slow to be able to return results for almost every keystroke.)
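For intuition only, here is a toy stand-in for a pre-built, in-memory prefix structure: a sorted list of case-folded keys searched with `bisect`, with matches ranked by a popularity score. The real autocomplete data structure is not described in this thread and is certainly more sophisticated. The `PrefixIndex` class and the scores below are hypothetical.

```python
import bisect


class PrefixIndex:
    """Toy pre-built autocomplete index (hypothetical, for illustration)."""

    def __init__(self, scored_titles):
        # Built once, ahead of time: (case-folded key, original title, score),
        # sorted so that all keys sharing a prefix are contiguous.
        self._entries = sorted(
            (title.casefold(), title, score)
            for title, score in scored_titles.items()
        )
        self._keys = [entry[0] for entry in self._entries]

    def suggest(self, prefix, limit=10):
        """Return up to `limit` titles matching the prefix, most popular first."""
        key = prefix.casefold()
        start = bisect.bisect_left(self._keys, key)
        matches = []
        for folded, title, score in self._entries[start:]:
            if not folded.startswith(key):
                break  # past the contiguous run of matching keys
            matches.append((score, title))
        matches.sort(key=lambda match: -match[0])
        return [title for _, title in matches[:limit]]


# Invented popularity scores.
index = PrefixIndex({"EBC": 5, "EBCDIC": 90, "Ebola": 70})
print(index.suggest("ebc"))
```

With purely popularity-based ranking, the more popular "EBCDIC" comes before "EBC" for the query "ebc", which mirrors the behavior reported in this task. Promoting case-insensitive exact matches would require storing extra keys or consulting a second source, which is the cost discussed above.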

Also of note, searching for ebc gives the desired result here, because the algorithm for finding an "exact match" title for routing to an article can take a couple hundred extra milliseconds to do more normalization, and there are no other contenders.

@Fgnievinski, if this is a common problem for this very specific query, then adding a redirect from "Ebc" to "EBC" would create an exact match for the purposes of autocomplete. I know that's not ideal, but it isn't often necessary, so it has historically been a reasonable compromise.

I've tried creating a redirect "Ebc" to the existing dab "EBC":
https://en.wikipedia.org/w/index.php?title=Ebc&redirect=no
Now, searching for "Ebc" correctly produces EBC as the first hit.
However, searching for "ebc" still produces EBCDIC.
Is it a side effect of the inability to create pages with a lowercase initial?
Since lowercase is the default keyboard setting for most users,
I still think a denormalization would improve the user experience.
Would it be terribly expensive to create all-lowercase shadows of every term?
Although it'd double the memory requirements, it's still the same order of magnitude.

I'm not the autocomplete expert on the team (his work day is done and it's his evening time now), so I'm not sure what the current situation is because this is an unusual edge case. The autocomplete index is rebuilt daily, and it is unlikely that it has been rebuilt since you added the redirect. Obviously autocomplete needs to be able to find titles for recently added articles and redirects, but I just don't know how that is handled in the time before the index is rebuilt.

Let's give it ~24 hours and see if anything changes. It may not, but I think that it will. If not, I will try to get more info on what's happening in our team meeting on Wednesday.

Exact matches (case-sensitive ones) do hit the MySQL database and thus should be visible relatively soon. What you encountered is, I think, an effect of your browser cache. Thanks to your redirect, searching for ebc now returns EBC first, followed by EBCDIC.
As pointed out by Trey, making sure that "very close matches" always get ranked first might not be entirely trivial resource-wise, and we'd also have to agree on what a "very close match" is and what to do when it's highly ambiguous. This problem does not seem widespread enough to justify a dedicated solution for these particular cases, especially if more server resources are to be used or if search latencies are to be impacted.

Thanks, @dcausse! It's working for me today, too.

Thanks, all. I should have tested in an incognito tab.

Aklapper renamed this task from "Automcomplete search results do not list exact match (disambiguation page) first" to "Autocomplete search results do not list exact match (disambiguation page) first". Jan 10 2023, 8:38 PM