
Language search does not show some prefix matches
Closed, Resolved (Public)

Assigned To
Authored By
Pginer-WMF
Dec 22 2021, 2:29 PM

Description

Looking at the initial implementation of the language selector (T253303) available for Section Translation at test wiki, when searching for a source language the list of languages is not always showing the expected results. For example, searching for "Id" does not show "Ido" in the results.

Ido is in the list and it is shown in the results when searching for other terms such as "I" or "Ido", but not for "Id". This may be related to the way multiple search criteria are currently combined (since "ID" is the ISO code for Indonesian).

An example of the results shown in each case.

Searching for "I" suggests "Ido"
Searching for "id" does not show "Ido"
Searching for "Ido" shows "Ido" again
test.m.wikipedia.org_wiki_Special_ContentTranslation(iPhone 6_7_8) (2).png (750×1,334 px, 81 KB)
test.m.wikipedia.org_wiki_Special_ContentTranslation(iPhone 6_7_8) (3).png (750×1,334 px, 34 KB)
test.m.wikipedia.org_wiki_Special_ContentTranslation(iPhone 6_7_8) (4).png (750×1,334 px, 31 KB)

Another example (based on the case reported in T376863) is Hindi, where the search strings "Hi", "Hindi", and "Hindi " (note the trailing space) each lead to different results: "हिन्दी", "Fiji Hindi", or both are shown, depending on the case. The expected result is for both languages to be shown in all three cases.

Searching for "Hi" suggests "हिन्दी"
Searching for "Hindi" suggests "Fiji Hindi"
Searching for "Hindi " with trailing space shows "हिन्दी" and "Fiji Hindi"
test.m.wikipedia.org_w_index.php_title=Special_ContentTranslation&active-list=suggestions&from=es&to=en(Wiki Mobile).png (320×568 px, 9 KB)
test.m.wikipedia.org_w_index.php_title=Special_ContentTranslation&active-list=suggestions&from=es&to=en(Wiki Mobile) (2).png (320×568 px, 10 KB)
test.m.wikipedia.org_w_index.php_title=Special_ContentTranslation&active-list=suggestions&from=es&to=en(Wiki Mobile) (3).png (320×568 px, 11 KB)

Workflow steps for Google Chrome Recorder in F58323224

Event Timeline

Pginer-WMF triaged this task as Medium priority.
Pginer-WMF renamed this task from Search does not show some prefix matches to Language search does not show some prefix matches.
Dec 23 2021, 11:06 AM
Pginer-WMF moved this task from Backlog to Entry points on the SectionTranslation board.

@Pginer-WMF the language selector in the desktop dashboard is using the language API for the language search feature.

In the unified dashboard, the search is first performed on a local dataset and falls back to that same API if no results are found locally. The local search first tries to find an exact match for the ISO code and stops if it finds one. That explains why "id" returns Bahasa Indonesia and nothing else. I can change that behavior so that the local search continues with the autonyms and script names even if an ISO code match was found. That makes it a little better, but it doesn't help with the "Hindi" example above, as the local data is not as rich as what the API works with.
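To illustrate the early stop described above, here is a minimal sketch. It is not the actual ContentTranslation code: the `LanguageEntry` shape, the three-entry dataset, and the function name `localSearchStopOnCode` are all hypothetical and only serve to reproduce the "id" case.

```typescript
// Hypothetical shape of one entry in the local dataset (an assumption).
interface LanguageEntry {
  code: string; // language code, e.g. "id"
  autonym: string; // name of the language in that language
}

// Tiny illustrative dataset; the real local data is larger.
const localLanguages: LanguageEntry[] = [
  { code: "id", autonym: "Bahasa Indonesia" },
  { code: "io", autonym: "Ido" },
  { code: "it", autonym: "italiano" },
];

// An exact code match ends the search early; name prefixes are only
// checked when no code matched.
function localSearchStopOnCode(
  query: string,
  languages: LanguageEntry[]
): string[] {
  const q = query.toLowerCase();
  const exact = languages.find((lang) => lang.code === q);
  if (exact) {
    return [exact.code]; // stops here, so "Ido" is never considered for "id"
  }
  return languages
    .filter((lang) => lang.autonym.toLowerCase().startsWith(q))
    .map((lang) => lang.code);
}
```

With this sketch, `localSearchStopOnCode("Id", localLanguages)` returns only the Indonesian entry, while "I" and "Ido" both reach the name-prefix step and find Ido.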

Removing the local search steps completely and only using the API gives the best and most consistent results. Any insight on why the local search optimization was put in place or any reasons to keep it?
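For reference, an API-only lookup could look roughly like the sketch below. The parameter names (`action=languagesearch`, `search`) and the response shape (an object under `languagesearch` mapping language codes to matched names) reflect my understanding of the languagesearch API module and should be treated as assumptions. The helper names and the injected `fetchJson` function are hypothetical.

```typescript
// Assumed response shape: language codes mapped to the name that matched.
interface LanguageSearchResponse {
  languagesearch?: Record<string, string>;
}

// Builds the request URL; apiBase would be something like
// "https://test.wikipedia.org/w/api.php".
function buildLanguageSearchUrl(apiBase: string, query: string): string {
  const params: Array<[string, string]> = [
    ["action", "languagesearch"],
    ["format", "json"],
    ["formatversion", "2"],
    ["search", query],
  ];
  const queryString = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `${apiBase}?${queryString}`;
}

// Extracts the matched language codes, tolerating an empty response.
function extractLanguageCodes(response: LanguageSearchResponse): string[] {
  return Object.keys(response.languagesearch ?? {});
}

// fetchJson is injected (hypothetical) so the sketch does not depend on a
// particular HTTP client.
async function searchLanguages(
  apiBase: string,
  query: string,
  fetchJson: (url: string) => Promise<unknown>
): Promise<string[]> {
  const response = (await fetchJson(
    buildLanguageSearchUrl(apiBase, query)
  )) as LanguageSearchResponse;
  return extractLanguageCodes(response);
}
```

Every query goes through the same code path here, which is what makes the results consistent between "id", "Hindi", and similar inputs.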

Change #1117218 had a related patch set uploaded (by Sbisson; author: Sbisson):

[mediawiki/extensions/ContentTranslation@master] Include code substring in local language search

https://gerrit.wikimedia.org/r/1117218

Change #1117220 had a related patch set uploaded (by Sbisson; author: Sbisson):

[mediawiki/extensions/ContentTranslation@master] Use languagesearch api directly instead of local search

https://gerrit.wikimedia.org/r/1117220

NOTE: There are two different approaches that are proposed to improve the language search. The patches are MUTUALLY EXCLUSIVE.

Change #1117218 abandoned by Nik Gkountas:

[mediawiki/extensions/ContentTranslation@master] Include code substring in local language search

Reason:

Abandoning this, as we merged the conflicting solution provided in Ie0e940feaf51af78384322f0edd0da6871a8c1f9

https://gerrit.wikimedia.org/r/1117218

Change #1117220 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] Use languagesearch api directly instead of local search

https://gerrit.wikimedia.org/r/1117220

Change #1118478 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20250210

https://gerrit.wikimedia.org/r/1118478

> @Pginer-WMF the language selector in the desktop dashboard is using the language API for the language search feature.
>
> In the unified dashboard, the search is first performed on a local dataset and goes to that same API if no results were found locally. The local search first tries to find an exact match for the ISO code and stops if it finds it. That explains why "id" returns Bahasa Indonesia and nothing else. I can change that behavior to continue the local search with the autonyms and script names even if an ISO code match was found and it gets a little better but it doesn't help with the "Hindi" example above as the local data is not as rich as what the API is working with.

If the local search is used, continuing for additional results makes sense. We may want to support cases where the user intentionally types an ISO code, but also those where the user is looking for a language whose initial characters happen to be the ISO code of an unrelated one.
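A minimal sketch of that combined behavior follows, assuming a hypothetical dataset where each language has a code and a list of names. Matching on the start of any word in a name and trimming the query are my own assumptions, chosen so that "Hindi" and "Hindi " both find "Fiji Hindi". This is not the merged implementation.

```typescript
// Hypothetical entry: a code plus the names to search (autonym, English name).
interface NamedLanguage {
  code: string;
  names: string[];
}

// Illustrative data only.
const sampleLanguages: NamedLanguage[] = [
  { code: "id", names: ["Bahasa Indonesia", "Indonesian"] },
  { code: "io", names: ["Ido"] },
  { code: "hi", names: ["हिन्दी", "Hindi"] },
  { code: "hif", names: ["Fiji Hindi"] },
];

// Returns exact code matches first, then languages with a name containing
// a word that starts with the query. The search never stops early.
function searchCodesAndNames(
  query: string,
  languages: NamedLanguage[]
): string[] {
  const q = query.trim().toLowerCase();
  if (q === "") {
    return [];
  }
  const exactCode: string[] = [];
  const byName: string[] = [];
  for (const lang of languages) {
    if (lang.code.toLowerCase() === q) {
      exactCode.push(lang.code);
    } else if (
      lang.names.some((name) =>
        name
          .toLowerCase()
          .split(/\s+/)
          .some((word) => word.startsWith(q))
      )
    ) {
      byName.push(lang.code);
    }
  }
  return [...exactCode, ...byName];
}
```

With the sample data, "id" yields Indonesian followed by Ido, and "Hi", "Hindi", and "Hindi " all yield both Hindi and Fiji Hindi.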

> Removing the local search steps completely and only using the API gives the best and most consistent results. Any insight on why the local search optimization was put in place or any reasons to keep it?

I don't have much more context about this particular optimization. If it is possible to find languages by both name and ISO code, I don't have a strong preference about where the code is running.

Change #1118478 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20250210

https://gerrit.wikimedia.org/r/1118478

"id" shows results including both Indonesian and Ido
"Hindi" shows both "Hindi" and "Fiji Hindi"
test.wikipedia.org_w_index.php_title=Special_ContentTranslation&filter-type=automatic&filter-id=previous-edits&active-list=suggestions&from=es&to=id(Wiki Mobile).png (320×568 px, 21 KB)
test.wikipedia.org_w_index.php_title=Special_ContentTranslation&filter-type=automatic&filter-id=previous-edits&active-list=suggestions&from=es&to=id(Wiki Mobile) (1).png (320×568 px, 14 KB)