
Enable dotted_I_fix (almost?) everywhere
Closed, Resolved | Public | 5 Estimated Story Points

Assigned To
Authored By
TJones
Feb 26 2024, 3:31 PM
Referenced Files
F45702529: image.png
Apr 10 2024, 7:28 PM
F45701406: image.png
Apr 10 2024, 7:28 PM
F45701660: image.png
Apr 10 2024, 7:28 PM
F45700456: image.png
Apr 10 2024, 7:28 PM
F45700427: image.png
Apr 10 2024, 7:28 PM
F45700298: image.png
Apr 10 2024, 7:28 PM

Description

User story: As a searcher, I want words like Istanbul (common non-Turkish spelling) and İstanbul (Turkish spelling) to match when I search for one or the other.

Our custom analyzers that were unpacked generally use icu_normalizer instead of lowercase for lowercasing and other normalization. It converts İstanbul to i̇stanbul, which starts with a normal English lowercase i followed by an extra combining dot above (depending on your fonts, browser, and OS, the extra dot can be rendered above, next to, or invisibly on top of the regular dot on the i). The character filter dotted_I_fix is used to fix this in analyzers converted as part of the unpacking project, but some older unpacked analyzers do not use it. The default analysis chain also uses icu_normalizer without dotted_I_fix.
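
A rough sketch of the problem and the fix, in JavaScript rather than in the actual analysis chain. It assumes that dotted_I_fix amounts to mapping İ to I before normalization (an assumption about the filter, not confirmed in this task), and it uses toLowerCase() plus normalize('NFKC') as a simplified stand-in for icu_normalizer. The function names are made up for illustration.

```javascript
// Stand-in for icu_normalizer: locale-independent lowercasing plus NFKC.
// (The real filter does more; this is only an approximation.)
function roughNormalize(text) {
  return text.toLowerCase().normalize('NFKC');
}

// Hypothetical equivalent of dotted_I_fix: map İ (U+0130) to plain I
// before normalization, so no stray combining dot (U+0307) is produced.
function dottedIFix(text) {
  return text.replace(/\u0130/g, 'I');
}

const withoutFix = roughNormalize('İstanbul');
const withFix = roughNormalize(dottedIFix('İstanbul'));

console.log(withoutFix === 'istanbul'); // false: "i" + U+0307 + "stanbul"
console.log(withFix === 'istanbul');    // true
console.log(withFix === roughNormalize('Istanbul')); // true
```

With the mapping in place, Istanbul and İstanbul normalize to the same string, which is what lets them match in search.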

There are a small number of languages (mostly Turkic, it looks like) that distinguish I/ı and İ/i, and they should probably not use dotted_I_fix and should use Turkish lowercasing (which is the same as lowercase except for the İ/i and I/ı pairs) before icu_normalizer, like Turkish does.

It might make sense to also see whether there is an appreciable difference in speed between using Turkish lowercasing and a simple character filter that maps İ/i and I/ı before letting icu_normalizer do the rest. (In the past, we just turned on the Turkish variant of lowercase because it existed and it was easy, even though icu_normalizer still has to run to handle all the more interesting basic normalization.)
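
The character-filter alternative mentioned above could look roughly like this. It is a JavaScript illustration of the mapping only, not the Elasticsearch configuration. turkicPreMap, genericNormalize, and turkicNormalize are hypothetical names, and toLowerCase() plus normalize('NFKC') is a simplified stand-in for icu_normalizer.

```javascript
// Map only the two Turkic-specific pairs; everything else is left
// for the generic normalizer.
function turkicPreMap(text) {
  return text
    .replace(/\u0130/g, 'i')   // İ -> i
    .replace(/I/g, '\u0131');  // I -> ı
}

// Simplified stand-in for icu_normalizer (an approximation only).
function genericNormalize(text) {
  return text.toLowerCase().normalize('NFKC');
}

function turkicNormalize(text) {
  return genericNormalize(turkicPreMap(text));
}

console.log(turkicNormalize('İstanbul'));   // istanbul
console.log(turkicNormalize('DİYARBAKIR')); // diyarbakır
```

Whether this is measurably faster than Turkish lowercasing is exactly the open question in the paragraph above; the sketch only shows that the two-pair mapping is enough to get the Turkic behavior.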

Acceptance Criteria: Either dotted_I_fix or some form of İ/i and I/ı lowercasing is enabled everywhere icu_normalizer is enabled (with a few possible exceptions for language-specific analyzer components).

Event Timeline

TJones triaged this task as High priority.
TJones set the point value for this task to 5.

I prioritized this task to have a smaller task to work on as a break after the ginormous T332337 and T356643, and to have something more interruptible to work on while T342444 is running in the background.

The full write-up is on MediaWiki.

TL;DR: Figuring it all out was a little funky, but the final implementation was pretty straightforward, and the results are what we want for both Turkic and non-Turkic languages. There are a few rare corner cases, but that's always the case. Also, surprise bonus Norwegian!

Not getting automated tags for some reason, but this is included in 1.42.0-wmf.25, so it will be deployed soon.

Not sure if this task fixes that, lowercasing I and dotted I (İ) returns different lowercase letters:

image.png (90×360 px, 4 KB)

They have different lengths,

image.png (139×188 px, 3 KB)

but same char code. If you try to compare them, it will return false
image.png (165×178 px, 3 KB)

This is bad, for example, in azwiki, if you use uppercase dotted I in template parameter (which adds category to that page based on same parameter),

image.png (56×634 px, 8 KB)

image.png (67×735 px, 24 KB)

it will create a different category with a different lowercase dotted i:

image.png (145×420 px, 13 KB)

In T358495#9705136, @NMW03 wrote:

Not sure if this task fixes that, lowercasing I and dotted I (İ) returns different lowercase letters

This task will address the issue, but only in the context of on-wiki search. Wikis in Azerbaijani, Crimean Tatar, Gagauz, Kazakh, and Tatar will, like Turkish wikis, lowercase I to ı & İ to i when searching, so that a search for istanbul will match on-wiki text İstanbul. On most other wikis, I, İ, and i will all match, and on some wikis ı will also match the other three ("most" and "some" other wikis rather than "all" other wikis for annoying technical reasons).

image.png (90×360 px, 4 KB)

I'm not an expert on wiki magic words, but I think it would require either a new magic word, or a localization parameter for {{lc}} to get proper lowercasing for Turkic languages that use I/ı & İ/i.

but same char code. If you try to compare them, it will return false

image.png (165×178 px, 3 KB)

This looks like JavaScript. charCodeAt() takes a (zero-based) position parameter, which defaults to 0 when omitted, so your call points to the first letter of the string, which is i / 105 for both. However, the second character (at index 1) in the longer one is the combining dot (U+0307, decimal 775).

>   'İ'.toLowerCase().charCodeAt(1)
<   775
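
To see every code unit rather than just the first one, something like the following works in a browser console or Node. It is only a sketch, and codeUnits is a made-up helper name.

```javascript
// List the UTF-16 code units of a string, one number per unit.
function codeUnits(text) {
  return Array.from({ length: text.length }, (_, i) => text.charCodeAt(i));
}

const lowerDotted = 'İ'.toLowerCase(); // "i" followed by U+0307
const lowerPlain = 'I'.toLowerCase();  // "i"

console.log(codeUnits(lowerDotted));     // [ 105, 775 ]
console.log(codeUnits(lowerPlain));      // [ 105 ]
console.log(lowerDotted === lowerPlain); // false
```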

If your browser language is Azerbaijani (or Turkish, or one of the other Turkic languages), JavaScript's toLocaleLowerCase() might do the right thing, but it can be unreliable if you share code with people in other locales. For example, my browser language is English, so toLowerCase() and toLocaleLowerCase() do the same thing. Other programming languages might take a locale parameter and be more consistent. Java can do this, so you can say things like str=str.toLowerCase(new Locale("tr","TR")); to get Turkish/Turkic lowercasing.
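
In JavaScript, toLocaleLowerCase() also accepts an explicit locale tag, which avoids depending on the reader's browser language. A minimal sketch, assuming the runtime ships locale data for Turkish. The helper name turkicLower and the fallback mapping are illustrative additions, not something from this task.

```javascript
// Lowercase with Turkic rules (I -> ı, İ -> i), independent of the
// browser or system language.
function turkicLower(text, locale = 'tr') {
  // Feature check: runtimes without the needed locale data fall back
  // to the generic rules, so handle the two special pairs by hand.
  if ('I'.toLocaleLowerCase(locale) === '\u0131') {
    return text.toLocaleLowerCase(locale);
  }
  return text.replace(/\u0130/g, 'i').replace(/I/g, '\u0131').toLowerCase();
}

console.log(turkicLower('İSTANBUL'));      // istanbul
console.log(turkicLower('ISPARTA', 'az')); // ısparta
console.log('İSTANBUL'.toLowerCase() === 'istanbul'); // false
```

Passing the locale explicitly gives the same result for everyone who runs the code, which is the consistency the Java example above gets from its Locale argument.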

This is bad, for example, in azwiki, if you use uppercase dotted I in template parameter (which adds category to that page based on same parameter)

Yeah, I know what you are saying. Again, I only work on the search engine, and I'm doing stuff deep down in the guts of the search engine, not in MediaWiki.

I'd suggest opening a new ticket and adding the MediaWiki-Internationalization and I18n tags.

P.S.: Note that the changes for dotted I on Azerbaijani and other Turkic-language wikis will go live after we reindex the wikis, not immediately after this ticket is deployed, unfortunately. It may be a while. There's so much data out there, and it takes a while to do the reindexing, so we often bundle a few features that need reindexing together.

I will open a new task, thanks!