
Standardize ASCII-folding/ICU-folding across analyzers
Closed, Resolved | Public | 8 Estimated Story Points

Description

User Story: As a multi-lingual searcher, I would like more consistency and predictability in how character folding works across wikis.

Some languages have ASCII folding disabled, some have it enabled, some have it enabled with the option to preserve the unfolded original; some upgrade ASCII folding (with or without preserving the original) to ICU folding.
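To make the difference between "folding" and "folding with preserve" concrete, here is a minimal Python sketch. It is only an approximation: `fold` strips combining marks after Unicode decomposition, which is much narrower than what the real asciifolding or icu_folding filters do, and both function names are hypothetical.

```python
import unicodedata


def fold(token: str) -> str:
    """Rough stand-in for folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_preserve(token: str) -> list[str]:
    """Emit the folded token and, if it differs, the unfolded original too."""
    folded = fold(token)
    return [folded] if folded == token else [folded, token]


print(fold("r\u00e9sum\u00e9"))           # folded form only
print(fold_preserve("r\u00e9sum\u00e9"))  # folded form plus the original
print(fold_preserve("cat"))               # nothing to fold, so one token
```

With preserve, an exact-match query can still find the accented form, at the cost of indexing an extra token.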

Acceptance Criteria:

  • Either an update to AnalysisConfigBuilder so that ASCII folding and ASCII folding with preserve are used more consistently, or a better understanding of why they should differ across languages.
  • Bonus: An easy mechanism to enable custom ICU folding for a given language code without having to create a full analysis config for that language. (This may already exist.)

Summary list of affected languages: Assamese (as), Azerbaijani (az), Crimean Tatar (crh), Greek (el), French (fr), Gagauz (gag), Gujarati (gu), Indonesian (id), Igbo (ig), Italian (it), Georgian (ka), Kazakh (kk), Khmer (km), Kannada (kn), Korean (ko), Malayalam (ml), Marathi (mr), Malay (ms), Mirandese (mwl), Burmese (my), Nepali (ne), Odia (or), Punjabi (pa), Polish (pl), Sinhala (si), Slovenian (sl), Albanian (sq), Swedish (sv), Swahili (sw), Tamil (ta), Telugu (te), Tagalog (tl), Tatar (tt), Uzbek (uz), Vietnamese (vi), Chinese (zh)

Event Timeline

TJones changed the point value for this task from 5 to 8. (Jul 17 2023, 3:50 PM)

Moving this back to the backlog in favor of a smaller next harmonization project.

TJones triaged this task as High priority. (Feb 26 2024, 3:03 PM)
TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.

Full write up on Mediawiki. In summary:

  1. English, Italian, French, Swedish, and Greek are wild. They can and should be tamed. Mostly we can and should blame The Ancient Ones.
  2. In general, icu_folding languages should use icu_folding for the text, text_search, and lowercase_keyword analyzers. plain should use "icu_folding_preserve", and plain_search needs no extra folding.
  3. dedup_asciifolding hasn't been necessary since Lucene 6.3—we're at 8.7. We can remove it—and declare victory! It got fixed in part because we noticed the problem and opened a ticket upstream.
  4. The problem with Greek is in part because we sometimes use asciifolding as a proxy for "we want icu_folding here." We should not do that. Refactor!
  5. A few languages (Chinese, Indonesian, Khmer, Korean, Polish) have no extra folding but should have had icu_folding applied when they were unpacked or otherwise customized. We could add icu_folding for others, too, though that may require new code.
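Item 2 can be summarized as a small lookup table. The sketch below is illustrative Python, not the CirrusSearch code (which is PHP); the dictionary and function names are made up for this example, and only the analyzer and filter names come from the summary above.

```python
from typing import Optional

# Which folding filter each analyzer gets for an icu_folding language,
# per item 2 above. None means "no extra folding".
FOLDING_BY_ANALYZER = {
    "text": "icu_folding",
    "text_search": "icu_folding",
    "lowercase_keyword": "icu_folding",
    "plain": "icu_folding_preserve",
    "plain_search": None,
}


def folding_filter_for(analyzer: str) -> Optional[str]:
    """Return the folding filter for an analyzer, or None if it gets none."""
    return FOLDING_BY_ANALYZER.get(analyzer)


for name in FOLDING_BY_ANALYZER:
    print(name, "->", folding_filter_for(name))
```

One way to read the asymmetry: if plain indexes both the folded and the original token, an unfolded plain_search query can still match the preserved original.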

This is a lot, and some of these changes touch a lot of the test fixtures in overlapping ways, so I'm breaking it up into at least one patch for each of the above. The first three patches are ready, and Gerrit willing, should be up soon.

Change #1062465 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 1

https://gerrit.wikimedia.org/r/1062465

Change #1062466 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 2

https://gerrit.wikimedia.org/r/1062466

Change #1062467 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 3

https://gerrit.wikimedia.org/r/1062467

Change #1062465 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 1

https://gerrit.wikimedia.org/r/1062465

Change #1062466 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 2

https://gerrit.wikimedia.org/r/1062466

Change #1062467 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 3

https://gerrit.wikimedia.org/r/1062467

Change #1064126 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 4

https://gerrit.wikimedia.org/r/1064126

Full Part 4—Refactoring & Analysis Notes are on Mediawiki.

Quick summary:

  • much refactoring; very wow!
    • automated the addition of remove_empty after icu_folding, which turned up one case where it had been missed and one where it was invoked twice.
    • refactored tests, too
  • quite a few languages do want asciifolding even when no upgrade to icu_folding is available; these have been made explicit and intentional, so third-party users without all the plugins we use will have better analysis chains
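As a rough illustration of the remove_empty automation mentioned above, the following Python sketch walks a filter chain, adds remove_empty after every icu_folding, and collapses back-to-back duplicates. The function name and the list-of-strings representation are assumptions made for this example; the actual refactoring lives in the PHP patch.

```python
def ensure_remove_empty(filters: list[str]) -> list[str]:
    """Guarantee exactly one remove_empty directly after each icu_folding."""
    out: list[str] = []
    for name in filters:
        # Skip a remove_empty that would immediately repeat the previous one.
        if name == "remove_empty" and out and out[-1] == "remove_empty":
            continue
        out.append(name)
        # Folding can leave an empty token behind, so clean up right after it.
        if name == "icu_folding":
            out.append("remove_empty")
    return out


print(ensure_remove_empty(["lowercase", "icu_folding"]))
print(ensure_remove_empty(["icu_folding", "remove_empty", "remove_empty"]))
```

Both the "missed" case and the "invoked twice" case come out as a single remove_empty following icu_folding.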

Change #1064126 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 4

https://gerrit.wikimedia.org/r/1064126

Apparently this should have been an (oxymoronic) mini-epic.

I've got a write-up on Mediawiki covering the changes needed for the first round of applying icu_folding to more languages.

Highlights:

  • 20½ languages will have ICU folding added: Chinese, Indonesian / Malay, Khmer, Korean, Polish; Mirandese, Azerbaijani, Crimean Tatar, Gagauz, Kazakh, Tatar; Vietnamese, Igbo, Swahili, Tagalog, Slovenian, Georgian, Tamil, Uzbek, and Albanian.
    • Indonesian / Malay and Swahili don't have any Latin diacritics that need protecting, so they get asciifolding, too.
    • Some languages that don't use the Latin alphabet still need some icu_folding exceptions.
      • icu_folding doesn't like viramas, but we can make it play nice.
    • Chinese is an outlier because the smartcn_tokenizer does weird things, but enabling "icu_folding_preserve" on the plain field is still a good thing.
    • Lots of languages that use the Latin alphabet (some in conjunction with the Cyrillic alphabet) need exceptions.
      • A few needed mappings for comma/cedilla confusion (similar to Romanian) for Șș, Țț vs Şş, Ţţ.
      • Vietnamese capital "d with stroke" Đđ looks a lot like capital "eth" Ðð... compensatory corrective mapping ensues.
  • This expands ICU folding coverage to the languages of the top 50 Wikipedias and 61 of the top 90 in my list (sorted by number of unique queries in a month), and to the languages of 65 Wikipedias overall.
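The two look-alike fixes in the highlights above amount to small character mappings applied before folding. Here is a hedged Python sketch using `str.translate`. The direction of the comma/cedilla mapping (cedilla to comma-below, as is conventional for Romanian) is an assumption and may differ per language in the actual patches; the table and function names are hypothetical.

```python
# Assumed direction: cedilla forms -> comma-below forms (Romanian style).
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
    "\u015F": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u0163": "\u021B",  # t with cedilla -> t with comma below
})

# Capital eth is easily mistyped for capital d with stroke in Vietnamese text.
ETH_TO_D_STROKE = str.maketrans({"\u00D0": "\u0110"})


def fix_comma_cedilla(text: str) -> str:
    return text.translate(CEDILLA_TO_COMMA)


def fix_vietnamese_eth(text: str) -> str:
    return text.translate(ETH_TO_D_STROKE)


print(fix_comma_cedilla("\u015Fi \u0163ar\u0103"))
print(fix_vietnamese_eth("\u00D0\u00E0 N\u1EB5ng"))
```

Only the capital eth is remapped here, because the lowercase eth and lowercase d with stroke are visually distinct.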

Patches for these changes to follow—though they are a little complicated, so don't hold your breath.

The second round will start by looking at 11 more Indic languages with Brahmic scripts.

Change #1069282 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5A

https://gerrit.wikimedia.org/r/1069282

Change #1069289 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5B

https://gerrit.wikimedia.org/r/1069289

Change #1069296 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5C

https://gerrit.wikimedia.org/r/1069296

Change #1069282 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5A

https://gerrit.wikimedia.org/r/1069282

Change #1069289 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5B

https://gerrit.wikimedia.org/r/1069289

Change #1069296 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5C

https://gerrit.wikimedia.org/r/1069296

Change #1074518 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] [WIP] Harmonize asciifolding and icu_folding--Part 6

https://gerrit.wikimedia.org/r/1074518

A full write up with details of the 11 languages using Indic scripts (Marathi, Burmese, Malayalam, Telugu, Sinhala, Kannada, Gujarati, Nepali, Assamese, Punjabi, and Odia) that are configured in this last patch is on Mediawiki.

Experience is something you don't get until just after you need it. —Steven Wright

Within the Indic abugidas, there are patterns—somewhat inconsistent patterns, but patterns nonetheless: ICU folding has a strong dislike for viramas; in some Indic scripts vowel signs are clobberized, in others they are untouched, so you gotta check; nuktas are generally okay to strip, though there can be specific exceptions; "composed" diacritics are sometimes written in pieces, and the number of fonts that render underlyingly different strings the same is a rough predictor of how likely icu_normalizer is to fix them; normalizing numerals is a good thing.
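Two of the patterns above are easy to poke at with Python's standard `unicodedata` module. The sketch below is not how ICU folding or icu_normalizer work internally; it only shows numeral normalization and a way to spot viramas (canonical combining class 9) so they could be exempted from stripping. Both function names are hypothetical.

```python
import unicodedata


def normalize_numerals(text: str) -> str:
    """Map any Unicode decimal digit to its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


def is_virama(ch: str) -> bool:
    """True for characters with canonical combining class 9 (Virama)."""
    return unicodedata.combining(ch) == 9


print(normalize_numerals("\u0968\u0966\u0968\u096A"))  # Devanagari digits
print(is_virama("\u094D"))                             # Devanagari virama
```

A check like `is_virama` is the kind of test one could use to decide which marks to protect from folding in a given script.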

With previous updates this adds support for customized ICU folding to 31 new languages, with ICU folding coverage for the languages of the top 50 Wikipedias and 72 of the top 90 in my list (by unique query volume), and for the languages of 76 Wikipedias overall.

If you are familiar with any of the languages above, and you think nuktas should be kept, or viramas should be stripped, or you can think of any other characters that should be considered equivalent, drop me a line... leave a comment here, open a new ticket, or contact me in any of the usual places.

I'll open tickets for future ICU folding upgrades and some of the incidental items I discovered along the way, but that's a task for next week, I think.

Change #1074518 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 6

https://gerrit.wikimedia.org/r/1074518