Standardize ASCII-folding/ICU-folding across analyzers
Closed, Resolved; Public; 8 Estimated Story Points
Assigned To

Authored By

	TJones
	Mar 16 2023, 6:12 PM

Description

User Story: As a multi-lingual searcher, I would like more consistency and predictability in how character folding works across wikis.

Some languages have ASCII folding disabled, some have it enabled, some have it enabled with the option to preserve the unfolded original; some upgrade ASCII folding (with or without preserving the original) to ICU folding.
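To make the difference between plain folding and folding that preserves the original concrete, here is a minimal Python sketch. It only approximates folding by stripping combining marks after NFD normalization; the real asciifolding and icu_folding filters cover far more characters. The function name `fold_token` and its `preserve_original` flag are illustrative assumptions, loosely modeled on the asciifolding filter's preserve-original option.

```python
import unicodedata


def fold_token(token, preserve_original=False):
    """Approximate diacritic folding for one token (illustrative only).

    Returns a list of emitted tokens: the folded form, plus the original
    form when preserve_original is set and folding changed something.
    """
    decomposed = unicodedata.normalize("NFD", token)
    # Drop combining marks (accents and similar) left after decomposition.
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if preserve_original and folded != token:
        return [folded, token]
    return [folded]
```

With preservation on, a search for the accented form can still match exactly, while the folded form keeps recall for unaccented queries.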

Acceptance Criteria:

Either an update to AnalysisConfigBuilder so that ASCII folding and ASCII folding with preserve are used more consistently, or a better understanding of why they should differ across languages.
Bonus: An easy mechanism to enable custom ICU folding for a given language code without having to create a full analysis config for that language. (This may already exist.)

Summary list of affected languages: Assamese (as), Azerbaijani (az), Crimean Tatar (crh), Greek (el), French (fr), Gagauz (gag), Gujarati (gu), Indonesian (id), Igbo (ig), Italian (it), Georgian (ka), Kazakh (kk), Khmer (km), Kannada (kn), Korean (ko), Malayalam (ml), Marathi (mr), Malay (ms), Mirandese (mwl), Burmese (my), Nepali (ne), Odia (or), Punjabi (pa), Polish (pl), Sinhala (si), Slovenian (sl), Albanian (sq), Swedish (sv), Swahili (sw), Tamil (ta), Telugu (te), Tagalog (tl), Tatar (tt), Uzbek (uz), Vietnamese (vi), Chinese (zh)

Details

Subject	Repo	Branch	Lines +/-
Harmonize asciifolding and icu_folding--Part 6	mediawiki/extensions/CirrusSearch	master	+6 K -20
Harmonize asciifolding and icu_folding--Part 5C	mediawiki/extensions/CirrusSearch	master	+394 -183
Harmonize asciifolding and icu_folding--Part 5B	mediawiki/extensions/CirrusSearch	master	+601 -127
Harmonize asciifolding and icu_folding--Part 5A	mediawiki/extensions/CirrusSearch	master	+4 K -13
Harmonize asciifolding and icu_folding--Part 4	mediawiki/extensions/CirrusSearch	master	+747 -3 K
Harmonize asciifolding and icu_folding--Part 3	mediawiki/extensions/CirrusSearch	master	+98 -737
Harmonize asciifolding and icu_folding--Part 2	mediawiki/extensions/CirrusSearch	master	+111 -1 K
Harmonize asciifolding and icu_folding--Part 1	mediawiki/extensions/CirrusSearch	master	+51 -123

Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	TJones	T332342 Standardize ASCII-folding/ICU-folding across analyzers
Resolved	TJones	T375557 Reindex all wikis to enable folding harmonization and new functionality

Event Timeline

TJones created this task. Mar 16 2023, 6:12 PM

TJones mentioned this in T219550: [EPIC] Harmonize language analysis across languages.

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search. Mar 20 2023, 4:27 PM

• MPhamWMF set the point value for this task to 5. Apr 10 2023, 3:49 PM

• MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

TJones changed the point value for this task from 5 to 8. Jul 17 2023, 3:50 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board. Jul 31 2023, 7:03 PM

TJones claimed this task. Jul 31 2023, 8:44 PM

Moved back to ready for dev while working on T346051

TJones mentioned this in T346051: Refactor slow global analysis components. Sep 27 2023, 5:02 PM

Moving this back to the backlog in favor of a smaller next harmonization project.

TJones triaged this task as High priority. Feb 26 2024, 3:03 PM

TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search. Jul 8 2024, 3:19 PM

TJones claimed this task. Jul 16 2024, 6:35 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Full write up on Mediawiki. In summary:

English, Italian, French, Swedish, and Greek are wild. They can and should be tamed. Mostly we can and should blame The Ancient Ones.
In general, icu_folding languages should use icu_folding for the text, text_search, and lowercase_keyword analyzers. plain should use "icu_folding_preserve", and plain_search needs no extra folding.
dedup_asciifolding hasn't been necessary since Lucene 6.3—we're at 8.7. We can remove it—and declare victory! It got fixed in part because we noticed the problem and opened a ticket upstream.
The problem with Greek is in part because we sometimes use asciifolding as a proxy for "we want icu_folding here." We should not do that. Refactor!
A few languages (Chinese, Indonesian, Khmer, Korean, Polish) have no extra folding but should have had icu_folding applied when they were unpacked or otherwise customized. We could add icu_folding for others, too, though that may require new code.
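The per-analyzer rule in the summary above can be sketched as a small lookup table. This is a hedged illustration in Python, not the actual AnalysisConfigBuilder code (which is PHP); the names `FOLDING_BY_ANALYZER` and `folding_filter_for` are hypothetical, while the analyzer and filter names come from the summary.

```python
from typing import Optional

# Which folding filter each analyzer gets for an "icu_folding language",
# per the summary above. None means no extra folding is applied.
FOLDING_BY_ANALYZER = {
    "text": "icu_folding",
    "text_search": "icu_folding",
    "lowercase_keyword": "icu_folding",
    "plain": "icu_folding_preserve",
    "plain_search": None,
}


def folding_filter_for(analyzer: str) -> Optional[str]:
    """Return the folding filter for an analyzer, or None if it gets none."""
    return FOLDING_BY_ANALYZER.get(analyzer)
```

Keeping the rule in one table is one way to avoid using asciifolding as an implicit stand-in for "we want icu_folding here".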

This is a lot, and some of these changes touch a lot of the test fixtures in overlapping ways, so I'm breaking it up into at least one patch for each of the above. The first three patches are ready, and Gerrit willing, should be up soon.

Change #1062465 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 1

https://gerrit.wikimedia.org/r/1062465

Change #1062466 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 2

https://gerrit.wikimedia.org/r/1062466

Change #1062467 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 3

https://gerrit.wikimedia.org/r/1062467

Change #1062465 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 1

https://gerrit.wikimedia.org/r/1062465

ReleaseTaggerBot added a project: MW-1.43-notes (1.43.0-wmf.19; 2024-08-20). Aug 15 2024, 9:00 PM

Change #1062466 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 2

https://gerrit.wikimedia.org/r/1062466

Change #1062467 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 3

https://gerrit.wikimedia.org/r/1062467

Maintenance_bot removed a project: Patch-For-Review. Aug 16 2024, 8:30 PM

Change #1064126 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 4

https://gerrit.wikimedia.org/r/1064126

gerritbot added a project: Patch-For-Review. Aug 20 2024, 10:20 PM

Full Part 4—Refactoring & Analysis Notes are on Mediawiki.

Quick summary:

much refactoring; very wow!
- automated addition of remove_empty after icu_folding and found a case where it was missed, and a case where it was invoked twice.
- refactored tests, too
quite a few languages do want asciifolding even when no upgrade to icu_folding is available; these have been made explicit and intentional, so third-party users without all the plugins we use will have better analysis chains
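As a rough illustration of the automated remove_empty step mentioned above, the following Python sketch walks a token filter chain, adds remove_empty after every icu_folding, and avoids invoking it twice in a row. The function name `ensure_remove_empty` and the list-of-strings representation are assumptions for this example; the real refactoring lives in CirrusSearch's PHP code.

```python
def ensure_remove_empty(filters):
    """Return a filter chain with remove_empty right after each icu_folding.

    A remove_empty that would directly follow another remove_empty is
    skipped, so the filter is never applied twice in a row.
    """
    result = []
    for name in filters:
        if name == "remove_empty" and result and result[-1] == "remove_empty":
            continue  # already present immediately before; skip the repeat
        result.append(name)
        if name == "icu_folding":
            result.append("remove_empty")
    return result
```

For example, a chain that already has remove_empty after icu_folding comes back unchanged, while a chain missing it gains exactly one.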

Change #1064126 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 4

https://gerrit.wikimedia.org/r/1064126

Apparently this should have been an (oxymoronic) mini-epic.

I've got a write up on Mediawiki for the changes needed to the first round of applying icu_folding to more languages.

Highlights:

20½ languages will have ICU folding added: Chinese, Indonesian / Malay, Khmer, Korean, Polish; Mirandese, Azerbaijani, Crimean Tatar, Gagauz, Kazakh, Tatar; Vietnamese, Igbo, Swahili, Tagalog, Slovenian, Georgian, Tamil, Uzbek, and Albanian.
- Indonesian / Malay and Swahili don't have any Latin diacritics that need protecting, so they get asciifolding, too.
- Some languages that don't use the Latin alphabet still need some icu_folding exceptions.
  - icu_folding doesn't like viramas, but we can make it play nice.
- Chinese is an outlier because the smartcn_tokenizer does weird things, but enabling "icu_folding_preserve" on the plain field is still a good thing.
- Lots of languages that use the Latin alphabet (some in conjunction with the Cyrillic alphabet) need exceptions.
  - A few needed mappings for comma/cedilla confusion (similar to Romanian) for Șș, Țț vs Şş, Ţţ.
  - Vietnamese capital "d with stroke" Đđ looks a lot like capital "eth" Ðð... compensatory corrective mapping ensues.
☞ This expands ICU folding coverage to the languages of the top 50 Wikipedias and 61 of the top 90 in my list (sorted by number of unique queries in a month), and to the languages of 65 Wikipedias overall.
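The two character mappings mentioned in the highlights can be sketched in Python with `str.translate`. The mapping directions here are assumptions made for illustration (cedilla forms mapped to comma-below forms, and capital eth mapped to capital d with stroke); the comment above does not state which direction each language actually uses, and the names `CEDILLA_TO_COMMA`, `ETH_TO_D_STROKE`, and the two helper functions are hypothetical.

```python
# ASSUMED direction: cedilla forms are rewritten to comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
})

# ASSUMED direction: capital eth is rewritten to capital d with stroke.
# Only the capitals look alike, so lowercase eth is left untouched here.
ETH_TO_D_STROKE = str.maketrans({"\u00d0": "\u0110"})


def fix_comma_cedilla(text):
    """Unify look-alike cedilla and comma-below S/T letters."""
    return text.translate(CEDILLA_TO_COMMA)


def fix_vietnamese_eth(text):
    """Replace a mistyped capital eth with capital d with stroke."""
    return text.translate(ETH_TO_D_STROKE)
```

Mappings like these would run before folding, so that visually identical strings end up as the same token.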

Patches for these changes to follow—though they are a little complicated, so don't hold your breath.

The second round will start by looking at 11 more Indic languages with Brahmic scripts.

Change #1069282 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5A

https://gerrit.wikimedia.org/r/1069282

Change #1069289 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5B

https://gerrit.wikimedia.org/r/1069289

Change #1069296 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5C

https://gerrit.wikimedia.org/r/1069296

Change #1069282 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5A

https://gerrit.wikimedia.org/r/1069282

Change #1069289 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5B

https://gerrit.wikimedia.org/r/1069289

Change #1069296 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 5C

https://gerrit.wikimedia.org/r/1069296

ReleaseTaggerBot edited projects, added MW-1.43-notes (1.43.0-wmf.22; 2024-09-10); removed MW-1.43-notes (1.43.0-wmf.19; 2024-08-20). Sep 3 2024, 4:00 PM

Maintenance_bot removed a project: Patch-For-Review. Sep 3 2024, 4:30 PM

Change #1074518 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] [WIP] Harmonize asciifolding and icu_folding--Part 6

https://gerrit.wikimedia.org/r/1074518

gerritbot added a project: Patch-For-Review. Sep 20 2024, 7:09 PM

A full write up with details of the 11 languages using Indic scripts (Marathi, Burmese, Malayalam, Telugu, Sinhala, Kannada, Gujarati, Nepali, Assamese, Punjabi, and Odia) that are configured in this last patch is on Mediawiki.

Experience is something you don't get until just after you need it. —Steven Wright

Within the Indic abugidas, there are patterns—somewhat inconsistent patterns, but patterns nonetheless: ICU folding has a strong dislike for viramas; in some Indic scripts vowel signs are clobberized, in others they are untouched, so you gotta check; nuktas are generally okay to strip, though there can be specific exceptions; "composed" diacritics are sometimes written in pieces, and the number of fonts that render underlyingly different strings the same is a rough predictor of how likely icu_normalizer is to fix them; normalizing numerals is a good thing.
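Two of the patterns above (nuktas are generally fine to strip while viramas must survive, and numerals should be normalized) can be sketched in Python using only Unicode character properties. This is an approximation for illustration, not the ICU folding configuration used in the patch; the helper names are hypothetical, and it relies on nuktas having canonical combining class 7 and viramas class 9.

```python
import unicodedata

NUKTA_CLASS = 7  # canonical combining class assigned to nuktas


def strip_nuktas(text):
    """Remove nuktas but leave viramas (combining class 9) untouched."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed if unicodedata.combining(ch) != NUKTA_CLASS
    )
    return unicodedata.normalize("NFC", kept)


def normalize_numerals(text):
    """Map any Unicode decimal digit to its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)
```

Per-language exceptions (for example, a nukta that distinguishes two letters speakers treat as different) would still need to be carved out by hand.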

With previous updates this adds support for customized ICU folding to 31 new languages, with ICU folding coverage for the languages of the top 50 Wikipedias and 72 of the top 90 in my list (by unique query volume), and for the languages of 76 Wikipedias overall.

If you are familiar with any of the languages above, and you think nuktas should be kept, or viramas should be stripped, or you can think of any other characters that should be considered equivalent, drop me a line... leave a comment here, open a new ticket, or contact me in any of the usual places.

I'll open tickets for future ICU folding upgrades and some of the incidental items I discovered along the way, but that's a task for next week, I think.

TJones moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Sep 23 2024, 3:05 PM

TJones moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board.

TJones moved this task from To Be Deployed to Needs review on the Discovery-Search (Current work) board.

Change #1074518 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize asciifolding and icu_folding--Part 6

https://gerrit.wikimedia.org/r/1074518

TJones moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. Sep 23 2024, 3:12 PM

Maintenance_bot removed a project: Patch-For-Review. Sep 23 2024, 3:30 PM

TJones mentioned this in T375557: Reindex all wikis to enable folding harmonization and new functionality. Sep 24 2024, 8:13 PM

TJones mentioned this in T375561: Apply ICU folding to more languages. Sep 24 2024, 9:19 PM

dr0ptp4kt moved this task from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board. Oct 1 2024, 3:04 PM

Gehel closed this task as Resolved. Oct 4 2024, 7:44 AM