
Investigate changing ICU tokenization from whitelist to blacklist
Open, Normal, Public


Working on T147959 highlighted that, among wikis whose languages get the "default" config, some use the Standard tokenizer and some use the ICU tokenizer. After discussion with @dcausse, it seems plausible that we should just enable the ICU tokenizer for all default languages, but this needs testing.

  1. Run a largish sample from enwiki, which will contain lots of characters in lots of scripts and lots of words in lots of languages, to see if there are any obvious problems switching tokenizers.
  2. If not, then review the languages that use the standard tokenizer and see if any are likely to have problems.
  3. Convert $languagesWithIcuTokenization and shouldActivateIcuTokenization() from a whitelist to a blacklist (and double-check their interaction with the language-specific analyzers).
  4. Deploy and re-index a loooooooot of wikis. (Probably all the ones that are marked "ICU normalizer + Standard tokenizer" in this list.)
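The flip in step 3 could look something like the sketch below. This is an illustrative Python sketch only — the real CirrusSearch code is PHP, and the language codes shown are hypothetical placeholders, not the actual whitelist or any known set of problem languages.

```python
# Sketch of the whitelist -> blacklist flip for ICU tokenization.
# Function and variable names mirror the PHP config
# ($languagesWithIcuTokenization, shouldActivateIcuTokenization()),
# but the language codes here are placeholders for illustration.

# Before: opt-in whitelist -- only listed languages get the ICU tokenizer.
LANGUAGES_WITH_ICU_TOKENIZATION = {"bo", "ja", "km"}  # hypothetical

def should_activate_icu_tokenization_whitelist(language: str) -> bool:
    """Old behavior: ICU tokenization only for explicitly listed languages."""
    return language in LANGUAGES_WITH_ICU_TOKENIZATION

# After: opt-out blacklist -- every language gets the ICU tokenizer
# unless testing (steps 1-2) shows it causes problems.
LANGUAGES_WITHOUT_ICU_TOKENIZATION = {"xx"}  # hypothetical exclusions

def should_activate_icu_tokenization_blacklist(language: str) -> bool:
    """New behavior: ICU tokenization everywhere except excluded languages."""
    return language not in LANGUAGES_WITHOUT_ICU_TOKENIZATION
```

One consequence of the blacklist form is that newly added wikis get ICU tokenization by default, which matches the goal of enabling it for all "default" languages.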

Event Timeline

TJones created this task. Oct 10 2017, 7:52 PM
Restricted Application added a subscriber: Aklapper. Oct 10 2017, 7:52 PM
debt triaged this task as Normal priority. Oct 12 2017, 5:09 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
TJones removed TJones as the assignee of this task. Oct 24 2017, 4:09 PM
TJones renamed this task from "Investigate changing ICU tokenization from whitelist to blacklist." to "Investigate changing ICU tokenization from whitelist to blacklist". Nov 13 2017, 3:48 PM