
Investigate changing ICU tokenization from whitelist to blacklist
Open, Normal, Public


Working on T147959 highlighted that, among wikis whose languages get the "default" config, some use the Standard tokenizer and some use the ICU tokenizer. After discussion with @dcausse, it seems plausible that we should just enable the ICU tokenizer for all default languages, but this needs testing.

  1. Run a largish sample from enwiki, which will contain lots of characters in lots of scripts and lots of words in lots of languages, to see if there are any obvious problems switching tokenizers.
  2. If not, then review the languages that use the standard tokenizer and see if any are likely to have problems.
  3. Convert $languagesWithIcuTokenization and shouldActivateIcuTokenization() from a whitelist to a blacklist (and double-check their interaction with the language-specific analyzers).
  4. Deploy and re-index a loooooooot of wikis. (Probably all the ones that are marked "ICU normalizer + Standard tokenizer" in this list.)
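The flip in step 3 could look something like the sketch below. This is an illustrative Python sketch only — the real CirrusSearch code is PHP, and the language codes shown are hypothetical placeholders, not the actual whitelist or any known set of problem languages.

```python
# Sketch of the whitelist -> blacklist flip for ICU tokenization.
# Function and variable names mirror the PHP config
# ($languagesWithIcuTokenization, shouldActivateIcuTokenization()),
# but the language codes here are placeholders for illustration.

# Before: opt-in whitelist -- only listed languages get the ICU tokenizer.
LANGUAGES_WITH_ICU_TOKENIZATION = {"bo", "ja", "km"}  # hypothetical

def should_activate_icu_tokenization_whitelist(language: str) -> bool:
    """Old behavior: ICU tokenization only for explicitly listed languages."""
    return language in LANGUAGES_WITH_ICU_TOKENIZATION

# After: opt-out blacklist -- every language gets the ICU tokenizer
# unless testing (steps 1-2) shows it causes problems.
LANGUAGES_WITHOUT_ICU_TOKENIZATION = {"xx"}  # hypothetical exclusions

def should_activate_icu_tokenization_blacklist(language: str) -> bool:
    """New behavior: ICU tokenization everywhere except excluded languages."""
    return language not in LANGUAGES_WITHOUT_ICU_TOKENIZATION
```

One consequence of the blacklist form is that newly added wikis get ICU tokenization by default, which matches the goal of enabling it for all "default" languages.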

Event Timeline

TJones created this task. Oct 10 2017, 7:52 PM
Restricted Application added a subscriber: Aklapper. Oct 10 2017, 7:52 PM
debt triaged this task as Normal priority. Oct 12 2017, 5:09 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
TJones removed TJones as the assignee of this task. Oct 24 2017, 4:09 PM
TJones renamed this task from "Investigate changing ICU tokenization from whitelist to blacklist." to "Investigate changing ICU tokenization from whitelist to blacklist". Nov 13 2017, 3:48 PM