
Investigate changing ICU tokenization from whitelist to blacklist
Open, MediumPublic

Description

Working on T147959 highlighted that some wikis whose languages get the "default" config use the standard tokenizer, while others use the ICU tokenizer. After discussion with @dcausse it seems plausible that we should just enable the ICU tokenizer for all default-config languages, but this needs testing.
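To make the difference concrete, here is a sketch of the two analysis configs being compared. The actual CirrusSearch settings are generated in PHP, so the analyzer name and filter chain below are assumptions for illustration; only `standard`, `icu_tokenizer`, and `icu_normalizer` are real Elasticsearch component names.

```python
# Hypothetical index-settings fragments, expressed as Python dicts.
# "text" is an assumed analyzer name, not necessarily what CirrusSearch uses.

# Current "default" config on some wikis: standard tokenizer.
standard_config = {
    "analyzer": {
        "text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["icu_normalizer"],
        }
    }
}

# Proposed config: same chain, but tokenized by the ICU tokenizer.
icu_config = {
    "analyzer": {
        "text": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_normalizer"],
        }
    }
}
```

The only difference between the two is the tokenizer; the rest of the chain (here, just ICU normalization) stays the same, which is why a straight swap seems plausible but still needs testing against real multi-script text.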

  1. Run a largish sample from enwiki, which will contain lots of characters in lots of scripts and lots of words in lots of languages, to see if there are any obvious problems switching tokenizers.
  2. If not, then review the languages that use the standard tokenizer and see if any are likely to have problems.
  3. Convert $languagesWithIcuTokenization and shouldActivateIcuTokenization() from a whitelist to a blacklist (and double-check their interaction with language-specific analyzers).
  4. Deploy and re-index a very large number of wikis. (Probably all the ones that are marked "ICU normalizer + Standard tokenizer" in this list.)
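Step 3 above can be sketched as follows. The function and variable names mirror the PHP identifiers from the task, but this is an illustrative Python sketch, and the language codes in both sets are placeholders, not the actual lists.

```python
# Before: opt-in whitelist. Only languages explicitly listed get the
# ICU tokenizer; everything else falls back to the standard tokenizer.
LANGUAGES_WITH_ICU_TOKENIZATION = {"bo", "km"}  # placeholder codes

def should_activate_icu_tokenization_whitelist(lang: str) -> bool:
    return lang in LANGUAGES_WITH_ICU_TOKENIZATION

# After: opt-out blacklist. Every language gets the ICU tokenizer unless
# explicitly excluded, e.g. because its language-specific analyzer would
# conflict with it.
LANGUAGES_WITHOUT_ICU_TOKENIZATION = {"xx"}  # placeholder exclusions

def should_activate_icu_tokenization_blacklist(lang: str) -> bool:
    return lang not in LANGUAGES_WITHOUT_ICU_TOKENIZATION
```

The practical difference is the default: under the whitelist, an unreviewed language never gets ICU tokenization; under the blacklist it does, which is why steps 1 and 2 (sampling and reviewing the standard-tokenizer languages) come first.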

Event Timeline

debt triaged this task as Medium priority. Oct 12 2017, 5:09 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
TJones renamed this task from "Investigate changing ICU tokenization from whitelist to blacklist." to "Investigate changing ICU tokenization from whitelist to blacklist". Nov 13 2017, 3:48 PM