
Investigate changing ICU tokenization from whitelist to blacklist
Open, Normal, Public

Description

Working on T147959 highlighted that some wikis with languages that get the "default" config use the Standard Tokenizer, and some use the ICU tokenizer. After discussion with @dcausse, it seems plausible that we should just enable the ICU tokenizer for all default-config languages, but this needs testing.

  1. Run a largish sample from enwiki, which will contain lots of characters in lots of scripts and lots of words in lots of languages, to see if there are any obvious problems switching tokenizers. (See the first sketch after this list for one way to compare tokenizer output.)
  2. If not, then review the languages that use the standard tokenizer and see if any are likely to have problems.
  3. Convert $languagesWithIcuTokenization and shouldActivateIcuTokenization() from a whitelist to a blacklist (and double-check how the blacklist interacts with language-specific analyzers; see the second sketch after this list).
  4. Deploy and re-index a loooooooot of wikis. (Probably all the ones that are marked "ICU normalizer + Standard tokenizer" in this list.)
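A rough sketch of the comparison in step 1: run the same text through the standard tokenizer and the ICU tokenizer via Elasticsearch's _analyze API and flag any disagreement. This assumes a local Elasticsearch with the analysis-icu plugin installed; the endpoint URL, the sample string, and the output format are placeholders, not the actual enwiki corpus run.

```
<?php
// Compare standard vs ICU tokenization of one line of text (assumed local ES).

function analyzeTokens( string $tokenizer, string $text ): array {
	$ch = curl_init( 'http://localhost:9200/_analyze' );
	curl_setopt_array( $ch, [
		CURLOPT_RETURNTRANSFER => true,
		CURLOPT_POST => true,
		CURLOPT_HTTPHEADER => [ 'Content-Type: application/json' ],
		CURLOPT_POSTFIELDS => json_encode( [ 'tokenizer' => $tokenizer, 'text' => $text ] ),
	] );
	$response = json_decode( curl_exec( $ch ), true );
	curl_close( $ch );
	// Keep only the token strings; positions and offsets aren't needed for a first pass.
	return array_column( $response['tokens'] ?? [], 'token' );
}

$sample = "Mixed-script line: ガラパゴス諸島, Галапагосские острова, Galápagos Islands";
$standard = analyzeTokens( 'standard', $sample );
$icu = analyzeTokens( 'icu_tokenizer', $sample );

// Only report lines where the two tokenizers disagree.
if ( $standard !== $icu ) {
	echo "standard: " . implode( ' | ', $standard ) . "\n";
	echo "icu:      " . implode( ' | ', $icu ) . "\n";
}
```

Run over a largish sample, this should surface the scripts and languages where the switch actually changes tokenization, which is what step 2 needs to review.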
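And a minimal sketch of the blacklist semantics in step 3. This is a standalone illustration, not the actual CirrusSearch code: the array name $languagesWithoutIcuTokenization is a hypothetical inverse of $languagesWithIcuTokenization, and the real method also needs to check that the ICU plugin is available and respect language-specific analyzers that choose their own tokenizer.

```
<?php
// Hypothetical inverse of $languagesWithIcuTokenization: languages for which
// ICU tokenization should stay off, to be filled in from the review in step 2.
$languagesWithoutIcuTokenization = [];

function shouldActivateIcuTokenization( $language ) {
	global $languagesWithoutIcuTokenization;
	// Blacklist semantics: ICU tokenization is on unless the language is explicitly excluded.
	return !in_array( $language, $languagesWithoutIcuTokenization, true );
}
```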

Event Timeline

TJones created this task. Oct 10 2017, 7:52 PM
Restricted Application added a subscriber: Aklapper. Oct 10 2017, 7:52 PM
debt triaged this task as Normal priority. Oct 12 2017, 5:09 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
TJones removed TJones as the assignee of this task. Oct 24 2017, 4:09 PM
TJones renamed this task from Investigate changing ICU tokenization from whitelist to blacklist. to Investigate changing ICU tokenization from whitelist to blacklist. Nov 13 2017, 3:48 PM