
[Spike] Implement script-optimized tokenization
Closed, Resolved (Public)

Description

Our regex tokenizer is slower on some scripts than on others because a single pattern must handle all of them. Alternatives in the regex are tried in order, so if we can reorder the Lexicon so that the target script occurs earlier in the regex, that should allow for some performance gains.
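As a rough illustration of the idea, here is a minimal sketch using Python's standard `re` module. The lexicon entries, the `build_tokenizer`, `tokenize` and `time_tokenizer` helpers, and the character ranges are all hypothetical and are not taken from the actual tokenizer. The sketch assumes the lexicon is an ordered list of named patterns joined into one alternation, and that the patterns being moved do not overlap with one another, so reordering changes only the speed and not the tokens produced.

```python
import re
import timeit

# Hypothetical lexicon: (token type, pattern), tried in this order.
# The patterns are kept disjoint so that reordering cannot change the output.
LEXICON = [
    ("number", r"[0-9]+"),
    ("latin_word", r"[A-Za-z]+"),
    ("cjk", r"[\u4e00-\u9fff]"),
    ("devanagari_word", r"[\u0900-\u097f]+"),
    ("whitespace", r"\s+"),
    ("etc", r"."),  # catch-all; must stay last
]


def build_tokenizer(lexicon, prefer=()):
    """Compile one alternation, moving the preferred token types to the front."""
    # sorted() is stable, so non-preferred entries keep their relative order
    # and the catch-all stays at the end.
    ordered = sorted(lexicon, key=lambda item: item[0] not in prefer)
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in ordered)
    return re.compile(pattern, re.DOTALL)


def tokenize(tokenizer, text):
    """Return (token type, token text) pairs for every match in the text."""
    return [(m.lastgroup, m.group()) for m in tokenizer.finditer(text)]


def time_tokenizer(tokenizer, text, number=100):
    """Total seconds taken to tokenize the text `number` times."""
    return timeit.timeit(lambda: tokenize(tokenizer, text), number=number)


generic = build_tokenizer(LEXICON)
hindi_first = build_tokenizer(LEXICON, prefer=("devanagari_word",))
```

Comparing `time_tokenizer(generic, sample)` with `time_tokenizer(hindi_first, sample)` on a representative sample for the target script would show whether the reordering helps in practice. Checking that both tokenizers return identical tokens guards against a reordering that silently changes behavior.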

This task is done when we have implemented a script-specific tokenizer and run a set of tests on it. If the tests look good, file a follow-up task for implementing script-specific tokenization in revscoring.