Tokenization is one of the most performance-sensitive operations we perform in feature extraction. It is also a place where we duplicate work done in other systems -- e.g. the systems that feed into ElasticSearch. There are lots of difficulties involved in processing words and word-like text sequences in Wikitext. For example, Python's definition of "\w" in the standard re module does not match many non-Latin word characters (e.g. combining diacritics).
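A minimal illustration of the "\w" issue, using a decomposed "é" (the third-party regex package is shown only for comparison, not as a chosen fix):

```python
import re
import regex  # third-party package; pip install regex

# "é" in decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301"

# The stdlib re module's \w does not match combining marks (Unicode category Mn),
# so the token stops at "cafe" and the accent is silently dropped.
print(re.findall(r"\w+", word))     # ['cafe']

# The third-party regex package follows UTS #18, where \w includes combining
# marks, so the whole decomposed word survives as one token.
print(regex.findall(r"\w+", word))  # ['café'] (still decomposed)
```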
Research:
- Profile the speed of tokenization in revscoring (see the profiling sketch after this list)
- Explore how ElasticSearch/Lucene do tokenization (see the _analyze example after this list). Can we somehow share common code with those systems?
- Explore Python libraries for tokenization (e.g. gensim, spaCy; see the comparison sketch after this list). Is there anything we should invest in?
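A rough profiling sketch for the first item, assuming the deltas wikitext_split tokenizer and its tokenize() method as the entry point (the sample text is a stand-in, not real revision data):

```python
import cProfile
import timeit

from deltas.tokenizers import wikitext_split

# Stand-in for the wikitext of a revision.
text = "'''Example''' text with [[links]], {{templates}} and refs.<ref>x</ref> " * 1000

# Coarse throughput: seconds per full tokenization pass over `text`.
seconds = timeit.timeit(lambda: wikitext_split.tokenize(text), number=10) / 10
print("{:.4f} s per pass over {:.1f} KB".format(seconds, len(text) / 1024))

# Where the time goes (regex matching, token object construction, etc.).
cProfile.run("wikitext_split.tokenize(text)", sort="cumulative")
```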
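For the ElasticSearch/Lucene item, the _analyze API is a quick way to see what a Lucene analysis chain produces; the host and analyzer below are assumptions for a local test cluster, not necessarily the analyzers Wikimedia's search uses in production:

```python
import json
import urllib.request

# Ask a local Elasticsearch node to run its "standard" analyzer over a sample.
payload = json.dumps({"analyzer": "standard", "text": "Grüße aus Köln"}).encode("utf-8")
request = urllib.request.Request(
    "http://localhost:9200/_analyze",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)

# Each entry includes the token text and its character offsets in the input.
for token in result["tokens"]:
    print(token["token"], token["start_offset"], token["end_offset"])
```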
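For the Python-library item, a hedged sketch of what gensim and spaCy expose out of the box; both calls exist, but their behavior on the non-Latin text we care about would need verification:

```python
import gensim.utils
import spacy

text = "Grüße aus Köln and some wiki [[markup]]"

# gensim: a light, regex-based word splitter (returns a generator of tokens).
print(list(gensim.utils.tokenize(text, lowercase=False)))

# spaCy: the blank multilingual pipeline ("xx") runs only the rule-based
# tokenizer, without loading any statistical models.
nlp = spacy.blank("xx")
print([token.text for token in nlp(text)])
```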
Implementation:
- Demonstrate improvements in revscoring's ability to process non-Latin text
- Demonstrate improvements in the performance of tokenization.
Note that revscoring depends on deltas for tokenization.