Tokenization is one of the most performance-sensitive operations we perform in feature extraction. It's also a place where we duplicate work done in other systems -- e.g. the systems that feed into ElasticSearch. Processing words and word-like text sequences in Wikitext involves many difficulties. For example, Python's definition of `\w` in regular expressions does not account for many non-Latin word characters (e.g. combining diacritics).
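A minimal sketch of the diacritics problem: when an accented character is stored in decomposed (NFD) form, `\w` may not match the combining mark (behavior varies by Python version), so naive regex tokenization can split the word. Normalizing to NFC before tokenizing sidesteps this. The strings below are illustrative, not from revscoring itself.

```python
import re
import unicodedata

# "café" with the accent stored as a separate combining mark (NFD form).
decomposed = "cafe\u0301"

# Depending on the Python version, \w may not match combining marks
# (Unicode category Mn), so this can split the word at the diacritic.
naive = re.findall(r"\w+", decomposed)

# Normalizing to NFC first composes "e" + U+0301 into the single letter
# "é", so the whole word survives tokenization on any Python version.
robust = re.findall(r"\w+", unicodedata.normalize("NFC", decomposed))
print(naive, robust)  # robust is always ['café']
```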
## Research:
* Profile the speed of tokenization in revscoring
* Explore how ElasticSearch/Lucene do tokenization. Can we somehow share common code with those systems?
* Explore Python libraries for tokenization. Anything we should invest in? (e.g. gensim, spaCy, etc.)
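For the profiling item above, a rough baseline harness might look like the following. The regex tokenizer here is a stand-in -- the real target would be the deltas-based tokenizer that revscoring uses -- and the sample text and repeat count are arbitrary.

```python
import re
import timeit

# Stand-in tokenizer; substitute the deltas/revscoring tokenizer
# under test to get comparable numbers.
WORD_RE = re.compile(r"\w+")

def tokenize(text):
    return WORD_RE.findall(text)

# Arbitrary sample input; real profiling should use actual revision text.
sample = "Tokenization is one of the most performance-sensitive operations. " * 200

# Average over many runs to get a stable per-call estimate.
n = 100
seconds = timeit.timeit(lambda: tokenize(sample), number=n)
print(f"{seconds / n * 1e6:.1f} µs per call, {len(tokenize(sample))} tokens")
```

For finer-grained attribution (which regexes or functions dominate), the same call can be wrapped in `cProfile` instead of `timeit`.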
## Implementation:
* Demonstrate improvements in revscoring's ability to process non-Latin text.
* Demonstrate improvements in the performance of tokenization.
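As a sketch of what "processing non-Latin text" means in practice: an ASCII-only word pattern silently drops non-Latin tokens, while a Unicode-aware pattern keeps them. The mixed-script example string is illustrative only.

```python
import re

ascii_words = re.compile(r"[a-zA-Z]+")   # ASCII-only word pattern
unicode_words = re.compile(r"\w+")       # Unicode-aware word pattern

text = "Москва is the capital"  # mixed Cyrillic and Latin text
print(ascii_words.findall(text))    # drops the Cyrillic word entirely
print(unicode_words.findall(text))  # keeps all four words
```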
Note that revscoring depends on [deltas](https://pythonhosted.org/deltas/) for tokenization.