Tokenization is one of the most performance-sensitive operations we perform in feature extraction. It's also a place where we duplicate work done in other systems -- e.g. the systems that feed into ElasticSearch. Processing words and word-like text sequences in Wikitext involves many difficulties. For example, Python's definition of `\w` in regular expressions does not account for many non-Latin word characters (e.g. combining diacritics).
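A minimal sketch of the diacritics problem: when an accented character is stored in decomposed (NFD) form, `\w` may not match the combining mark (behavior varies by Python version), so naive regex tokenization can split the word. Normalizing to NFC before tokenizing sidesteps this. The strings below are illustrative, not from revscoring itself.

```python
import re
import unicodedata

# "café" with the accent stored as a separate combining mark (NFD form).
decomposed = "cafe\u0301"

# Depending on the Python version, \w may not match combining marks
# (Unicode category Mn), so this can split the word at the diacritic.
naive = re.findall(r"\w+", decomposed)

# Normalizing to NFC first composes "e" + U+0301 into the single letter
# "é", so the whole word survives tokenization on any Python version.
robust = re.findall(r"\w+", unicodedata.normalize("NFC", decomposed))
print(naive, robust)  # robust is always ['café']
```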
## Research:
* Profile the speed of tokenization in revscoring
* Explore how ElasticSearch/Lucene do tokenization. Can we somehow share common code with those systems?
* Explore Python libraries for tokenization. Anything we should invest in? (e.g. gensim, spaCy, etc.)
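For the profiling item above, a rough baseline harness might look like the following. The regex tokenizer here is a stand-in -- the real target would be the deltas-based tokenizer that revscoring uses -- and the sample text and repeat count are arbitrary.

```python
import re
import timeit

# Stand-in tokenizer; substitute the deltas/revscoring tokenizer
# under test to get comparable numbers.
WORD_RE = re.compile(r"\w+")

def tokenize(text):
    return WORD_RE.findall(text)

# Arbitrary sample input; real profiling should use actual revision text.
sample = "Tokenization is one of the most performance-sensitive operations. " * 200

# Average over many runs to get a stable per-call estimate.
n = 100
seconds = timeit.timeit(lambda: tokenize(sample), number=n)
print(f"{seconds / n * 1e6:.1f} µs per call, {len(tokenize(sample))} tokens")
```

For finer-grained attribution (which regexes or functions dominate), the same call can be wrapped in `cProfile` instead of `timeit`.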
## Implementation:
* Demonstrate improvements in revscoring's ability to process non-Latin text.
* Demonstrate improvements in the performance of tokenization.
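As a sketch of what "processing non-Latin text" means in practice: an ASCII-only word pattern silently drops non-Latin tokens, while a Unicode-aware pattern keeps them. The mixed-script example string is illustrative only.

```python
import re

ascii_words = re.compile(r"[a-zA-Z]+")   # ASCII-only word pattern
unicode_words = re.compile(r"\w+")       # Unicode-aware word pattern

text = "Москва is the capital"  # mixed Cyrillic and Latin text
print(ascii_words.findall(text))    # drops the Cyrillic word entirely
print(unicode_words.findall(text))  # keeps all four words
```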
Note that revscoring depends on [deltas](https://pythonhosted.org/deltas/) for tokenization.