Page MenuHomePhabricator

Improve the performance and quality of tokenization in revscoring
Closed, ResolvedPublic


Tokenization is one of the most performance sensitive operations we perform in feature extraction. It's also somewhere where we duplicate the work done in other systems -- e.g. the systems that feed into ElasticSearch. There are lots of difficulties involved in processing words and word-like text sequences in Wikitext. E.g. python's definition of "\w" in regex does not account for many non-latin word characters (e.g. diacritics).


  • Profile the speed of tokenization in revscoring
  • Explore ElasticSearch/Lucene do tokenization. Can we somehow share common code with those systems?
  • Explore python libraries for tokenization. Anything we should invest in? (e.g. gensim, spacy, etc)


  • Demonstrate improvements in revscoring's ability to process non-latin text
  • Demonstrate improvements in the performance of tokenization.

Note that revscoring depends on deltas for tokenization.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

In the past, @TJones gave a second-hand recommendation for TextBlob.

Also here's some notes from our past conversations around NLP and ORES:

I'm currently running a profiling script against feature extraction for ORES English Wikipedia editquality model. I'll report the results here when it finishes.

@Halfak Any update on this? The profiling script.

Thanks for the reminder. See P10868

A few surprising things to me.

(1) I didn't realize we were running mwparserfromhell in our editquality models. That's certainly a place where we can gain some speed. We should be able to generate effective features without running that parser.
(2) I'm very surprised to see the "dict_split" be as slow as it is. This is a call to pyenchant. It matches words against our dictionaries. I've made some changes there recently.
(3) We do see tokenization show up towards the top. Because of the two surprises above, tokenization doesn't look that bad! But I assure you it is. I think we can fix (1) and (2) easily.
(4) The badwords and informal matches are pretty slow too. This is similar to the performance issue we had with idioms. I wonder if there's a general improvement we can make there to all of these matching strategies.

Great. Overall, not as bad as initially thought.

I have worked to convert the regex from being Python to Java compatible as seen here.

For the tokenizer to work on Elasticsearch, I had to make use of Elasticsearch's pattern tokenizer. Tokenization is then done through the Elasticsearch's analyze api.

On comparing the existing wikitext_split tokenizer to the tokenizer created on Elasticsearch, I realized that Elasticsearch is about two times slower.

My observation above can be seen here.

NB: The Elastisearch in use for the shared Github gist is the one installed on my local machine. Hence, to test on another machine, it also has to be installed there with necessary changes made to the domain name and port number.

One further issue currently being faced with Elasticsearch is that named groups can't be extracted. So while it tokenizes the text, I am yet to find a way to also extract the named group for each token.

I am currently reading up different resources on writing efficient regex so I can figure out possible improvements to the regex in use.

@HAKSOAT, I have a couple of suggestions, which may or may not be helpful.

I have written a fairly similar script to call Elasticsearch to tokenize text line-by-line, and I found that a lot of the time spent in that script was overhead of calling Elasticsearch. Bundling up to 100,000 characters per call improved performance a lot. So, two things: (i) you might want to re-run your test will 100 copies of "Alan Turing" joined by newlines, and (ii) the tokenizer might be faster in your real use case if you are feeding it bigger chunks of text at a time.

As for the named groups in your regex, I'm pretty sure you could write a custom token filter (in Java) to change the token type based on what group the token matches. Then you'd get the info you need in a format already supported by Elasticsearch.

Hope that helps!

Thanks for this @TJones . Yes, a lot of the time spent was due to the overhead calling the API. I don't think combining 100 Alan Turings will help though as that was just to profile performance. In the real use case, we will actually only be tokenizing a single article (or a single document depending on the scenario). Hence, we can't combine at that point.

For the custom token filter, I didn't know I could do that, I'll check up and see what I can do.

I appreciate your input :-)

For the custom token filter, we have written a few simple ones that just wrap a stemmer. Two are in the extra-analysis plugin—for Serbian and Esperanto. The * files are the actual filters, and all they do is call the stemmer and replace the text of the token. You could do something similar except change the token type. The * files register the filters for use in your analysis chains.

My dev environment is horked at the moment, but if you want more details about the fields and how to set them I can find examples after I get it working again.

Building and deploying plugins to production servers is a little more complicated, depending on the circumstances, but other folks on the Search Platform team could give you pointers on that if needed.

Thanks for this. I currently don't know a lot of Java, just Python and PHP. So I'll need as much help as I can get.

I'm working on improving the regex at the moment though. So if that doesn't yield any positives, I'll reach out to the Search Platform team regarding the custom filters.

In the last week, I have worked extensively on improving my understanding of regex engines and how to write optimized regular expressions.

I finally started working on the existing regex and got the first performance improvement, this was done by reshuffling the alternatives, putting prioritizing regex for most frequently occurring tokens.

I also put more effort into optimizing each alternative in the regular expression, now the regex gives about a 40% performance increase. Some code snippets can be seen here.

While I do not think it necessary to use Elasticsearch at this point, next step for me would be to use the new regex with Elasticsearch and see how it performs.
Asides improving performance of regex, I also worked on the correctness of some the individual expressions as the matched values we do not want them to match at certain instances.

@HAKSOAT and I chatted about using unicode ranges in the "word" token. See the "Unicode Scripts" section of

I opened a pull request for improving the regex used by wikitext_split on the deltas package.

Using Alan Turing's article for benchmarking purposes, the previous regex could tokenize about 5.1 articles per second, while the new about 7.7 articles per second.

Asides this, the new regex corrects the matching of wrong tokens. Two notable cases was seen with tag tokens matching wrongly. For example, <ref name="doi10.1093/qjmam/1.1.287"> matched as a tag instead of a ref_open.

Also, url tokens matched wrongly in different cases. For example,|archive-date=5 and</ref> matched as a url, instead of multiple tokens considering the extra characters that do not qualify as a url.

Something that caught my interest was the performance gains when the cjk and japan_punct are removed from the regex. Here are some numbers:

On the Alan Turing article, 131643 characters:

old regex: ~5.1 articles per second
new regex without cjk and japan_punct: ~8.2 articles per second
new regex with cjk and japan_punct: ~7.7 articles per second

On the Michael Jackson article, 236025 characters:

old regex: ~2.9 articles per second
new regex without cjk and japan_punct: ~4.6 articles per second
new regex with cjk and japan_punct: ~4.2 articles per second

I think the Michael Jackson article is one of the longest on wiki, perhaps there are longer ones. However, the most common token for many articles is the word token. In the regex, an attempt to match cjk and japan_punct has to be done first before an attempt to match a word token.

This means a lot of inefficient steps taken, considering there is no cjk or japan_punct token in both articles. As a result, I am thinking we could have a different tokenizer for this languages, if the performance gains from the removal of cjk and japan_punct are big enough. However, I also consider that there may be articles with a mix of cjk and japan_punct tokens in them.

Changes merged. I just released deltas-0.5.0.

Re. CJK, I think that we should put Korean chars into the word type since the language is space delimited.

For Chinese and Japanese, we can unify some of the character ranges. For some reason, I specified some smaller character ranges that could be combined into bigger ranges. E.g. from

cjk = (
    r'[' +
        r'\u4E00-\u62FF' +  # noqa Unified Ideographs
            r'\u6300-\u77FF' +
            r'\u7800-\u8CFF' +
            r'\u8D00-\u9FCC' +
        r'\u3400-\u4DFF' +  # Unified Ideographs Ext A
        r'\U00020000-\U000215FF' +  # Unified Ideographs Ext. B
            r'\U00021600-\U000230FF' +
            r'\U00023100-\U000245FF' +
            r'\U00024600-\U000260FF' +
            r'\U00026100-\U000275FF' +
            r'\U00027600-\U000290FF' +
            r'\U00029100-\U0002A6DF' +
        r'\uF900-\uFAFF' +  # Compatibility Ideographs
        r'\U0002F800-\U0002FA1F' +  # Compatibility Ideographs Suppl.
        r'\u3041-\u3096' +  # Hiragana
        r'\u30A0-\u30FF' +  # Katakana
        r'\u3400-\u4DB5' +  # Kanji
            r'\u4E00-\u9FCB' +
            r'\uF900-\uFA6A' +
        r'\u2E80-\u2FD5' +  # Kanji radicals
        r'\uFF5F-\uFF9F' +  # Katakana and Punctuation (Half Width)
        r'\u31F0-\u31FF' +  # Miscellaneous Japanese Symbols and Characters
            r'\u3220-\u3243' +

Can be simplified to:

cjk = (
    r'[' +
        r'\u4E00-\u9FCC' +  # noqa Unified Ideographs
        r'\u3400-\u4DFF' +  # Unified Ideographs Ext A
        r'\U00020000-\U0002A6DF' +  # Unified Ideographs Ext. B
        r'\uF900-\uFAFF' +  # Compatibility Ideographs
        r'\U0002F800-\U0002FA1F' +  # Compatibility Ideographs Suppl.
        r'\u3041-\u3096' +  # Hiragana
        r'\u30A0-\u30FF' +  # Katakana
        r'\u3400-\u4DB5' +  # Kanji
            r'\u4E00-\u9FCB' +
            r'\uF900-\uFA6A' +
        r'\u2E80-\u2FD5' +  # Kanji radicals
        r'\uFF5F-\uFF9F' +  # Katakana and Punctuation (Half Width)
        r'\u31F0-\u31FF' +  # Miscellaneous Japanese Symbols and Characters
            r'\u3220-\u3243' +

Generally, I wonder if we can do a better job of matching script ranges if we use the "script names" form the unicode standard. Standard python re doesn't support "\p{Latin}" but the seemingly well-maintained regex package does. See What do you think?

Thanks for linking to the regex package. I spent some time working with it and it is quite amazing. I think we can make use of it.

I checked the regex package. While it works great for the unicode scripts, it's reducing performance by about 40%. I checked the implementation of the library in other to see if I could get some inspiration and pull that into ours, but I see a lot of C code that I do not understand. Looks like I'll have to work with specifying the code ranges using the built-in re package instead.

Update: I ended up creating a lexicon to be used for cjk-dominant text and another to be used for non-cjk dominant text.

I created just these two types because chinese and japanese usually do not have spaces in between unlike other languages. Note that "cjk" in this case, doesn't include Korean because it has spaces in between words.

Another reason for creating two different lexicons is to achieve improved speed when using the right lexicon for the right kind of text. The effect can be seen in the following rough numbers:

Tokenizing Michael Jackson's article:

We can tokenize 4.8637554318290706 Michael Jackson's per second with non-cjk lexicon
We can tokenize 4.360152652496435 Michael Jackson's per second with cjk lexicon
We can tokenize 3.288650007989335 Michael Jackson's per second with the old lexicon

Tokenizing 土佐のほっぱん's article:

We can tokenize 17.67363030549188 土佐のほっぱん's per second with non-cjk lexicon
We can tokenize 38.497820037663445 土佐のほっぱん's per second with cjk lexicon
We can tokenize 31.25670183722281 土佐のほっぱん's per second with the old lexicon

From this, we can see that the lexicons are optimized for their specialized cases, and still work when not used for their specialized cases albeit not very fast.

I decided to go with the non-cjk lexicon but with the regex for matching cjk text much higher up this list. This provided a balance in speed, so non-cjk lexicon doesn't run so slow on cjk text.