
Add features for English Language idioms to articlequality models
Closed, Resolved · Public

Event Timeline

Restricted Application added a subscriber: Aklapper.

See https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/tests/test_enwiki.py

I'm imagining something like this:

from revscoring.features import modifiers, wikitext
from revscoring.languages import english

# Count of idiom matches in the revision's text
idioms_count = english.revision.idioms.matches

local_wiki = [
  ...
  idioms_count,
  # words_to_watch_count comes from the existing enwiki feature list
  words_to_watch_count + idioms_count,
  (words_to_watch_count + idioms_count) / modifiers.max(wikitext.revision.words, 1),
  idioms_count / modifiers.max(wikitext.revision.words, 1)
]

I ran a test with this and discussed it yesterday with @HAKSOAT. In https://gist.github.com/halfak/b9ce3f174a066e4851d04a2de7d2437d, we can see that many of the phrases that get picked up are not actually problematic idioms. It seems that this is because of the way that Wiktionary categorizes "idioms". @HAKSOAT and I discussed a few different strategies for cleaning up the dataset.

One method that I proposed is to gather a set of known-good articles and scan them using the idiom set. We could then remove any "idioms" that are commonly matched in the good articles; that would also let us automatically regenerate the dataset in the future. A rough sketch of the idea is below.
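Everything in this sketch is hypothetical: `idiom_patterns` stands in for a dict of per-idiom compiled regexes and `good_texts` for the extracted text of the known-good articles.

from collections import Counter

def prune_idioms(idiom_patterns, good_texts, max_hits=5):
    """Drop any "idiom" that matches more than `max_hits` times across a
    corpus of known-good articles; frequent matches in good prose suggest
    the phrase is benign rather than a quality problem."""
    hits = Counter()
    for text in good_texts:
        for idiom, pattern in idiom_patterns.items():
            hits[idiom] += len(pattern.findall(text))
    return {idiom for idiom in idiom_patterns if hits[idiom] <= max_hits}

Re-running something like this against a fresh set of good articles whenever the Wiktionary-derived list changes is what would keep the cleanup automatic.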

One other thing I noticed is that we tend to pick up idioms in citation text. I wonder if we could somehow exclude all tags when scanning; that would let us disregard block quotes and <ref> tags. This could be a non-issue, though, and it sounds hard, so if we don't find an obvious way to do it, that's OK.
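If we did want to try it, one plausible route (an assumption on my part, not something we settled on) would be to parse the wikitext with mwparserfromhell and drop tag nodes before scanning:

import mwparserfromhell

def strip_tag_content(text):
    """Remove tag nodes (<ref>, <blockquote>, etc.) from wikitext so the
    idiom scan only sees the article's own prose."""
    code = mwparserfromhell.parse(text)
    for tag in code.filter_tags(recursive=True):
        try:
            code.remove(tag)
        except ValueError:
            pass  # node was already dropped along with a parent tag
    return str(code)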

I have been able to work on the English idioms, and I'm excited to say that the speed has improved by a large margin. Before now, we were simply piping the whole list of English idioms together into one long regex alternation, which was quite inefficient.

To make this efficient, we needed to factor the repeated prefixes out of the list. During a chat with @Halfak, he suggested I look into the trie data structure.

Hence, I was able to convert the list of English idioms into a trie and then serialize that trie back into the regex we want. Before the change, extracting idioms took about 18.34 seconds; after the change, it took about 0.71 seconds.
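For anyone curious, here is a minimal sketch of the idea (my own illustration, not the exact code from the patch): build a character trie from the phrase list, then walk it to emit a single regex in which each shared prefix appears only once.

import re

def trie_regex(phrases):
    """Build a character trie from `phrases`, then serialize it into one
    regex in which shared prefixes appear only once."""
    trie = {}
    for phrase in phrases:
        node = trie
        for char in phrase:
            node = node.setdefault(char, {})
        node[""] = {}  # end-of-phrase marker

    def serialize(node):
        end = "" in node  # some phrase terminates at this node
        branches = [re.escape(char) + serialize(child)
                    for char, child in sorted(node.items()) if char != ""]
        if not branches:
            return ""
        if len(branches) == 1:
            # If a phrase also ends here, the remaining suffix is optional.
            return "(?:" + branches[0] + ")?" if end else branches[0]
        body = "(?:" + "|".join(branches) + ")"
        return body + "?" if end else body

    return serialize(trie)

# ["cat", "car", "cart"] collapses to "ca(?:r(?:t)?|t)" instead of
# "cat|car|cart", so the engine never re-reads the shared "ca" prefix.
pattern = re.compile(r"\b" + trie_regex(["cat", "car", "cart"]) + r"\b")

On the real idiom list, that prefix sharing is what turns the roughly 18-second scan into a sub-second one: the regex engine no longer has to try every alternative from scratch at each position.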