Page MenuHomePhabricator

Implement common features between languages as a meta-language features
Closed, ResolvedPublic

Description

Right now, language sets (e.g. revscoring.languages.english) implement SpaceDelimited language. This works OK, but it means that every language set implements its own "revision.words". We should, instead, have be able to import meta-languages features separately. E.g.

from revscoring.languages.space_delimited.revision import words

This would pave the way for having meta-language features for CJK

from revscoring.languages.cjk.revision import cjk_symbols

Event Timeline

Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description. (Show Details)
Halfak moved this task to Active on the Machine-Learning-Team (Active Tasks) board.
Halfak added a subscriber: Halfak.

I made some serious progress here. At first I tried mixins, but it seems like it makes more sense to just use namespaces and groups of features that rely on particular language assets.

See https://github.com/wiki-ai/revscoring/tree/features_commons/revscoring/languages/features

We now have sets of features for languages with:

  • dictionary
  • stemmed
  • stopwords
  • regexes

This will allow us to build up languages with namespaces based on what assets we have. This will allow for patterns like this:

from revscoring.languages import english

damaging = [
    ...
    english.badwords.revision.matches - english.badwords.revision.parent.matches,
    english.badwords.diff.match_prop_delta,
    english.informals.diff.match_prop_delta,
    english.stemmed.revision.stem_prop_delta,
    english.dictionary.non_dict_words_prop_delta,
    ...
]

Not how the sub-namespaces under "english" contain collections of features.

Nitpick: shouldn't "stemmed" (verb) be "stems" (noun) for consistency with the other sets of features, which are also nouns?

I was thinking that too. I like the adjectives better since they are describing some aspect of the underlying features.

Regretfully, I couldn't think of an adjective that could apply to "badwords" or "dictionary" without sounding weird.

In a few other places, we can have some nice adjectives like "parsed", "tokenized", and "temporal".

See https://github.com/wiki-ai/revscoring/pull/233

I ended up sticking with "stemmed" but I'm still open to discussion on this point.