Page MenuHomePhabricator

Implement common features between languages as a meta-language features
Closed, ResolvedPublic

Description

Right now, language sets (e.g. revscoring.languages.english) implement SpaceDelimited language. This works OK, but it means that every language set implements its own "revision.words". We should, instead, have be able to import meta-languages features separately. E.g.

from revscoring.languages.space_delimited.revision import words

This would pave the way for having meta-language features for CJK

from revscoring.languages.cjk.revision import cjk_symbols

Event Timeline

Halfak created this task.Dec 9 2015, 9:02 PM
Halfak updated the task description. (Show Details)
Halfak raised the priority of this task from to Needs Triage.
Halfak moved this task to Active on the Scoring-platform-team (Current) board.
Halfak added a subscriber: Halfak.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptDec 9 2015, 9:02 PM

I made some serious progress here. At first I tried mixins, but it seems like it makes more sense to just use namespaces and groups of features that rely on particular language assets.

See https://github.com/wiki-ai/revscoring/tree/features_commons/revscoring/languages/features

We now have sets of features for languages with:

  • dictionary
  • stemmed
  • stopwords
  • regexes

This will allow us to build up languages with namespaces based on what assets we have. This will allow for patterns like this:

from revscoring.languages import english

damaging = [
    ...
    english.badwords.revision.matches - english.badwords.revision.parent.matches,
    english.badwords.diff.match_prop_delta,
    english.informals.diff.match_prop_delta,
    english.stemmed.revision.stem_prop_delta,
    english.dictionary.non_dict_words_prop_delta,
    ...
]

Not how the sub-namespaces under "english" contain collections of features.

Halfak set Security to None.
Halfak claimed this task.Dec 23 2015, 4:17 AM
He7d3r added a subscriber: He7d3r.Dec 23 2015, 3:02 PM

Nitpick: shouldn't "stemmed" (verb) be "stems" (noun) for consistency with the other sets of features, which are also nouns?

I was thinking that too. I like the adjectives better since they are describing some aspect of the underlying features.

Regretfully, I couldn't think of an adjective that could apply to "badwords" or "dictionary" without sounding weird.

In a few other places, we can have some nice adjectives like "parsed", "tokenized", and "temporal".

See https://github.com/wiki-ai/revscoring/pull/233

I ended up sticking with "stemmed" but I'm still open to discussion on this point.

Halfak closed this task as Resolved.Jan 21 2016, 3:43 PM