Implement common features between languages as meta-language features
Closed, Resolved, Public

Assigned To

Authored By

	Halfak
	Dec 9 2015, 9:02 PM

Description

Right now, language sets (e.g. revscoring.languages.english) implement the SpaceDelimited language. This works OK, but it means that every language set implements its own "revision.words". We should instead be able to import meta-language features separately. E.g.

from revscoring.languages.space_delimited.revision import words

This would pave the way for meta-language features for CJK:

from revscoring.languages.cjk.revision import cjk_symbols
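
A rough sketch of the idea, not the actual revscoring API: the two functions below stand in for meta-language features that are defined once and shared by every language of the same family. The function bodies, the regular expressions, and the Unicode ranges are assumptions made for illustration only.

```python
import re

# Hypothetical stand-in for a "space_delimited" meta-language feature:
# runs of Unicode letters, defined once instead of once per language set.
WORD_RE = re.compile(r"[^\W\d_]+")

# Hypothetical stand-in for a "cjk" meta-language feature. The ranges cover
# CJK Unified Ideographs, Hiragana/Katakana, and Hangul syllables only.
CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def words(text):
    """Return the letter-only tokens found in the text."""
    return WORD_RE.findall(text)


def cjk_symbols(text):
    """Return each CJK character in the text as a separate symbol."""
    return CJK_RE.findall(text)
```

Any space-delimited language could then reuse words() and any CJK language could reuse cjk_symbols(), rather than each language set carrying its own copy.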

Event Timeline

Halfak created this task. Dec 9 2015, 9:02 PM

Halfak raised the priority of this task to Needs Triage.

Halfak updated the task description.

Halfak added a project: Machine-Learning-Team (Active Tasks).

Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.

Halfak subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 9 2015, 9:02 PM

Halfak moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board. Dec 23 2015, 3:50 AM

I made some serious progress here. At first I tried mixins, but it seems like it makes more sense to just use namespaces and groups of features that rely on particular language assets.

See https://github.com/wiki-ai/revscoring/tree/features_commons/revscoring/languages/features

We now have sets of features for languages with:

dictionary
stemmed
stopwords
regexes

This will allow us to build up languages with namespaces based on what assets we have. This will allow for patterns like this:

from revscoring.languages import english

damaging = [
    ...
    english.badwords.revision.matches - english.badwords.revision.parent.matches,
    english.badwords.diff.match_prop_delta,
    english.informals.diff.match_prop_delta,
    english.stemmed.revision.stem_prop_delta,
    english.dictionary.non_dict_words_prop_delta,
    ...
]

Note how the sub-namespaces under "english" contain collections of features.
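
To make the "namespaces based on what assets we have" idea concrete, here is a minimal sketch. It is not the revscoring implementation: build_language, the *_features helpers, and the toy assets are all hypothetical names invented for this example. Each helper builds a group of features from a single asset, and a language only gets the sub-namespaces for the assets it supplies.

```python
import re
from types import SimpleNamespace


def dictionary_features(is_word):
    """Feature group that only needs a dictionary lookup function."""
    def non_dict_words(words):
        return [w for w in words if not is_word(w)]
    return SimpleNamespace(non_dict_words=non_dict_words)


def stopword_features(stopwords):
    """Feature group that only needs a set of stopwords."""
    def non_stopwords(words):
        return [w for w in words if w.lower() not in stopwords]
    return SimpleNamespace(non_stopwords=non_stopwords)


def regex_features(patterns):
    """Feature group that only needs a list of regular expressions."""
    combined = re.compile(
        "|".join("(?:%s)" % p for p in patterns), re.IGNORECASE)

    def matches(text):
        return combined.findall(text)
    return SimpleNamespace(matches=matches)


def build_language(name, **assets):
    """Hypothetical builder: attach one sub-namespace per available asset."""
    language = SimpleNamespace(name=name)
    if "is_word" in assets:
        language.dictionary = dictionary_features(assets["is_word"])
    if "stopwords" in assets:
        language.stopwords = stopword_features(assets["stopwords"])
    if "badwords" in assets:
        language.badwords = regex_features(assets["badwords"])
    return language


# A toy language with every asset, and one with stopwords only.
toy_english = build_language(
    "toy_english",
    is_word=lambda w: w.lower() in {"the", "cat", "sat"},
    stopwords={"the"},
    badwords=[r"\bdarn\b"],
)
toy_other = build_language("toy_other", stopwords={"de"})
```

With this shape, toy_english exposes dictionary, stopwords, and badwords sub-namespaces, while toy_other exposes stopwords only, so a feature list can only reference what the language's assets actually support.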

Halfak added a project: revscoring. Dec 23 2015, 3:56 AM

Halfak set Security to None.

Halfak claimed this task. Dec 23 2015, 4:17 AM

Nitpick: shouldn't "stemmed" (verb) be "stems" (noun) for consistency with the other sets of features, which are also nouns?

I was thinking that too. I like the adjectives better since they are describing some aspect of the underlying features.

Regretfully, I couldn't think of an adjective that could apply to "badwords" or "dictionary" without sounding weird.

In a few other places, we can have some nice adjectives like "parsed", "tokenized", and "temporal".

See https://github.com/wiki-ai/revscoring/pull/233

I ended up sticking with "stemmed" but I'm still open to discussion on this point.

ToAruShiroiNeko removed a project: revscoring. Jan 1 2016, 2:36 PM

Halfak moved this task from Review to Completed on the Machine-Learning-Team (Active Tasks) board. Jan 15 2016, 5:57 PM

Halfak closed this task as Resolved. Jan 21 2016, 3:43 PM
