
Add language support for Finnish
Closed, Resolved (Public)

Description

Event Timeline

Finnish already has a word list, but it's 8 months old. Do we need a new word list, or are we fine with the old one?

@4shadoww, I think the old one should work. 8 months isn't too old for this kind of signal.

Halfak triaged this task as Medium priority. Feb 23 2017, 3:26 PM
Halfak moved this task from Unsorted to Research & analysis on the Machine-Learning-Team board.
Halfak moved this task from Research & analysis to New development on the Machine-Learning-Team board.
Halfak updated the task description.

I have sorted the word list. Does it look OK?

Looks great. Are there any more words (or word variants) that you would like to add to the list before we encode it in our modeling library?

As an example, for English, we have many variants of curse words in our tests. E.g. "shit", "sh1t", "shiiit", etc.
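To make the idea of variants concrete, here is a minimal sketch of how a single pattern can cover all three spellings. It uses only Python's standard `re` module; the pattern and the `find_badwords` helper are hypothetical illustrations, not the expressions or API actually used in revscoring.

```python
import re

# Hypothetical pattern (not taken from revscoring): each letter may repeat,
# and "1" is accepted as a stand-in for "i".
BADWORD_RE = re.compile(r"\bs+h+[i1]+t+\b", re.IGNORECASE)


def find_badwords(text):
    """Return every whole word in `text` that matches the pattern."""
    return BADWORD_RE.findall(text)


# "shirt" is not matched because the "r" breaks the letter sequence.
matches = find_badwords("shit sh1t shiiit shirt")
```

A Finnish list could take the same approach: one pattern per word stem, with repeated letters and common character substitutions allowed, so that the list does not need a separate entry for every spelling.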

I added some more common bad words from fiwiki's abuse filter rules, though I think there are more that could be added if more is better.

More is generally better. This isn't the last chance to extend the list, though it may be the last chance to extend it directly on the wiki. Future extensions will need to happen in code, but that isn't very difficult. See English Wikipedia's test set for the words we try to match there: https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/tests/test_english.py

I added a few more words. I think it's now ready to be encoded into the modeling library.