- generate tf-idf badword lists
- review and aggregation of badwords/informal words by native speaker
- implement revscoring.language module
Description
Event Timeline
Do you need some help here, @Ladsgroup? I don't know what this task is for, but I saw it while randomly going through some tasks. I am a native Tamil speaker.
I just had a quick glance at the list on Meta-Wiki. Most of the words seem to be loanwords from other languages, which we avoid using on Tamil Wikipedia, and some of them seem to be spelling mistakes. Most of them do not seem like bad words.
@Shanmugamp7
Awesome, thank you!
What I need is two lists:
1- Bad words: words that should not appear anywhere in Wikipedia. You can see the list for English here.
2- Informal words: words that are not okay to use in Wikipedia articles but are okay to use in talk namespaces, like "Hey", "LOL", etc.
Can you do this? You can use the list this bot generated and add/remove anything you want.
Thanks
https://github.com/travis-ci/apt-package-whitelist/blob/master/ubuntu-precise It looks like we have "aspell-ta" in Travis's Precise image.
https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/ta contains the informal words and badwords. It looks like we have regexes in place of the badwords, and we'll need to turn those back into plain strings in order to test effectively.
I removed the regex styling from the badwords. See below. We need the words in this format in order to write tests. @Shanmugamp7, can you review to make sure I did it right?
- பூல்
- பூலு
- கூதி
- தேவுடியாள்
- தேவடியாள்
- ஓத்த
- ஓத்தா
- சுன்னி
- சுண்ணி
- ஓல்
- ஓழ்
- ஓலு
- ஓழு
- ஓழி
- ஒம்மால
- சூத்து
- முண்ட
- முண்டை
- புண்ட
- புண்டை
- தாயோளி
- ஓல்மாரி
- ஓழ்மாரி
- புழுத்தி
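To illustrate why the plain-string format matters for testing, here is a minimal Python sketch of the kind of check such a list makes possible. It is not the revscoring API: `BADWORDS`, `normalize`, and `find_badwords` are hypothetical names, the sample holds only three entries from the list above, and the whitespace tokenizer is an assumption chosen to sidestep regex word-boundary behaviour around Tamil combining marks.

```python
import string
import unicodedata

# A small sample taken from the plain-string list above (hypothetical name).
BADWORDS = ["முண்டை", "தாயோளி", "புழுத்தி"]


def normalize(word):
    """Return the NFC form so equivalent code point sequences compare equal."""
    return unicodedata.normalize("NFC", word)


def find_badwords(text, badwords):
    """Return the whitespace-separated tokens of `text` that are in `badwords`."""
    lookup = {normalize(w) for w in badwords}
    # Strip ASCII punctuation from each token before comparing.
    tokens = (normalize(t.strip(string.punctuation)) for t in text.split())
    return [t for t in tokens if t in lookup]
```

With plain strings, each listed word can be fed through a matcher and asserted on directly, which is not possible when the list entries are themselves regex fragments.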
@Shanmugamp7, we should have more informal words than just "பொட்டை". Is there a Tamil equivalent to "haha", "hello", "goodbye", "silly", "ain't", "awesome", "blah", etc.? You could reference the English informal words to get ideas.
It seems like the aspell-ta package isn't working on Travis (even though it installed correctly). Without alternative dictionaries being in the whitelist, I'm not sure how to proceed other than skipping the test, as test_hindi.py does (which I'm not a fan of).
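One possible middle ground, sketched below, is to skip the dictionary-dependent test only when no Tamil dictionary can be loaded, so it still runs wherever the dictionary works. This assumes the dictionary is accessed through pyenchant, which this thread does not confirm; `tamil_dictionary_available` and `TamilDictionaryTest` are hypothetical names, not part of the project's test suite.

```python
import unittest


def tamil_dictionary_available():
    """Return True only if pyenchant imports and reports a "ta" dictionary."""
    try:
        import enchant
    except ImportError:
        # pyenchant (or its underlying C library) is not installed.
        return False
    return bool(enchant.dict_exists("ta"))


class TamilDictionaryTest(unittest.TestCase):

    @unittest.skipUnless(tamil_dictionary_available(),
                         "no Tamil dictionary installed")
    def test_dictionary_loads(self):
        import enchant
        dictionary = enchant.Dict("ta")
        # check() should return a boolean for any word it is given.
        self.assertIsInstance(dictionary.check("தமிழ்"), bool)
```

A skipped test is reported as skipped rather than passed, so the gap stays visible in the test output instead of being silently ignored.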