Tamil language utilities
Closed, Resolved (Public)

Description

  • generate tf-idf badword lists
  • review and aggregation of badwords/informal words by native speaker
  • implement revscoring.language module

Event Timeline

It's generated. Now we need a native speaker to review the bad words :)

Ladsgroup updated the task description.

Do you need some help here, @Ladsgroup? I don't know what this task is for; I just came across it while randomly going through some tasks. I am a native Tamil speaker.

I just had a quick glance at the list on Meta-Wiki. Most of the words seem to be loanwords from other languages, which we avoid using on Tamil Wikipedia, and some of them seem to be spelling mistakes. Most of them do not look like bad words.

@Shanmugamp7
Awesome, thank you!
What I need is two lists:
1- Bad words: words that should not appear anywhere in Wikipedia. You can see the list of them for English here.
2- Informal words: words that are not okay to use in Wikipedia articles but are okay to use in talk namespaces, like "Hey", "LOL", etc.

Can you do this? You can use the list this bot generated and add or remove anything you want.

Thanks

Sure, I will do it by this weekend.

Halfak triaged this task as Lowest priority. Jul 11 2016, 5:07 PM

https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/ta contains the informal words and badwords. It looks like we have regexes in place of the badwords, and we'll need to turn those back into plain strings in order to test effectively.

I removed the regex styling from the badwords. See below. We need the words in this format in order to write tests. @Shanmugamp7, can you review to make sure I did it right?

  1. பூல்
  2. பூலு
  3. கூதி
  4. தேவுடியாள்
  5. தேவடியாள்
  6. ஓத்த
  7. ஓத்தா
  8. சுன்னி
  9. சுண்ணி
  10. ஓல்
  11. ஓழ்
  12. ஓலு
  13. ஓழு
  14. ஓழி
  15. ஒம்மால
  16. சூத்து
  17. முண்ட
  18. முண்டை
  19. புண்ட
  20. புண்டை
  21. தாயோளி
  22. ஓல்மாரி
  23. ஓழ்மாரி
  24. புழுத்தி
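A minimal sketch of how plain strings like the ones above could be exercised in a test. This is not revscoring's API: `BADWORDS`, `TOKEN_SPLIT`, and `find_badwords` are hypothetical names, only three entries from the list are included, and the matcher is a simple exact-token lookup standing in for whatever regexes the language module ends up using.

```python
import re

# A few entries copied from the reviewed list above; the full list would go here.
BADWORDS = ["பூல்", "கூதி", "சூத்து"]

# Tamil vowel signs and the pulli are combining marks, which Python's \w does
# not match, so \b can fire in the middle of a Tamil word. This sketch avoids
# \b and splits on whitespace and ASCII punctuation instead.
TOKEN_SPLIT = re.compile(r"[\s.,;:!?\"'()\[\]]+")


def find_badwords(text, badwords=BADWORDS):
    """Return the tokens of text that exactly equal an entry in badwords."""
    wanted = set(badwords)
    return [tok for tok in TOKEN_SPLIT.split(text) if tok in wanted]


if __name__ == "__main__":
    sample = "இது " + BADWORDS[1] + " சோதனை."
    print(find_badwords(sample))
```

Exact-token matching will miss inflected or suffixed forms, so a real test would probably also need positive and negative examples for each pattern.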

@Shanmugamp7, we should have more informal words than just "பொட்டை". Is there a Tamil equivalent to "haha", "hello", "goodbye", "silly", "ain't", "awesome", "blah", etc.? You could reference the English informal words to get ideas.

It seems like the aspell-ta package isn't working on Travis (even though it installed correctly). Without alternative dictionaries being in the whitelist, I'm not sure how to proceed other than ignoring the test, as test_hindi.py does (which I'm not a fan of).
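One possible middle ground, sketched under assumptions: skip the dictionary test only when no Tamil dictionary is actually available, so it still runs wherever aspell-ta works. This assumes the dictionary is reached through pyenchant; `dictionary_available` and `TamilDictionaryTest` are hypothetical names, not existing revscoring code.

```python
import unittest


def dictionary_available(lang):
    """Return True if pyenchant imports and reports a dictionary for lang."""
    try:
        import enchant  # raises ImportError if pyenchant or its C library is missing
    except ImportError:
        return False
    return bool(enchant.dict_exists(lang))


@unittest.skipUnless(dictionary_available("ta"), "no Tamil dictionary installed")
class TamilDictionaryTest(unittest.TestCase):
    def test_dictionary_loads(self):
        import enchant

        tamil = enchant.Dict("ta")
        # Real assertions on known good and misspelled words would go here;
        # this only checks that the dictionary answers a lookup.
        self.assertIsInstance(tamil.check("தமிழ்"), bool)
```

The test is then reported as skipped rather than silently ignored, which at least keeps the missing dictionary visible in the CI output.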