Tamil language utilities
Closed, ResolvedPublic

Description

  • generate tf-idf badword lists
  • review and aggregation of badwords/informal words by native speaker
  • implement revscoring.language module
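As a rough illustration of the first item, a tf-idf style ranking of candidate bad words could look like the sketch below. This is not the script that generated the list on Meta-Wiki. It assumes that text added by reverted edits and text added by kept edits are available as plain strings; the function name `tfidf_candidates` and the toy English data are hypothetical.

```python
import math
from collections import Counter


def tfidf_candidates(reverted_docs, kept_docs, top_n=10):
    """Rank words from reverted edits by a tf-idf style score.

    tf  = how often the word appears across the reverted edits
    idf = log(total edits / edits containing the word), so words that
          show up in nearly every edit score close to zero.
    """
    all_docs = [doc.split() for doc in reverted_docs + kept_docs]
    n_docs = len(all_docs)

    # Document frequency: number of edits containing each word at least once.
    doc_freq = Counter()
    for words in all_docs:
        doc_freq.update(set(words))

    # Term frequency, counted over reverted edits only.
    term_freq = Counter()
    for doc in reverted_docs:
        term_freq.update(doc.split())

    scores = {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Toy data standing in for text added by reverted and kept edits.
reverted = ["you are stupid stupid", "stupid article the"]
kept = ["the article is good", "the history of the city"]
print(tfidf_candidates(reverted, kept, top_n=3))
```

On the toy data, "stupid" ranks first because it is frequent in reverted edits and absent from kept ones. A list produced this way still needs the native-speaker review from the second item, since common loanwords and misspellings can also score highly.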

Event Timeline

Restricted Application added a subscriber: Aklapper. · May 1 2016, 12:28 PM

The bad words should be on the Meta-Wiki page very soon.

Ladsgroup moved this task from Active to Backlog on the Scoring-platform-team (Current) board.

It's generated. Now we need a native speaker to review the bad words :)

Ladsgroup removed Ladsgroup as the assignee of this task. · May 1 2016, 2:08 PM
Ladsgroup updated the task description.

Do you need some help here, @Ladsgroup? I don't know what this task is for; I just came across it while browsing through some tasks at random. I am a native Tamil speaker.

I just had a quick glance at the list on Meta-Wiki. Most of the words seem to be loanwords from other languages, which we avoid using on Tamil wiki, and some of them seem to be spelling mistakes. Most of them do not look like bad words.

@Shanmugamp7
Awesome, thank you!
What I need is two lists:
1- Bad words: words that should not appear anywhere in Wikipedia. You can see the list of them for English here.
2- Informal words: words that are not okay to use in Wikipedia articles but are okay to use in talk namespaces, like "Hey", "LOL", etc.

Can you do this? You can start from the list the bot generated and add or remove anything you want.

Thanks

Shanmugamp7 added a comment. · Edited · May 3 2016, 4:55 PM

Sure, I will do it by this weekend.

Halfak triaged this task as Lowest priority. · Jul 11 2016, 5:07 PM
Halfak assigned this task to schana. · Jul 25 2016, 4:48 PM
Halfak updated the task description. · Aug 1 2016, 4:52 PM
Halfak added a comment. · Aug 2 2016, 2:19 PM

https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/ta contains the informal words and the bad words. It looks like we have regexes in place of the bad words, and we'll need to turn those back into plain strings in order to test effectively.
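One way to turn simple regexes back into literal strings is to enumerate everything they can match. The sketch below is only an illustration under stated assumptions: it handles a small, hypothetical subset of regex syntax (literal characters, character classes like `[ab]`, flat groups like `(a|b)` or `(?:a|b)`, and a trailing `?`), and the example patterns are made up for this purpose, not the ones on the wiki page. The helper name `expand` is also hypothetical.

```python
import itertools
import re

# One "atom" of the restricted pattern language: a flat group of
# alternatives, a character class, or a single literal character,
# each optionally followed by "?".
ATOM = re.compile(
    r"\((?:\?:)?([^()]*)\)(\??)|\[([^\]]+)\](\??)|(.)(\??)", re.DOTALL
)


def expand(pattern):
    """Return every string matched by a pattern in the restricted subset."""
    choices = []
    pos = 0
    while pos < len(pattern):
        m = ATOM.match(pattern, pos)
        group, g_opt, cls, c_opt, lit, l_opt = m.groups()
        if group is not None:
            options, optional = group.split("|"), g_opt
        elif cls is not None:
            options, optional = list(cls), c_opt
        else:
            options, optional = [lit], l_opt
        if optional:
            # "?" means the atom may also be absent.
            options = options + [""]
        choices.append(options)
        pos = m.end()
    # Every combination of one option per atom is a matched string.
    return ["".join(parts) for parts in itertools.product(*choices)]


# Hypothetical patterns; as in Python's re module, "?" applies to the
# single code point before it, here the final vowel sign.
print(expand("புண்டை?"))
print(expand("ab?[cd](e|f)"))
```

With the first pattern, the expansion yields the two spellings with and without the final vowel sign. Anything outside the subset (nested groups, `*`, `+`, escapes) is not handled and would need a real regex parser.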

Halfak added a comment. · Edited · Aug 2 2016, 2:23 PM

I removed the regex styling from the badwords. See below. We need the words in this format in order to write tests. @Shanmugamp7, can you review to make sure I did it right?

  1. பூல்
  2. பூலு
  3. கூதி
  4. தேவுடியாள்
  5. தேவடியாள்
  6. ஓத்த
  7. ஓத்தா
  8. சுன்னி
  9. சுண்ணி
  10. ஓல்
  11. ஓழ்
  12. ஓலு
  13. ஓழு
  14. ஓழி
  15. ஒம்மால
  16. சூத்து
  17. முண்ட
  18. முண்டை
  19. புண்ட
  20. புண்டை
  21. தாயோளி
  22. ஓல்மாரி
  23. ஓழ்மாரி
  24. புழுத்தி

Halfak added a comment. · Aug 2 2016, 2:28 PM

@Shanmugamp7, we should have more informal words than just "பொட்டை". Is there a Tamil equivalent to "haha", "hello", "goodbye", "silly", "ain't", "awesome", "blah", etc.? You could reference the English informal words to get ideas.

It seems like the aspell-ta package isn't working on Travis (even though it installed correctly). Without alternative dictionaries being in the whitelist, I'm not sure how to proceed other than ignoring the test, as is done in test_hindi.py (which I'm not a fan of).
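A middle ground between running the test unconditionally and ignoring it outright is to skip it only when the dictionary really is unavailable. The following is a minimal sketch, assuming the dictionary is loaded through pyenchant and exposed under the tag "ta"; the helper `dictionary_available` and the test class are hypothetical and are not taken from the revscoring test suite.

```python
import unittest

try:
    import enchant  # pyenchant; the import fails if the C library is missing
except ImportError:
    enchant = None


def dictionary_available(tag):
    """Return True only if enchant is importable and has a dictionary for `tag`."""
    if enchant is None:
        return False
    return bool(enchant.dict_exists(tag))


@unittest.skipUnless(dictionary_available("ta"),
                     "no Tamil enchant dictionary installed")
class TamilDictionaryTest(unittest.TestCase):
    def test_dictionary_loads(self):
        # Only reached when the dictionary exists on this machine.
        tamil = enchant.Dict("ta")
        self.assertIsInstance(tamil.check("தமிழ்"), bool)
```

The test still runs wherever the dictionary is installed, and the skip reason shows up in the test report, so a missing dictionary on the CI machine stays visible instead of being silently ignored.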

Ladsgroup closed this task as Resolved. · Aug 16 2016, 6:32 PM
Restricted Application added a project: artificial-intelligence. · Jun 7 2017, 6:42 PM