- generate tf-idf badword lists
- review and aggregation of badwords/informal words by native speaker
- implement revscoring.language module
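For readers unfamiliar with the first step, here is a minimal sketch of how a tf-idf candidate list could be produced. It is not the bot's actual method: it assumes the input is text added in reverted edits versus text added in kept edits, and the function name `tfidf_candidates` and both parameter names are hypothetical.

```python
import math
from collections import Counter


def tfidf_candidates(reverted_texts, kept_texts, top_n=10):
    """Rank words that are frequent in reverted text but absent from kept text.

    Assumption: each corpus is treated as one "document", so with two
    documents a word that occurs in both gets idf = log(2/2) = 0.
    """
    docs = [
        Counter(" ".join(reverted_texts).lower().split()),
        Counter(" ".join(kept_texts).lower().split()),
    ]
    reverted = docs[0]
    total = sum(reverted.values())

    scores = {}
    for word, count in reverted.items():
        # Document frequency: in how many of the two corpora the word occurs.
        df = sum(1 for doc in docs if word in doc)
        idf = math.log(len(docs) / df)
        scores[word] = (count / total) * idf

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, score in ranked[:top_n] if score > 0]


candidates = tfidf_candidates(
    ["you are stupid stupid", "this page is stupid"],
    ["this page is about rivers", "you can add sources"],
)
```

A list produced this way only contains candidates, which is why the second step (review by a native speaker) is still needed.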
Do you need some help here, @Ladsgroup? I don't know what this task is for, I just came across it while randomly going through some tasks, but I am a native Tamil speaker.
I just had a quick glance at the list on Meta-Wiki. Most of the words seem to be loanwords from other languages, which we avoid using on Tamil wiki, and some of them seem to be spelling mistakes. Most of them do not look like bad words.
Awesome, thank you!
What I need is two lists:
1- Bad words: words that should not appear anywhere in Wikipedia. You can see the list for English here
2- Informal words: words that are not okay to use in Wikipedia articles but are okay to use in talk namespaces, like "Hey", "LOL", etc.
Can you do this? You can use the list this bot generated and add/remove anything you want.
https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/ta contains the informal words and badwords. It looks like we have regexes in place of the badwords, and we'll need to turn those back into plain strings in order to test effectively.
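A rough sketch of what turning the regexes back into strings could look like. It assumes the regex styling is just `\b` word-boundary anchors around an otherwise literal word, which may not match what is actually on the page; the helper names `to_plain_string` and `split_patterns` are made up for illustration. Anything containing other regex syntax is set aside for manual review rather than guessed at.

```python
# Characters that carry special meaning in a regular expression.
REGEX_META = set(".^$*+?{}[]|()\\")


def to_plain_string(pattern):
    """Return the literal word behind a simple pattern, or None if unsure.

    Assumption: the only styling to undo is a leading/trailing \\b anchor.
    """
    core = pattern
    if core.startswith(r"\b"):
        core = core[2:]
    if core.endswith(r"\b"):
        core = core[:-2]
    # Any metacharacter left over means this is not a plain word.
    if any(ch in REGEX_META for ch in core):
        return None
    return core


def split_patterns(patterns):
    """Separate entries that convert cleanly from those needing a human."""
    plain, needs_review = [], []
    for pattern in patterns:
        word = to_plain_string(pattern)
        if word is None:
            needs_review.append(pattern)
        else:
            plain.append(word)
    return plain, needs_review


plain, needs_review = split_patterns([r"\bfoo\b", "bar", r"ba+d", r"\b(x|y)\b"])
```

The plain strings can then be used directly as test inputs, and the `needs_review` entries can be expanded by hand.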
I removed the regex styling from the badwords. See below. We need the words in this format in order to write tests. @Shanmugamp7, can you review to make sure I did it right?