- generate TF-IDF badword lists
- have a native speaker review and aggregate the badwords/informal words
- implement revscoring.language.Language (Language utility)
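For the first step, here is a rough sketch of how TF-IDF candidates could be ranked. It is only an illustration under assumptions, not the project's actual script: `tfidf_candidates` and its parameters are hypothetical names, each edit is treated as one document, and whitespace splitting stands in for a real tokenizer (Japanese text would need a proper word segmenter).

```python
import math
from collections import Counter


def tfidf_candidates(reverted_edits, all_edits, top_n=10):
    """Rank words that are frequent in reverted edits but rare across all edits.

    Hypothetical helper: `reverted_edits` and `all_edits` are lists of strings,
    one string of added text per edit.
    """
    # Term frequency: how often each word appears in reverted edits.
    tf = Counter(w for edit in reverted_edits for w in edit.lower().split())
    # Document frequency: in how many edits (of any kind) each word appears.
    df = Counter(w for edit in all_edits for w in set(edit.lower().split()))
    n_docs = len(all_edits)
    # Skip words that never occur in all_edits to avoid dividing by zero.
    scores = {
        w: count * math.log(n_docs / df[w])
        for w, count in tf.items()
        if df[w]
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


reverted = ["you are stupid", "stupid stupid page"]
everything = reverted + [
    "the page is about history",
    "you can edit the page",
    "history of the page",
]
candidates = tfidf_candidates(reverted, everything)
```

The output is only a candidate list; it still needs the native-speaker review from the second step before anything is used as a feature.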
Maybe @whym could help us with this (if I'm remembering right that he's a native Japanese speaker).
We're looking to get a list of Japanese "badwords" (curses, racial slurs, offensive language) and "informals" (silly/informal talk, "haha", "lol"). I've pasted a link above to a list of insults that we might incorporate as features in our vandalism detection models.
Would you be willing to review that list (and/or others) and give us a list of "badwords" and "informals" that seem common? Extra points if you can include the variant spellings that vandals are likely to use. E.g. the English curse "shit" is often extended to "shiiiit" or written as "$hit". See https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/en for how we thought about it for English but please don't do regexes *yet*. Once we develop the regexes, we'll want the original list of example bad/informal words to test against.
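To show why the plain example list matters, here is a minimal sketch of how a regex could later be checked against it. The pattern and the helper names (`SHIT_VARIANTS`, `unmatched`, `false_positives`) are made up for illustration and are not taken from the English word list page.

```python
import re

# Illustrative pattern only: allows "$" for "s" and repeated letters.
# Lookarounds are used instead of \b because "$" is not a word character.
SHIT_VARIANTS = re.compile(r"(?<!\w)[s$]h+i+t+(?!\w)", re.IGNORECASE)

# Example words collected by a reviewer, plus words that must not match.
EXAMPLES = ["shit", "shiiiit", "$hit"]
NON_EXAMPLES = ["shirt", "shift", "this"]


def unmatched(pattern, words):
    """Return the example words the pattern fails to match in full."""
    return [w for w in words if not pattern.fullmatch(w)]


def false_positives(pattern, words):
    """Return the harmless words the pattern wrongly matches."""
    return [w for w in words if pattern.search(w)]


missed = unmatched(SHIT_VARIANTS, EXAMPLES)
wrong = false_positives(SHIT_VARIANTS, NON_EXAMPLES)
```

Both lists should come back empty; anything left in `missed` or `wrong` points at a regex that needs another pass.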
Should the list go to https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/ja? I see that regexes were added there in January, which might need to be removed/replaced.
http://www.sljfaq.org/afaq/insults.html is already partly incorporated into the wiki page. The rest may or may not be useful - I have never seen a vandal use those words. Maybe we can try adding them later?
I can instead try extracting some words from Japanese Wikipedia's abuse filters (which should be useful assuming that wiki is the main target).
Good points. +1 for extracting from jawiki's abuse filter. If you could give us either examples of the raw content that Abuse Filter is intended to match or *both* the raw content and the regexes, that would be great!