Japanese language utilities
Closed, Resolved, Public


  • generate TF-IDF badword lists
  • review and aggregation of badwords/informal words by a native speaker
  • implement revscoring.language.Language (Language utility)
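
The first item can be illustrated with a rough sketch. This is not the project's actual pipeline, only a minimal TF-IDF ranking under stated assumptions: documents are already tokenized (Japanese has no spaces, so a real run would need a tokenizer or character n-grams), `target_docs` stands for text added in reverted edits, and `background_docs` stands for text from ordinary edits. The function names and the sample words are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(target_docs, background_docs):
    """Score tokens seen in target_docs.

    Term frequency is counted over target_docs only; document frequency
    is counted over target_docs + background_docs. Each doc is a list of
    tokens (tokenization is assumed to have happened already).
    """
    all_docs = target_docs + background_docs
    n_docs = len(all_docs)

    # Number of documents each token appears in at least once.
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc))

    # Raw count of each token across the target documents.
    term_freq = Counter(tok for doc in target_docs for tok in doc)

    return {
        tok: tf * math.log(n_docs / doc_freq[tok])
        for tok, tf in term_freq.items()
    }


def top_candidates(target_docs, background_docs, n=10):
    """Return the n highest-scoring tokens, ties broken alphabetically."""
    scores = tfidf_scores(target_docs, background_docs)
    return sorted(scores, key=lambda tok: (-scores[tok], tok))[:n]


# Placeholder data, invented for illustration only.
reverted = [["ばか", "の"], ["ばか", "あほ", "の"], ["ばか"]]
ordinary = [["の", "記事"], ["の", "歴史"]]
candidates = top_candidates(reverted, ordinary, n=2)
```

A list produced this way is only a set of candidates; the second item (review by a native speaker) is what turns it into a usable badword list.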

Event Timeline

ToAruShiroiNeko claimed this task.
ToAruShiroiNeko raised the priority of this task from to High.
ToAruShiroiNeko updated the task description.
ToAruShiroiNeko added a subscriber: ToAruShiroiNeko.
Restricted Application added a subscriber: Aklapper. Nov 6 2015, 6:00 PM
ToAruShiroiNeko renamed this task from Japanese language Utilities to Japanese language utilities. Nov 6 2015, 6:01 PM
ToAruShiroiNeko updated the task description.
ToAruShiroiNeko set Security to None.
This comment was removed by ToAruShiroiNeko.
Halfak added a subscriber: Halfak. Nov 13 2015, 2:45 PM

@ToAruShiroiNeko, were you able to find a Japanese speaker to help us with the TF-IDF lists?

@ToAruShiroiNeko Maybe you can find me a badword list from somewhere else?

Halfak claimed this task. Dec 4 2015, 6:37 PM
Halfak reassigned this task from Halfak to ToAruShiroiNeko. Jan 1 2016, 6:26 PM
Halfak added a subscriber: whym. Feb 24 2016, 8:17 PM

Maybe @whym could help us with this (if I'm remembering right that he's a native Japanese speaker).

We're looking to get a list of Japanese "badwords" (curses, racial slurs, offensive language) and "informals" (silly/informal talk, "haha", "lol"). I've pasted a link above to a list of insults that we might incorporate as features in our vandalism detection models.

Would you be willing to review that list (and/or others) and give us a list of "badwords" and "informals" that seem common? Extra points if you can format them the way vandals are likely to write them. E.g., the English curse "shit" is often extended to "shiiiit" or written as "$hit". See for how we thought about it for English, but please don't do regexes *yet*. Once we develop the regexes, we'll want the original list of example bad/informal words to test against.
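
To make the point about testing concrete, here is a minimal sketch using the English example above. It uses only Python's standard `re` module; the pattern, the `check_pattern` helper, and the word lists are hypothetical illustrations, not the regexes actually used in the project. It shows why the raw example list matters: the pattern is only trusted once it matches every collected variant and none of the look-alikes.

```python
import re

# Hypothetical pattern for the English example: allows "$" for "s" and
# repeated letters. Lookarounds are used instead of \b because "$" is
# not a word character, so \b would not fire in front of "$hit".
SHIT_VARIANT = re.compile(r"(?<!\w)[s$]h+i+t+(?!\w)", re.IGNORECASE)


def check_pattern(pattern, should_match, should_not_match):
    """Return (missed, false_hits) for a pattern against example lists."""
    missed = [w for w in should_match if not pattern.search(w)]
    false_hits = [w for w in should_not_match if pattern.search(w)]
    return missed, false_hits


# Raw examples as a reviewer might supply them (illustrative only).
examples = ["shit", "shiiiit", "$hit", "SHIT"]
# Innocent words that a sloppy pattern could hit by accident.
look_alikes = ["shirt", "shiitake", "ship"]

missed, false_hits = check_pattern(SHIT_VARIANT, examples, look_alikes)
```

Both returned lists should be empty for a pattern that is ready to use; any entry in `missed` or `false_hits` points at a variant the regex still needs to handle.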

whym added a comment. Feb 26 2016, 1:40 PM

Should the list go to I see regexes added in January there, which might need to be removed/replaced. is in part already incorporated in the wiki page. The rest may or may not be useful - I have never seen a vandal using them. Maybe we can try adding later?

I can instead try extracting some words from Japanese Wikipedia's abuse filters (which should be useful assuming that wiki is the main target).

Good points. +1 for extracting from jawiki's abuse filter. If you could give us either examples of the raw content that Abuse Filter is intended to match or *both* the raw content and the regexes, that would be great!

This is now merged.

Ladsgroup closed this task as Resolved. Apr 26 2016, 3:06 PM