Generate bad words for all languages with more than 100K articles
Closed, Resolved (Public)

Description

List of languages needed:

  • bg
  • be
  • ca
  • cs
  • da
  • el
  • eo
  • eu
  • gl
  • ko
  • hy
  • hi
  • hr
  • ka
  • la
  • lt
  • ms
  • min
  • ce
  • uz
  • pt
  • kk
  • ro
  • sk
  • sl
  • sr
  • sh
  • fi
  • th
  • uk
  • vo
  • ur
  • zh - not done because tokenizing is impossible. Other solutions are being pursued.

Event Timeline

With this commit, running Bad-Words-Detection-System is much easier for new languages. As a test, I'm running it on bg; if it works properly, I'll run the rest in batches of four languages.

Is there a doc explaining to volunteers what they need to contribute to bring ORES to their project? I would like to inform the volunteers in ca.wiki (see topic in Catalan).

Hey @QuimGil,
Thank you for showing interest in this project. It means a lot to me.
We are planning to add a section to the ORES page; follow the discussion here. It's a complicated matter.
I ran my bot to generate bad words for Catalan, and you will have the results here by tomorrow. After that, we need a volunteer (a native speaker) to go through this list and split it into three lists: 1- swear words 2- informal words (such as "Hey" or "LOL", things we shouldn't put into articles but that are okay on talk pages) 3- false positives. After that, I can implement basic support for that language.

"el" and "fi" are working. They'll be done very soon.

Ladsgroup updated the task description.

I'm curious about this task. Could you please sum up what it is about and how these words are generated? I'm especially interested in the Esperanto batch you ran.

Hey, this is automated AI-based software that finds bad words in any given language based on the history of edits in Wikipedia. It finds words that are common in reverted edits but uncommon in non-reverted edits. We do this to build anti-vandalism tools (if you know ORES, we are doing this to add support for more languages in ORES).
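The idea of "common in reverted edits but uncommon in non-reverted edits" can be illustrated with a simple frequency ratio. This is only a minimal sketch and not the actual Bad-Words-Detection-System code: the function names, the regex tokenizer, the add-one smoothing, and the `min_count` and `min_ratio` cut-offs are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def candidate_bad_words(reverted_added, kept_added, min_count=2, min_ratio=3.0):
    """Return words that are relatively frequent in text added by reverted
    edits but rare in text added by edits that were kept.

    Both arguments are lists of strings, one string per edit.
    min_count and min_ratio are hypothetical cut-offs.
    """
    reverted = Counter(w for text in reverted_added for w in tokenize(text))
    kept = Counter(w for text in kept_added for w in tokenize(text))
    reverted_total = sum(reverted.values()) or 1
    kept_total = sum(kept.values()) or 1

    scored = []
    for word, count in reverted.items():
        if count < min_count:
            continue  # too rare in reverted edits to be informative
        reverted_freq = count / reverted_total
        # Add-one smoothing so words never seen in kept edits do not divide by zero.
        kept_freq = (kept[word] + 1) / (kept_total + 1)
        ratio = reverted_freq / kept_freq
        if ratio >= min_ratio:
            scored.append((word, ratio))
    # Highest ratio first, ties broken alphabetically.
    return [word for word, _ in sorted(scored, key=lambda p: (-p[1], p[0]))]


reverted_added = ["you are stupid", "stupid stupid article", "the article is stupid"]
kept_added = [
    "the article is about history",
    "the history of the city",
    "you can read the article",
]
print(candidate_bad_words(reverted_added, kept_added))
```

A list produced this way still contains false positives, which is why the human review step described in this thread is needed.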

We need someone to review this list and separate it into three lists: 1- Swear words, things you shouldn't say on Wikipedia at all. 2- Informal words such as "Hey", which you can't use in articles but which are okay on talk pages. 3- False positives, words that are totally okay to use everywhere.

After that, we can bring ORES to those wikis.

Well, I didn't know about ORES; I will look at it when I have some time. Can I help by reviewing an Esperanto (or French) word list somewhere?

Awesome. French is already there, but for Esperanto, check out this page and split the "Generated list" into the three lists I explained above :)

Thanks :)

Ok, I'll look at that. Probably not this week though. Just glancing at it, I see that there are (as far as I can tell) non-Esperanto words in the main list. For example "somebody": there is no letter "y" in Esperanto. What should I do in such a case?

Also, from a broader point of view, how will the filter work? Bad words are part of knowledge; we may use some, for example, in Profanity.

Is the plan simply to prevent contributions containing words from this list? If so, this could be changed to an "improve your edit" guideline, with a few direct suggestions and links to more detailed documentation. Could you please tell me more, or give me a link with more information, about what behavior will be applied?

> Ok, I'll look at that. Probably not this week though. Just glancing at it, I see that there are (as far as I can tell) non-Esperanto words in the main list. For example "somebody": there is no letter "y" in Esperanto. What should I do in such a case?

You should put them into the "false positives" category.

> Also, from a broader point of view, how will the filter work? Bad words are part of knowledge; we may use some, for example, in Profanity.

The AI behind ORES understands regular cases. To put it simply, we use the proportion of bad words added, not just the number of bad words added.
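The difference between a count and a proportion can be shown with a small sketch. This is not the actual ORES feature code: `bad_word_features`, the regex tokenizer, and the example word list are assumptions made for illustration.

```python
import re


def bad_word_features(added_text, bad_words):
    """Compute both the count and the proportion of bad words in added text.

    bad_words is a set of lowercase words (hypothetical example list).
    """
    tokens = re.findall(r"\w+", added_text.lower())
    bad = sum(1 for token in tokens if token in bad_words)
    # Guard against empty additions to avoid dividing by zero.
    proportion = bad / len(tokens) if tokens else 0.0
    return {
        "bad_words_added": bad,
        "words_added": len(tokens),
        "bad_word_proportion": proportion,
    }


bad_words = {"stupid"}
vandalism = bad_word_features("stupid stupid stupid", bad_words)
article = bad_word_features(
    "The word stupid is considered mildly offensive in English", bad_words
)
print(vandalism["bad_word_proportion"], article["bad_word_proportion"])
```

An edit that discusses a bad word inside otherwise ordinary prose gets a low proportion, while an edit consisting mostly of bad words gets a high one, so a legitimate article such as Profanity is not treated like vandalism on the count alone.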

> Is the plan simply to prevent contributions containing words from this list? If so, this could be changed to an "improve your edit" guideline, with a few direct suggestions and links to more detailed documentation. Could you please tell me more, or give me a link with more information, about what behavior will be applied?

After deployment, it will give each and every edit a score between zero and one. If the score is close to one, the edit is probably vandalism. We use these scores in numerous anti-vandalism tools such as Huggle, RTRC, anti-vandalism bots, etc.
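How a tool might consume such scores can be sketched as follows. This is a hypothetical illustration, not how Huggle, RTRC, or any bot actually works: the function name, the revision IDs, and the 0.8 threshold are all assumptions.

```python
def edits_to_review(scores, threshold=0.8):
    """Return (revision_id, score) pairs at or above the threshold,
    most suspicious first.

    scores maps a revision ID to a score in [0, 1]; the threshold is a
    hypothetical value that each tool would choose for itself.
    """
    for rev_id, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for revision {rev_id} is outside [0, 1]")
    flagged = [(rev_id, score) for rev_id, score in scores.items() if score >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up revision IDs and scores, for illustration only.
scores = {101: 0.05, 102: 0.93, 103: 0.81}
print(edits_to_review(scores))
```

The scores only rank edits for attention; what happens to a flagged edit is up to the tool or the human reviewer using it.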

@Ladsgroup if you attend Wikimania and have time for this, I hope we can talk about this task.

Sure :)