
migrate bad words detection to editquality repo
Open, LowPublic

Description

Take the code from https://github.com/wiki-ai/bwds and port it to live in https://github.com/wiki-ai/editquality. The pattern we normally use is to write a new "utility" script (https://github.com/wiki-ai/editquality/tree/master/editquality/utilities) which serves as the command-line frontend. BWDS would become its own package in editquality, and the utilities/bwds.py frontend would be a lightweight wrapper that calls into that package.
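To illustrate the pattern, here is a minimal sketch of what the utilities/bwds.py frontend could look like. Note the hedges: the real editquality utilities use docopt, but argparse is used here to keep the sketch self-contained, and the editquality.bwds package it delegates to does not exist yet (creating it is the point of this task).

```python
# utilities/bwds.py -- hypothetical command-line frontend (sketch only).
# The real editquality utilities use docopt; argparse is used here
# purely so the example runs on its own.
import argparse


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Generate a bad-words list for a wiki (sketch).")
    parser.add_argument("--language", required=True,
                        help="language code of the wiki, e.g. 'hu'")
    args = parser.parse_args(argv)

    # The heavy lifting would live in a bwds package inside editquality,
    # e.g.:  from editquality import bwds; bwds.run(args.language)
    # (that package is hypothetical -- it is what this task would create).
    return args.language


if __name__ == "__main__":
    main()
```

The frontend stays thin on purpose: all detection logic lives in the package, so it can be imported, reused, and unit-tested without going through the CLI.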

It would be great to have unit tests.

Documentation for BWDS: https://www.mediawiki.org/wiki/ORES/BWDS_review
Example of a generated bad words list (sections are collapsed by default): https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/hu

As a later enhancement, we might even want to templatize this step as part of the Makefile. Note that the Makefile is code-generated from templates/Makefile.j2.
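For a sense of what that could mean, a hypothetical fragment of templates/Makefile.j2 might look like the following. The target name, the wikis variable, and the bwds utility invocation are all assumptions for illustration, not existing template code:

```jinja
# Hypothetical addition to templates/Makefile.j2 (sketch, not real code):
{% for wiki in wikis %}
datasets/{{ wiki }}.badwords.txt:
	./utility bwds --language={{ wiki }} > $@
{% endfor %}
```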

Other follow-up work (to be defined in a new task) would be to streamline how the generated bad-words list is included in our code. Currently, it's pasted into the language support files in revscoring, e.g. https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/albanian.py , but it might be better to dump the results as text files in a "datasets" directory rather than treat them as code. The interesting NLP twist is that we need to take the raw words and figure out regexes that will match common variations.
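That regex step could be approached along these lines. This is a minimal sketch of one possible heuristic (tolerating repeated letters), not BWDS's actual logic, and the function name is made up:

```python
import re


def variation_pattern(word):
    """Build a regex matching a word plus simple variations in which
    letters are repeated, so "idiot" also matches "idiiiot".
    (A hypothetical heuristic for illustration, not BWDS's real output.)"""
    # "idiot" -> r"\bi+d+i+o+t+\b", case-insensitive
    body = "".join(re.escape(ch) + "+" for ch in word)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


pattern = variation_pattern("idiot")
print(bool(pattern.search("what an IDIIIOT edit")))  # True
```

A real implementation would need more than letter repetition (leetspeak substitutions, inflected forms, etc.), which is why this step is called out as the interesting NLP part of the follow-up.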

Event Timeline

@Ladsgroup, this one should be easy if you want to take a look.

Halfak triaged this task as Low priority.Jul 6 2016, 3:15 PM
awight updated the task description.

Hi guys, I tried to take this task on myself during the hackathon, and although it's tagged as easy, I found myself struggling a bit :) Is there anyone who could mentor me? I know this might not be the most efficient approach (it would probably take less time to do the task yourself than to explain it to me).

What I have been doing so far is playing with process_dump.py (copied from process_api.py), trying to get the revisions from a dump instead of via the API. Still, there are a couple of things I couldn't figure out. E.g., it seems the dump does not contain a tag for rollbacks/reverts, and I don't know what to do there (of course, I can get that from the Wikipedia API, but in that case using the dump becomes a bit meaningless).

So if someone has the time to mentor me, please let me know. Alternatively, if you have a really "simple" task that I can start with to get a better understanding of the BWDS workflow, that might help a newbie like me get engaged.

Hello, I'm getting familiar with ORES and its models, so I decided to pick up this task to help me learn.
I've gone through the code at https://github.com/wiki-ai/bwds and tried running bad_words_detection_system.py. However, I'm stuck because I see that it uses:

from revscoring.extractors import APIExtractor
from revscoring.datasources import diff

I can't seem to find these modules in revscoring. Have they been renamed?

Also, I'm assuming from mw import api refers to mwapi?

Another thing, I see that there's an open pull request that will cause some major changes. Since the bwds code will live in editquality once this task is done, should I wait until that pull request is merged?
Thanks

jeropbrenda unsubscribed.