
migrate bad words detection to editquality repo
Open, Low, Public

Description

Take the code from https://github.com/wiki-ai/bwds and port it to live in https://github.com/wiki-ai/editquality. The pattern we normally use is to write a new "utility" script (https://github.com/wiki-ai/editquality/tree/master/editquality/utilities) that serves as the command-line frontend. BWDS would become its own package in editquality, and the utilities/bwds.py frontend would be a lightweight wrapper that calls into that package.
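A rough sketch of that split, to make the intended shape concrete. Everything here is an assumption for illustration: the function name `find_bad_word_candidates`, the `--min-count` flag, and the use of argparse are all hypothetical, and the token counting is only a stand-in for the real BWDS logic, which is not reproduced here.

```python
import argparse


# --- Would live in a package such as editquality/bwds/ (hypothetical layout) ---
def find_bad_word_candidates(lines, min_count=2):
    """Stand-in for the ported BWDS logic (NOT the real algorithm).

    Counts whitespace-separated tokens and returns, sorted, those that
    occur at least `min_count` times.
    """
    counts = {}
    for line in lines:
        for token in line.lower().split():
            counts[token] = counts.get(token, 0) + 1
    return sorted(word for word, n in counts.items() if n >= min_count)


# --- Would live in editquality/utilities/bwds.py ---
def main(argv=None):
    """Thin command-line frontend: parse arguments, call the package, print."""
    parser = argparse.ArgumentParser(
        prog="bwds",
        description="Print bad-word candidates found in a text file.")
    parser.add_argument("input", help="path to a UTF-8 text file")
    parser.add_argument("--min-count", type=int, default=2,
                        help="minimum occurrences for a token to be listed")
    args = parser.parse_args(argv)

    with open(args.input, encoding="utf-8") as f:
        for word in find_bad_word_candidates(f, args.min_count):
            print(word)
```

Keeping the frontend this thin means the package function can be unit-tested directly with in-memory lists, without going through the command line.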

It would be great to have unit tests.

Documentation for BWDS: https://www.mediawiki.org/wiki/ORES/BWDS_review
Example of a generated bad words list (sections are collapsed by default): https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/hu

As a later enhancement, we might even want to templatize this step as part of the Makefile. Note that the Makefile is code-generated from templates/Makefile.j2.

Other followup work (should be defined in a new task) would be to streamline how the generated bad-words list is included in our code. Currently, it's pasted into the language support files in revscoring, e.g. https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/albanian.py , but it might be better to dump the results as text files in a "datasets" directory, rather than treat them as code. The interesting NLP twist is that we need to take the raw words and figure out regexes which will match common variations.
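To illustrate the "raw words to regexes" idea, here is a minimal sketch, assuming the word list is stored as plain text with one word per line. The helper names (`load_words`, `word_to_pattern`, `compile_word_list`) and the `LOOKALIKES` table are hypothetical, not part of revscoring or editquality; a real substitution table would need to be language-specific and reviewed for false positives.

```python
import re

# Hypothetical look-alike substitutions; illustrative only.
LOOKALIKES = {"a": "a@4", "e": "e3", "i": "i1!", "o": "o0", "s": "s$5"}


def load_words(lines):
    """Read words from an iterable of lines, skipping blanks and '#' comments."""
    words = []
    for line in lines:
        word = line.strip()
        if word and not word.startswith("#"):
            words.append(word)
    return words


def word_to_pattern(word):
    """Turn a raw word into a regex fragment that tolerates repeated
    letters ("heeeck") and simple look-alike characters ("d4rn")."""
    parts = []
    for ch in word.lower():
        chars = LOOKALIKES.get(ch, ch)
        parts.append("[" + re.escape(chars) + "]+")
    return "".join(parts)


def compile_word_list(words):
    """Compile one case-insensitive regex matching any word as a whole token."""
    if not words:
        raise ValueError("word list is empty")
    alternatives = "|".join(word_to_pattern(w) for w in words)
    # Lookarounds instead of \b so tokens ending in '$' or '@' still match.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
```

For example, a list containing "darn" and "heck" would match "D4rn" and "heeeck" but not "darning" or "heckle". Whether this level of looseness is appropriate would have to be judged per language.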

Event Timeline

Ladsgroup created this task. Apr 5 2016, 4:52 PM
Restricted Application added a subscriber: Aklapper. Apr 5 2016, 4:52 PM
Halfak added a subscriber: Halfak. May 7 2016, 9:24 PM

@Ladsgroup, this one should be easy if you want to take a look.

Halfak triaged this task as Low priority. Jul 6 2016, 3:15 PM
Ladsgroup moved this task from Incoming to Later on the User-Ladsgroup board. Aug 12 2016, 8:54 PM
Restricted Application added a project: artificial-intelligence. Jul 21 2017, 11:10 AM
awight added a subscriber: Vacio. May 18 2018, 4:13 PM
awight updated the task description.
awight updated the task description. May 18 2018, 4:18 PM
awight removed Ladsgroup as the assignee of this task. May 18 2018, 4:21 PM
awight updated the task description.
awight updated the task description. May 18 2018, 4:26 PM
awight updated the task description. May 18 2018, 4:31 PM
awight updated the task description.
Vacio added a comment. May 22 2018, 2:09 PM

Hi guys, I tried to take on this task during the hackathon, and although it's tagged as easy, I found myself struggling a bit :) Is there anyone who could mentor me? I know it might not be the most efficient thing to do (it would probably take less time to do the task yourself than to explain it to me).

What I have been doing so far is playing with process_dump.py (copied from process_api.py), trying to get the revisions from a dump instead of via the API. Still, there are a couple of things I couldn't figure out. E.g. it seems like the dump does not contain a tag for rollbacks / reverts and I don't know what to do there (of course, I can get that from the Wikipedia API, but in that case using the dump becomes a bit meaningless).

So if someone has the time to mentor me, please let me know. Alternatively, if you have a really "simple" task that I can start with to get a better understanding of the workflow of BWDS, maybe that could help a newbie like me get engaged.