- Run Bad-Words-Detection-System on nowiki
- Human review of lists (contact @Galar71)
- Integrate reviewed lists into revscoring
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Ladsgroup | T131856 Train/test `reverted` model for nowiki
Resolved | | Ladsgroup | T131855 Language assets for Norwegian
Event Timeline
@Ladsgroup, can you run Bad-Words-Detection-System for nowiki?
@Galar71, once the run is finished, a page will be generated at https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/no. From there, we'll ask you to review and sort the output lists. Once you are done, we'll take over, integrate the resulting lists into revscoring, and start building prediction models.
Looks like we're ready to go: https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/no
@Galar71, could you review the "generated list" and split it into "badwords" and "informal words"? You can ignore the "common words" for now.
Hi...
I added some "bad" and "informal" words, but not necessarily all from the generated list... I can take a closer look at this if it's wrong to add words that are not in the generated list... I just did a quick review (a bit short on time right now), but I can redo it from the generated list if that's how it's supposed to be done.
A couple of questions in that regard:
- Should ALL the words in the generated list be split into the Bad and Informal categories, or is a subset enough?
- Is it ok to add words to the "bad" and "informal" categories that do not exist in the generated list?
Best Regards,
Galar71
Hey, thank you for your help!
Do as much as you think is enough, though it would be great to have all the possible bad words. If a word is a total false positive, say "hamster", you can ignore it.
- Is it ok to add words to the "bad" and "informal" categories that do not exist in the generated list?
Totally. The generated list is only a helper.
Hi, @Halfak.
I'll go through it, hopefully in the next few days, and do a more thorough review of the generated list, and then I'll copy the relevant words to the "Bad" and "Informal" lists. I'll make sure to leave "hamster" out. ;)
Best Regards,
Galar71
As someone who can read Norwegian, it looks finished.
(I'd possibly move "cool" from "bad words" to "informal".)
I went through and copied some commonly used insults into the bad words list. I noticed that the generated list contains quite a lot of variations. Will you be stemming the words?
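To make the stemming question concrete, here is a minimal sketch of how variations in a word list could be collapsed under a shared stem. It is only an illustration: the suffix list, `naive_stem`, and `group_by_stem` are hypothetical names made up for this example, the input words are invented, and none of this is taken from revscoring or Bad-Words-Detection-System. A real integration would more likely rely on an established stemmer, such as the Snowball stemmer for Norwegian, than on a hand-written suffix list.

```python
from collections import defaultdict

# Hypothetical suffix list, for illustration only.
# This is NOT a real Norwegian stemmer and will over- or under-strip many words.
SUFFIXES = ("ene", "ane", "er", "en", "et", "a", "e")


def naive_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    # Try longer suffixes first so "ene" wins over "e".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the sorted list of variants that reduce to it."""
    groups = defaultdict(set)
    for word in words:
        groups[naive_stem(word)].add(word.lower())
    return {stem: sorted(variants) for stem, variants in groups.items()}


if __name__ == "__main__":
    # Invented example input, not taken from the generated list.
    for stem, variants in group_by_stem(["idiot", "idioten", "idioter", "idiotene"]).items():
        print(stem, variants)
```

With grouping like this, a reviewer would only need to sort one entry per stem instead of every inflected form, at the cost of occasional wrong merges that still need a human check.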