- Run Bad-Words-Detection-System to get potential badword list
- Human review of BWDS list
- Integrate into revscoring
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Ladsgroup | T160279 Deploy ores in prod (Mid-March)
Resolved | | Halfak | T160228 Train/test reverted model for fiwiki
Resolved | | Ladsgroup | T158587 Add language support for Finnish
Event Timeline
Finnish already has a word list, but it's 8 months old. Do we need a new word list, or are we fine with the old one?
@4shadoww, I think the old one should work. 8 months isn't too old for this kind of signal.
Looks great. Are there any more words (or word variants) that you would like to add to the list before we encode it in our modeling library?
As an example, for English, we have many variants of curse words in our tests. E.g. "shit", "sh1t", "shiiit", etc.
I added some more common bad words from fiwiki's abuse filter rules. I think there would be more to add, though, if more is better.
More is generally better. This isn't the last chance to extend the list, though it may be the last chance to extend it directly on the wiki. Future extensions will need to happen in code, but that isn't very difficult. See English Wikipedia's test set for the words we try to match there: https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/tests/test_english.py
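To illustrate what matching word variants might look like, here is a minimal sketch using only Python's standard `re` module. It is not revscoring's actual code or API: the names `BADWORD_PATTERNS`, `BADWORD_RE`, and `find_badwords` are hypothetical, and the single pattern is an assumption built from the English variants mentioned above ("shit", "sh1t", "shiiit").

```python
import re

# Hypothetical pattern list for illustration only; not taken from revscoring.
# One regex covers several spellings: repeated letters and "1" standing in for "i".
BADWORD_PATTERNS = [
    r"sh+[i1]+t+",
]

# Join the patterns into one alternation and require word boundaries on both
# sides so that longer, unrelated words are not matched by accident.
BADWORD_RE = re.compile(
    r"\b(?:" + "|".join(BADWORD_PATTERNS) + r")\b",
    re.IGNORECASE,
)


def find_badwords(text):
    """Return every substring of `text` that matches a badword pattern."""
    return [match.group(0) for match in BADWORD_RE.finditer(text)]


# The variants from the discussion should match; ordinary words should not.
print(find_badwords("shit sh1t shiiit"))
print(find_badwords("a shirt on a ship"))
```

A Finnish list would presumably follow the same idea, with one pattern per word family and a test case for each variant that should (and should not) match.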