Suggested by ԱշոտՏՆՂ
We'd store our lists of words on a wiki, periodically re-read them from the wiki, and keep a snapshot in revscoring.
Just renamed this to be a little clearer to me. I'm not quite sure how we'd set this up. We need a lot of determinism in revscoring for things to work. But it's possible that we can store snapshots in revscoring to account for changes on the wiki.
Yeah. I think this is really interesting. We'd need to do some thinking about how it could work with our pipelines for building models.
Here are some examples of existing lists, of varying quality and formats, used by other tools:
There are also some edit filters which contain such lists:
And for typos:
It is not uncommon for a good-faith edit to add a new expression (or a badly written regex) to such lists and thereby break, to some extent, the tools which use them (e.g. by increasing their false positives).
We can probably handle the breaking changes by having a manual step where we pull a new version of a badwords list from the wiki. If the fitness measures go down, we know something was broken.
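To make that manual step concrete, here is a rough sketch of what it could look like. None of this is revscoring's actual API: `Snapshot`, `make_snapshot`, `maybe_update` and `toy_fitness` are hypothetical names, the "one pattern per line, `#` for comments" wiki format is an assumption, and fetching the page text from the wiki is left out. The `fitness` callable stands in for whatever measure the model-building pipeline reports.

```python
import hashlib
import re
from typing import Callable, List, NamedTuple


class Snapshot(NamedTuple):
    patterns: List[str]  # one regex per entry, as pulled from the wiki
    digest: str          # content hash, so a model build can pin an exact version


def make_snapshot(wiki_text: str) -> Snapshot:
    """Parse raw page text (assumed format: one pattern per line, '#' comments)."""
    patterns = []
    for line in wiki_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        re.compile(line)  # raises re.error on a badly written regex
        patterns.append(line)
    digest = hashlib.sha256("\n".join(patterns).encode("utf-8")).hexdigest()
    return Snapshot(patterns, digest)


def maybe_update(
    current: Snapshot,
    wiki_text: str,
    fitness: Callable[[List[str]], float],
    tolerance: float = 0.0,
) -> Snapshot:
    """Return the new snapshot only if it parses and fitness does not go down."""
    try:
        candidate = make_snapshot(wiki_text)
    except re.error:
        return current  # broken regex on the wiki: keep the pinned snapshot
    if candidate.digest == current.digest:
        return current  # nothing changed
    if fitness(candidate.patterns) + tolerance < fitness(current.patterns):
        return current  # fitness went down: something was probably broken
    return candidate


def toy_fitness(patterns: List[str]) -> float:
    """Stand-in fitness: accuracy on two hand-labeled example edits."""
    labeled = [("you are a dummy", True), ("added a citation", False)]
    regex = re.compile("|".join(patterns)) if patterns else None
    correct = 0
    for text, is_bad in labeled:
        matched = bool(regex and regex.search(text))
        correct += matched == is_bad
    return correct / len(labeled)
```

With this, a build would only ever read `current.patterns`, so it stays deterministic no matter what happens on the wiki in the meantime. For example, starting from `current = make_snapshot("dummy")`, calling `maybe_update(current, "dummy\na", toy_fitness)` keeps the old snapshot, because the overly broad pattern `a` also matches the good edit and fitness drops.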