Add a denylist of words. First, research what this would involve. Adding regexes to the current model might be tricky since inference needs to stay fast.
- Model source: https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/120591b209c771e8c8773cef2cb190b89c617031/knowledge_integrity/models/revertrisk.py
- Word regexes used by ORES: https://github.com/wikimedia/revscoring/tree/5a3618e4ce6f93f1e571abf8b3600726d7295138/revscoring/languages
Caveats:
- Regexes are not available for all languages, whereas the premise of the language-agnostic model is that it supports all languages.
- Testing a large number of regexes against every revision might be slow. Current throughput for the model hosted on Liftwing is 202.34 queries per second when queried by 25 parallel workers.
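To get a first feel for the regex cost mentioned above, a minimal timing sketch could compare testing each pattern separately against testing one combined alternation. The patterns below are hypothetical placeholders, not the actual revscoring regexes, and the helper names are made up for illustration; real numbers would have to come from running against actual revisions.

```python
import re
import time

# Hypothetical placeholder patterns; the real ones would come from revscoring.
PATTERNS = [r"foo\w*", r"bar+", r"baz", r"qux(?:es)?"]

# Option A: one compiled regex per pattern, tried in turn.
SEPARATE = [re.compile(rf"\b(?:{p})\b", re.IGNORECASE) for p in PATTERNS]

# Option B: a single alternation, so the text is scanned once.
COMBINED = re.compile(r"\b(?:" + "|".join(PATTERNS) + r")\b", re.IGNORECASE)


def has_match_separate(text: str) -> bool:
    """Return True if any of the separately compiled patterns matches."""
    return any(rx.search(text) for rx in SEPARATE)


def has_match_combined(text: str) -> bool:
    """Return True if the combined alternation matches."""
    return COMBINED.search(text) is not None


def seconds_per_call(fn, text: str, repeats: int = 1000) -> float:
    """Average wall-clock seconds per call of fn(text) over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(text)
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    sample = "some revision text with a fooish word in it " * 50
    print("separate:", seconds_per_call(has_match_separate, sample))
    print("combined:", seconds_per_call(has_match_combined, sample))
```

Which variant is faster depends on the patterns and the regex engine, so this only gives a rough comparison on one machine and not a replacement for benchmarking the deployed model.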
Tasks:
- Incorporate the regexes from ORES into the model and benchmark performance. [Low]
- If too slow, explore alternatives such as multi-pattern substring matching algorithms, e.g. Aho-Corasick. [Medium]
- Look into building an automated denylist of words for all languages. [High]
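For the Aho-Corasick option in the task list, a rough pure-Python sketch of the algorithm is given below, to show what multi-pattern substring matching would look like. The class and method names are assumptions for illustration only. Note that it matches plain substrings, so it does not reproduce regex features such as word boundaries or wildcards; a production version would more likely use an optimized library.

```python
from collections import deque


class AhoCorasick:
    """Illustrative multi-pattern substring matcher (not tuned for speed)."""

    def __init__(self, words):
        self.goto = [{}]   # per-node outgoing edges: char -> node id
        self.fail = [0]    # per-node failure link
        self.out = [[]]    # per-node list of words ending at that node
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        # Walk or extend the trie one character at a time.
        node = 0
        for ch in word:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep failure link 0 (root).
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find(self, text):
        """Return (start_index, word) for every occurrence in text."""
        node = 0
        hits = []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                hits.append((i - len(word) + 1, word))
        return hits


if __name__ == "__main__":
    matcher = AhoCorasick(["he", "she", "his", "hers"])
    print(matcher.find("ushers"))
```

The automaton is built once per denylist and each revision is then scanned in a single pass, with the work depending on text length plus number of matches and not on the number of words. Case-insensitive matching would need the text and the words normalized (for example lowercased) before use.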