Take the code from https://github.com/wiki-ai/bwds and port it to live in https://github.com/wiki-ai/editquality. The pattern we normally use is to write a new "utility" script (https://github.com/wiki-ai/editquality/tree/master/editquality/utilities) which is the command-line frontend. BWDS would become its own package in editquality, and the utilities/bwds.py frontend would be a lightweight wrapper that calls into that package.
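A rough sketch of what the "lightweight wrapper" split could look like. Everything here is an assumption for illustration: `find_bad_words` is a hypothetical stand-in for whatever entry point the ported package ends up exposing (it just counts tokens, which is not the real BWDS logic), the argument names are made up, and argparse is used only to keep the sketch stdlib-only; the existing utilities may follow a different CLI convention.

```python
import argparse
import sys


def find_bad_words(lines, min_count=2):
    """Hypothetical stand-in for the ported package's entry point.

    Counts whitespace-separated tokens (case-insensitively) and returns,
    in sorted order, those seen at least `min_count` times. The real BWDS
    logic would live in the package and replace this function.
    """
    counts = {}
    for line in lines:
        for token in line.lower().split():
            counts[token] = counts.get(token, 0) + 1
    return sorted(word for word, count in counts.items() if count >= min_count)


def build_parser():
    # Hypothetical arguments; the real utility's options are not defined yet.
    parser = argparse.ArgumentParser(
        prog="bwds", description="Generate a candidate bad-words list.")
    parser.add_argument("input", help="path to a text file, one edit per line")
    parser.add_argument("--min-count", type=int, default=2,
                        help="minimum occurrences for a word to be listed")
    return parser


def main(argv=None, out=None):
    """Parse arguments, call into the package, write one word per line."""
    if out is None:
        out = sys.stdout
    args = build_parser().parse_args(argv)
    with open(args.input, encoding="utf-8") as f:
        words = find_bad_words(f, min_count=args.min_count)
    for word in words:
        out.write(word + "\n")
    return 0
```

In an actual utilities/bwds.py, `find_bad_words` would be imported from the new package instead of being defined inline, and the script would end with the usual `if __name__ == "__main__": sys.exit(main())` guard. Keeping `main` free of logic also makes it easy to unit test by passing `argv` and `out` directly.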
It would be great to have unit tests.
Documentation for BWDS: https://www.mediawiki.org/wiki/ORES/BWDS_review
Example of a generated bad words list (sections are collapsed by default): https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/hu
As a later enhancement, we might even want to templatize this step as part of the Makefile. Note that the Makefile is code-generated from templates/Makefile.j2.
Other follow-up work (to be defined in a new task) would be to streamline how the generated bad-words list is included in our code. Currently, it's pasted into the language support files in revscoring, e.g. https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/albanian.py , but it might be better to dump the results as text files in a "datasets" directory rather than treat them as code. The interesting NLP twist is that we need to take the raw words and derive regexes that match common variations.
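To make the "raw words to regexes" idea concrete, here is a minimal sketch under stated assumptions. It reads words from a plain-text list (one word per line, a guess at what a "datasets" file might look like) and builds a single regex that tolerates stretched letters, e.g. "daaarn" for "darn". The function names are hypothetical, and the repeated-letter heuristic is only one simple kind of variation; it is not how the revscoring language files are currently written.

```python
import re
from itertools import groupby


def load_words(lines):
    """Read one word per line, skipping blank lines (assumed file format)."""
    return [line.strip() for line in lines if line.strip()]


def word_to_pattern(word):
    """Build a regex fragment that tolerates stretched letters.

    Each run of a repeated character must appear at least as many times
    as in the raw word, so "ass" becomes "a+s{2,}" and will not match "as".
    """
    parts = []
    for char, run in groupby(word.lower()):
        length = len(list(run))
        quantifier = "+" if length == 1 else "{%d,}" % length
        parts.append(re.escape(char) + quantifier)
    return "".join(parts)


def compile_word_list(words):
    """Compile all words into one case-insensitive, word-bounded regex."""
    if not words:
        raise ValueError("need at least one word")
    alternatives = "|".join(word_to_pattern(word) for word in words)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
```

With a list file open as `f`, `compile_word_list(load_words(f))` gives a pattern whose `search` method can be run against edit text. Other variations (leetspeak substitutions, inflected forms, inserted punctuation) would need their own rules and probably per-language review.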