I think just blacklisting nazi is sufficient
nazi is on the blacklist present in the repository since 2014, see 1e5bd7dc3c1be
Seems we are using a different blacklist for captcha generation, but I see no need to do so.
And fwiw, yes, I have confirmed that having nazi on the blacklist does prevent nazis from appearing. The blacklist is checked last, so it doesn't matter whether the blacklisted term appears as one of the words or arises from a combination of valid words, e.g.:
Picked word naz
Picked word2 isabel
word is nazisabel
skipping word pair 'nazisabel' because it contains blacklisted word 'nazi'
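The behaviour in that log can be sketched roughly as follows. This is only a minimal illustration, not the actual generator code: the function name `first_blacklisted` and the one-entry sample blacklist are hypothetical, and the sketch assumes a plain case-insensitive substring check on the joined word pair.

```python
def first_blacklisted(word, blacklist):
    """Return the first blacklisted term contained anywhere in word, else None."""
    lowered = word.lower()
    for bad in blacklist:
        # Skip empty entries (e.g. blank lines), which would match everything.
        if bad and bad.lower() in lowered:
            return bad
    return None


# Hypothetical sample data; the real lists are read from files.
blacklist = ["nazi"]
word1, word2 = "naz", "isabel"
word = word1 + word2  # the check runs on the joined pair, not on each half

bad = first_blacklisted(word, blacklist)
if bad:
    print("skipping word pair '%s' because it contains blacklisted word '%s'"
          % (word, bad))
```

Because the check runs on the concatenation, neither "naz" nor "isabel" needs to be blacklisted on its own for the pair to be rejected.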
The repo has a tiny blocklist (510 bytes). A 5KB blocklist seems perfectly acceptable to store there, and a good blocklist would benefit everyone.
Do we know the source of that blocklist?
(also, what's the last-modified timestamp of the wmf blocklist?)
Nope, don't know the actual source of the blacklist nor of the word list... They were in @aaron's home directory for a while at one point...
reedy@mwmaint1002:~$ ls -al /home/aaron/home-terbium/badwords
-rw-r--r-- 1 aaron wikidev 7206 Sep 25 2014 /home/aaron/home-terbium/badwords
reedy@mwmaint1002:~$ ls -al /home/aaron/home-terbium/words
-rw-r--r-- 1 aaron wikidev 938848 Sep 25 2014 /home/aaron/home-terbium/words
I can't actually tell you the current last-modified timestamp, as the file is provisioned by puppet from the private puppet repo. So what I can see on disk in /etc/fancycaptcha is irrelevant, since the servers have been installed/reinstalled since then.
They've been in puppet private since at least November 2016 in 6e1867f843052b067807012629ec82362c3f626e
Kind of amusingly in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/319892/ I said
I think, we can probably put the badword/blacklist list in public too, as long as that doesn't offend anyone
Reason not to do it was Chad saying
Eh, putting the badword list public allows someone to limit their dictionary if they're trying to build a list of known words, best to keep it private too imho.
The actual wordlist was published by Tim many years ago,¹ which is much more relevant for an attacker than the blacklist. Someone even created a Greasemonkey script to check whether you were providing a valid captcha solution.
We should publish the blacklist. We gain more from using a common blacklist than from the near-zero security of keeping it private: at most, an attacker whose guess matched the blacklist could discard that guess before sending it to the server.
¹ And it actually wasn't really special. If I remember right, the list used was just your normal /usr/share/dict dictionary, filtered to words of 4-5 characters not starting with f (just like the code does again at line 280; I'm not sure if it had the double begins/ends stripped, too).
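A filter along the lines of that footnote could look like the sketch below. It is only an illustration under assumptions: the function name `filter_words` and its parameters are hypothetical, the length bounds and the leading-f rule come from the recollection above, the alphabetic-only check is an added assumption, and the "double begins/ends" stripping is left out because it is unclear whether it applied.

```python
def filter_words(words, min_len=4, max_len=5, banned_initial="f"):
    """Keep words of min_len..max_len letters that do not start with banned_initial."""
    kept = []
    for raw in words:
        w = raw.strip().lower()
        # Assumption: drop entries with apostrophes, digits, etc.
        if not w.isalpha():
            continue
        if not (min_len <= len(w) <= max_len):
            continue
        if w.startswith(banned_initial):
            continue
        kept.append(w)
    return kept


# Hypothetical sample input standing in for lines of a dictionary file.
sample = ["fork", "isabel", "word", "naz", "Alpha", "it's"]
print(filter_words(sample))
```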
Sounds great. For what it's worth, I let the individual who originally contacted us know that multiple people were working on resolving this and they seemed impressed that we were jumping on it so diligently.