
Add more bad words to fancycaptcha/badwords
Closed, Resolved · Public

Description

Per this discussion:
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Offensive_Captcha
can someone make sure that "nazis" is added, if it isn't already on the list?

I think just blacklisting nazi is sufficient

Event Timeline

Reedy created this task. May 25 2019, 6:30 PM
Restricted Application added a subscriber: Aklapper. May 25 2019, 6:30 PM
Reedy renamed this task from Add another bad word to fancycaptcha/badwords to Add more bad words to fancycaptcha/badwords. May 25 2019, 6:40 PM
Reedy updated the task description.


I would have guessed that "nazi" on the list would pick up "nazis", but the attached screenshot suggests otherwise (if that works; I've never uploaded an image here before).

Reedy added a comment. May 25 2019, 9:30 PM

Cheers for the screenshot.

Nazi isn't on the list, hence the request to have it added

nazi is on the blacklist present in the repository since 2014, see 1e5bd7dc3c1be

Seems we are using a different blacklist for captcha generation, but I see no need to do so.

And fwiw yes, I have confirmed that having nazi on the list does prevent nazis from appearing. The blacklist is checked last, so it doesn't matter whether the blacklisted term appears as one of the words or arises from a combination of valid words, e.g.:

Picked word naz
Picked word2 isabel
word is nazisabel
skipping word pair 'nazisabel' because it contains blacklisted word 'nazi'
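
For illustration, the check described above behaves like a substring test on the joined pair. The following is only a minimal sketch under that assumption; the function names `find_blacklisted` and `pick_pair` are hypothetical and are not taken from the actual captcha generation script.

```python
# Hypothetical sketch of a "blacklist checked last" substring test.
# Function names are illustrative, not the real script's API.

def find_blacklisted(candidate, blacklist):
    """Return the first blacklisted term contained in candidate, or None."""
    lowered = candidate.lower()
    for term in blacklist:
        # Skip empty entries so a blank line in the list never matches everything.
        if term and term.lower() in lowered:
            return term
    return None


def pick_pair(word1, word2, blacklist):
    """Join two words; return None if the result contains a blacklisted term."""
    combined = word1 + word2
    hit = find_blacklisted(combined, blacklist)
    if hit is not None:
        print("skipping word pair '%s' because it contains blacklisted word '%s'"
              % (combined, hit))
        return None
    return combined
```

Because the test runs on the joined string, a single entry such as nazi covers both the plural and any pair of harmless words that happens to spell it across the join.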

Reedy added a comment. May 25 2019, 9:51 PM

nazi is on the blacklist present in the repository since 2014, see 1e5bd7dc3c1be

It might be in that one, but it's not in the list WMF are using. Hence the problem.

Reedy added a comment. May 25 2019, 9:54 PM

Seems we are using a different blacklist for captcha generation, but I see no need to do so.

Probably not. But the one in WMF prod is about 9 times larger than the one in the repo

The repo has a tiny blocklist (510 bytes). A 5KB blocklist seems perfectly acceptable to store there, and a good blocklist would benefit everyone.
Do we know the source of that blocklist?
(also, what's the last-modified timestamp of the wmf blocklist?)

Reedy added a subscriber: aaron. May 25 2019, 10:06 PM

The repo has a tiny blocklist (510 bytes). A 5KB blocklist seems perfectly acceptable to store there, and a good blocklist would benefit everyone.
Do we know the source of that blocklist?
(also, what's the last-modified timestamp of the wmf blocklist?)

Nope, don't know the actual source of the blacklist nor of the word list... They were in @aaron's home directory for a while at one point...

reedy@mwmaint1002:~$ ls -al /home/aaron/home-terbium/badwords 
-rw-r--r-- 1 aaron wikidev 7206 Sep 25  2014 /home/aaron/home-terbium/badwords
reedy@mwmaint1002:~$ ls -al /home/aaron/home-terbium/words 
-rw-r--r-- 1 aaron wikidev 938848 Sep 25  2014 /home/aaron/home-terbium/words

I can't actually tell you the current last-modified timestamp, as it's puppet provisioned from the private puppet repo. What I can see on disk in /etc/fancycaptcha is therefore irrelevant, because the servers have been installed or reinstalled since then.

They've been in puppet private since at least November 2016 in 6e1867f843052b067807012629ec82362c3f626e

Kind of amusingly in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/319892/ I said

I think, we can probably put the badword/blacklist list in public too, as long as that doesn't offend anyone

Reason not to do it was Chad saying

Eh, putting the badword list public allows someone to limit their dictionary if they're trying to build a list of known words, best to keep it private too imho.

The actual wordlist was published by Tim many years ago,¹ which is much more relevant for an attacker than the blacklist. Someone even created a Greasemonkey script to check whether you were providing a valid captcha solution.

We should publish the blacklist. We gain more from using a common blacklist than from the near-zero security of keeping it private, which at best stops someone from discarding a captcha guess that matches the blacklist before sending it to the server.

¹ And it actually wasn't really special. If I remember right, the list used was just your normal /usr/share/dict dictionary, filtered to words of 4-5 characters, not starting with f (just like the code does again at line 280; not sure if it had the double begins/ends stripped, too).
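
As a rough sketch of the filtering recalled in the footnote (4-5 characters, not starting with f), something like the following would do. The function name `filter_words`, the lowercasing, and the letters-only check are assumptions added for illustration; none of them is confirmed as how the published list was actually built.

```python
# Hypothetical sketch of the dictionary filter recalled in the footnote.
# Lowercasing and the letters-only check are assumptions, not known behaviour.

def filter_words(words, min_len=4, max_len=5, banned_first_letter="f"):
    """Keep words of min_len..max_len characters that do not start with the banned letter."""
    kept = []
    for raw in words:
        word = raw.strip().lower()
        if not word.isalpha():
            # Assumption: drop entries with apostrophes, digits, etc.
            continue
        if not (min_len <= len(word) <= max_len):
            continue
        if word.startswith(banned_first_letter):
            continue
        kept.append(word)
    return kept
```

The input can be any iterable of lines, so an open dictionary file can be passed in directly; the exact file under /usr/share/dict is left open here, since the comment does not name it.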

I have replaced the existing badwords with P8560.

Volans triaged this task as Medium priority. May 27 2019, 9:18 AM
Reedy closed this task as Resolved. May 27 2019, 1:22 PM
Reedy assigned this task to ArielGlenn.

Captchas should've been re-generated last night, so these words should have taken effect.

Whether we want to make the blacklist public, etc., should be moved to another task.

Sounds great. For what it's worth, I let the individual who originally contacted us know that multiple people were working on resolving this and they seemed impressed that we were jumping on it so diligently.