Page MenuHomePhabricator

Investigate and mitigate trivial bypass to AntiSpoof
Open, LowPublicSecurity

Description

The following recent accounts were created on enwiki:

Please consider resolving the defect that makes this possible. I will again state that the Unicode's confusables.txt is a better basis for a list of confusable characters than one that is "derived by unknown methods".

Details

Author Affiliation
Other (Please specify in description)

Event Timeline

ST47 created this task.Feb 27 2020, 3:50 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 27 2020, 3:50 PM
ST47 updated the task description. (Show Details)
Reedy set Security to Software security bug.Feb 27 2020, 3:58 PM
Reedy added projects: Security, Security-Team.
Reedy changed the visibility from "Public (No Login Required)" to "Custom Policy".
Reedy changed the subtype of this task from "Task" to "Security Issue".
Reedy added a subscriber: Reedy.
Reedy added a comment.Feb 27 2020, 4:01 PM

Please consider resolving the defect that makes this possible. I will again state that the Unicode's confusables.txt is a better basis for a list of confusable characters than one that is "derived by unknown methods".

You really don't need to keep restating the same things in different bugs, but you could reference them when you're reporting similar issues

As always, patch-welcome

ST47 added a comment.EditedFeb 27 2020, 9:00 PM

Good point. I have attached a patch which uses a python script to collect the confusables.txt data and add it in to the equivset database. The patch is against https://phabricator.wikimedia.org/source/Equivset/repository/master/. Several notes:

  • I had thought to automatically reduce "multi-level" equivalences (if A => B and B => C, then replace that with A => C and B => C). However, that changes several rules that are already in the database, so I commented that out so that only the *functional* changes would be visible in the patch.
  • There is a comment in the database file about a previous version incorrectly marking E and H as equivalent. That doesn't appear to be the case with this version, but I need to double check.
  • There are a few things that now map to the ASCII space character, and I don't know if the file format strictly allows this. They can be excluded fairly easily if need be.
  • I wasn't able to run the bin/console script yet, there appear to be some missing files that it is expecting to find in a non-existent "vendor" directory.

The script in the attached patch is public domain.

Reedy added a comment.Feb 27 2020, 9:05 PM
  • I wasn't able to run the bin/console script yet, there appear to be some missing files that it is expecting to find in a non-existent "vendor" directory.

composer update

ST47 added a comment.Feb 28 2020, 3:34 AM

Ah, I see, thanks. New patch 0001 is the same as the above, but with the dist/ files updates. New patch 0002 also collapses multiple level equivalences, which I don't know if it's required but it might be clearer for reviewers.


sbassett added a subscriber: sbassett.

@ST47 et al - the Security-Team reviewed this at our triage meeting today. We've tagged Anti-Harassment as the ostensible owners of the code and recommend pushing patches through gerrit, hence making this task public. If you have any further questions or concerns, please let us know.

sbassett triaged this task as Low priority.Mar 2 2020, 4:22 PM
sbassett changed Author Affiliation from N/A to Other (Please specify in description).
sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)".
sbassett moved this task from Incoming to Watching on the Security-Team board.
aezell added a subscriber: aezell.Mar 4 2020, 3:33 PM

While the Anti-Harassment Team hasn't worked on Anti-Spoof before, it's possible we could help with code review once the patches are available in Gerrit.

Feel free to post here and tag me as a reviewer in Gerrit. I'll work to get some eyes on the patches.

Reedy added a comment.Mar 4 2020, 3:51 PM

While the Anti-Harassment Team hasn't worked on Anti-Spoof before, it's possible we could help with code review once the patches are available in Gerrit.

Feel free to post here and tag me as a reviewer in Gerrit. I'll work to get some eyes on the patches.

I thought AHT did the refactoring/splitting of AntiSpoof into Equivset ? Or was that Community Tech?

aezell added a comment.Mar 4 2020, 4:05 PM

@Reedy You're right. That happened before I got here and I was not aware.

We'll be happy to help review the work.

Reedy added a comment.Mar 4 2020, 4:19 PM

@Reedy You're right. That happened before I got here and I was not aware.

We'll be happy to help review the work.

Sweet, thanks.

It's alway useful to have the last people to touch (and doing some larger refactoring) things to try and help review. There are patches above, but we should get them into gerrit first I think for automated testing etc

aezell added a comment.Mar 4 2020, 4:29 PM

We'd be more comfortable reviewing things in Gerrit and seeing the outcome of the testing, as you mention.

Nirmos added a subscriber: Nirmos.Mar 4 2020, 4:34 PM

Of the four cases linked in the description, the first three seem to be a duplicate of T194310.

The fourth case is probably invalid/wontfix, assuming that it is Casliber that Caslieber is impersonating. If you want to disallow Caslieber because Casliber is already registered, then do you also want to disallow Mao if Mo is already registered?

ST47 added a comment.Mar 4 2020, 5:37 PM

Thanks for the discussion. I unfortunately am not able to log in to Gerrit, due to an issue with my account. My understanding is that it isn't possible to fix. Please email me privately with any other questions about that.

Can another developer look at the patch and, if they agree with them, upload to Gerrit? I'm happy to consider them "public domain", and others are free do to whatever they like with them.

Change 576925 had a related patch set uploaded (by SBassett; owner: ST47):
[mediawiki/libs/Equivset@master] Update confusables within equivset.in

https://gerrit.wikimedia.org/r/576925

@ST47 - I just squashed the patches from T246353#5925479 and uploaded to gerrit (see above). @aezell - They should be ready for review by Anti-Harassment or any other interested parties.

I unfortunately am not able to log in to Gerrit, due to an issue with my account.

@ST47: Is there a bug report, to fix that?

Reedy added a comment.Mar 7 2020, 12:38 AM

Thanks for the discussion. I unfortunately am not able to log in to Gerrit, due to an issue with my account. My understanding is that it isn't possible to fix. Please email me privately with any other questions about that.

Can another developer look at the patch and, if they agree with them, upload to Gerrit? I'm happy to consider them "public domain", and others are free do to whatever they like with them.

There's a few questions and comments appearing over on phab.. And as the author of the patch, it'd be useful to have your input :)

I'm happy to +2 this. Seems like we raised the issue about Python but feel that it's an acceptable approach here.

Outside of that, it seems there is agreement on the code. I won't +2 until we are sure this isn't something too risky to roll out right now.

Any thoughts on the risk assessment for this change?

Reedy added a comment.Mar 18 2020, 7:10 PM

It'd be nice if @ST47 could reply to the questions

If we leave it as a python script, that the library doesn't shell out to.. We might want to update the docs at least

I guess the questions is whether the current tests/test cases are sufficient. And whether this patch should be adding any..

I'd hoped to polish up the patch a bit more around some of the suggestions, but well, yeah. Strange times. Hopefully I can find some time soon to work through a few modifications and improvements.

Huji added a subscriber: Huji.Jun 11 2020, 2:03 PM

The idea of bringing in Unicode's confusables is a great idea. However, I think before we implement this, we need to address a bigger issue: right now when we change something in Equivsets, we have no way to know whether it will or won't break another piece of code (e.g in AbuseFitler) which relies on it. This is because Equivset is using a very short set of unit tests.

The ideal solution is to significantly (and thoughtfully) expand our unit tests. That is a separate task.

The interim solution is to require every patch to not only update the equivset files, but also add a few relevant unit tests to demonstrate that the change will not break any old behavior.

For instance, in https://gerrit.wikimedia.org/r/576925/ several mappings that were previously to I are not to L and that could break things downstream, so we should see some unit tests on that.

I did recently suggest in T253367 as means to look at the abuse of some of the unicode textual representatives and how we might be able to use anitspoof as a futher means to exclude as usernames or as certain article names.

Huji added a comment.Jun 11 2020, 2:50 PM

I did recently suggest in T253367 as means to look at the abuse of some of the unicode textual representatives and how we might be able to use anitspoof as a futher means to exclude as usernames or as certain article names.

That is a good idea too. However, that doesn't diminish the importance of better unit tests in Equivset

sbassett changed the task status from Open to Stalled.Jun 26 2020, 7:13 PM
sbassett added a project: user-sbassett.
sbassett moved this task from Watching to Waiting on the Security-Team board.

I haven't had time to focus on this lately. I'm wondering if we should abandon the current effort due to @Huji's gerrit comment.

sbassett moved this task from Backlog to Postponed on the user-sbassett board.Jun 26 2020, 7:13 PM
kaldari changed the task status from Stalled to Open.Oct 9 2020, 8:08 PM
kaldari added a subscriber: kaldari.

@ST47 - Sorry I didn't notice this task until now. I've marked the patch as -2 as we explicitly chose not to use the Unicode confusables list in Equivset for several reasons. Most importantly, confusables.txt and Equivset have different purposes and are not intended to be utilized in the same way. Confusables.txt is intended to be used to see if two strings can be confused with each other, but it isn't intended to be used to create filter strings like we do in AbuseFilter, nor does it handle casefolding as Equivset does. This allows you to filter with a single string like "POOP" instead of "Pp|Oo|Oo|Pp" in AbuseFilter. It also results in significantly different mappings. For example we don't map capital I to lowercase L even though they are confusable. Otherwise you would have to filter out the word "idiot" with "LDLOT" instead of "IDIOT" in AbuseFilter. Secondly, confusables.txt is much bigger than our list with lots of obscure symbols mapping to other obscure symbols which we don't actually care about (for the most part). For AbuseFilter, we have to run every character of every edit through the entire Equivset list, and the longer that list is the more of a performance hit it is on saving an edit.

You are very much correct that Equivset and AntiSpoof are not foolproof and there are many trivial ways to work around them. Unfortunately, this will always be the case so long as we are relying on a one-way 1-to-1 mapping of characters. It simply isn't possible to map all possible equivalences in this way without creating loops or lots of false positives for AbuseFilter string filters. To be honest, confusables.txt would probably be a better solution for username filtering that Equivset, as Equiveset is more tailored to AbuseFilter's use cases.

Change 576925 abandoned by SBassett:
[mediawiki/libs/Equivset@master] Update confusables within equivset.in

Reason:
See T246353#6533626

https://gerrit.wikimedia.org/r/576925