
Add new matches_any function to join series of regular expressions
Open, Needs Triage, Public

Description

Using AbuseFilter's debugging tools, you can see that applying ccnorm or norm to a regex containing pipes (|) transforms each pipe into the letter I. This means we can't do things like: ccnorm(added_lines) irlike ccnorm("A|B|C"). A proposed solution (thanks to @MER-C) is to create a new function that works like contains_any, except that it accepts regular expressions and joins its arguments with | on the server side.

Adding to that, it might be helpful to have norm_matches_any and ccnorm_matches_any, so we can avoid the duplication in calls such as matches_any(added_lines, ccnorm('regex1'), ccnorm('regex2'), ccnorm('regex3')).
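The proposal can be sketched in Python as follows. This is only an illustration: toy_ccnorm and its four-entry table are hypothetical stand-ins for ccnorm (not the real AntiSpoof data), and matches_any and ccnorm_matches_any are the proposed functions, which do not exist in AbuseFilter. Wrapping each alternative in a non-capturing group is an extra assumption, made so that one pattern cannot bleed into the next.

```python
import re

# Toy stand-in for ccnorm. This is NOT the real AntiSpoof equivalence table;
# it only maps the few characters needed for the demonstration.
TOY_TABLE = str.maketrans({"0": "O", "3": "E", "@": "A", "|": "I"})


def toy_ccnorm(text: str) -> str:
    """Uppercase the text, then replace look-alike characters."""
    return text.upper().translate(TOY_TABLE)


def matches_any(haystack: str, *patterns: str) -> bool:
    """Join the patterns with | and search the haystack once."""
    joined = "|".join(f"(?:{p})" for p in patterns)
    return re.search(joined, haystack) is not None


def ccnorm_matches_any(haystack: str, *patterns: str) -> bool:
    """Normalize the haystack and each pattern separately, then join and match."""
    normalized = [toy_ccnorm(p) for p in patterns]
    return matches_any(toy_ccnorm(haystack), *normalized)


# Normalizing the whole alternation turns the pipes into the letter I ...
print(toy_ccnorm("A|B|C"))
# ... whereas joining after normalization keeps the alternation intact.
print(ccnorm_matches_any("xxBxx", "A", "B", "C"))
```

Because the pipe is added after normalization, it never passes through the normalizer. Any other metacharacter inside the individual patterns still does.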

Another solution by @Huji, aiming to fix the behaviour for both | and \b, is available in the task description of T190648.

Event Timeline

This perhaps can't or shouldn't be done as described, because there's no guarantee that other regex syntax won't be transformed by norm or ccnorm. Let's say we write our filter as ccnorm_matches_any(added_lines, "goo+gle", "yahoo+"), which would be equivalent to matches_any(ccnorm(added_lines), ccnorm("goo+gle"), ccnorm("yahoo+")) (matches_any does not actually exist either, but that's beside the point). Here we want it to return true if added_lines contains "Go00oogle", "Gooogle", "Yah00ooo", "YahoO", etc. (any number of O's or 0's, because ccnorm will turn a 0 into an O). The matches_any function then joins the patterns with a pipe, effectively running ccnorm(added_lines) matches "GOO+GLE|YAHOO+". This is great, but what if someone later adds a rule to AntiSpoof that turns +'s into T's? Then you'd unexpectedly end up with ccnorm(added_lines) matches "GOOTGLE|YAHOOT".

So basically, we shouldn't rely on a regex working as intended after running norm or ccnorm on the expression.
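A small Python sketch of that risk, under stated assumptions: toy_norm is a hypothetical stand-in for ccnorm, TODAY holds a single illustrative mapping, and FUTURE adds the imagined rule that turns + into T. Neither table reflects the actual AntiSpoof data.

```python
import re

PATTERNS = ["goo+gle", "yahoo+"]

# Illustrative tables only; neither is the real AntiSpoof mapping.
TODAY = {"0": "O"}
FUTURE = {"0": "O", "+": "T"}  # hypothetical future rule: + becomes T


def toy_norm(text: str, table: dict) -> str:
    """Uppercase the text and apply the given look-alike table."""
    return text.upper().translate(str.maketrans(table))


def joined_pattern(table: dict) -> str:
    """Normalize each pattern with the table, then join with a pipe."""
    return "|".join(toy_norm(p, table) for p in PATTERNS)


def filter_hits(text: str, table: dict) -> bool:
    """Emulate: normalized text matches the joined, normalized patterns."""
    return re.search(joined_pattern(table), toy_norm(text, table)) is not None


for name, table in (("today", TODAY), ("future", FUTURE)):
    print(name, joined_pattern(table), filter_hits("Go00oogle", table))
```

The filter text never changes between the two runs; only the table does, and the quantifier silently becomes a literal letter.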

I don't know what to do here, frankly. It'd be nice to combine AntiSpoof and regex, but I don't see how to avoid the above risks. One could manually emulate AntiSpoof with regex, like ccnorm(added_lines) matches "G[O0][O0]+GL[E3]|Y[@A]H[O0][O0]+", so it's at least still possible to do what you need to with a bit of extra work.
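The manual approach can be tried out in Python as a rough sketch. Two assumptions differ from the filter syntax above: the pattern is applied to the raw text with case-insensitive matching rather than to ccnorm output, and it spells out the second G of "google" so that it matches the intended spellings. The character classes cover only the look-alikes named in this thread.

```python
import re

# Hand-written character classes standing in for normalization.
# Only the look-alikes mentioned in the discussion are covered.
MANUAL = re.compile(r"G[O0][O0]+GL[E3]|Y[@A]H[O0][O0]+", re.IGNORECASE)


def hits(text: str) -> bool:
    """Return True if the text contains one of the emulated spellings."""
    return MANUAL.search(text) is not None


for sample in ("Go00oogle", "Gooogle", "Yah00ooo", "YahoO", "Gogle"):
    print(sample, hits(sample))
```

Since nothing here passes through a normalizer, the quantifiers and the pipe keep their regex meaning regardless of how the equivalence table evolves. The cost is that every look-alike has to be listed by hand.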

I'm sorry if this question sounds silly. However, can we actually do something like irlike ccnorm("A|B|C")? Isn't it equivalent to irlike "AIBIC"? I mean, since we're calling ccnorm on a known string, what do we want to normalize here?

Niharika subscribed.

I'm gonna take Community-Tech off of this. Feel free to make it into a wishlist proposal if you want us to work on it. :)

I'm sorry if this question sounds silly. However, can we actually do something like irlike ccnorm("A|B|C")? Isn't it equivalent to irlike "AIBIC"? I mean, since we're calling ccnorm on a known string, what do we want to normalize here?

Right now, yes. But what if the definition of ccnorm changes? Then you have to update your filter each time. So instead of taking on that burden, plus the burden of working out ccnorm(x) by hand and pasting the output into the filter, we keep ccnorm(x) in the filter even when x is a static, known string. The additional runtime footprint is negligible.

Yes, this was only an example. Though, I still can't see the point of using ccnorm on a static needle. When using irlike (and similar functions) you supposedly know exactly what you want to search for. For instance, if I want the filter to be triggered if a user adds "foo", would it make sense to write it as added_lines irlike ccnorm("f00")? I mean, there's no equation to solve if you already know what to look for. Or did I miss something?