Page MenuHomePhabricator

Add string distance function to AbuseFilter
Closed, DeclinedPublic

Description

For labelling potentially vandal changes (especially redirects), it would be useful if there would be an easy way to compare the old and the new text and see how similar they are. Something like the Levenshtein distance could be used to calculate similarity.

Details

Reference
bz34912

Event Timeline

bzimport raised the priority of this task from to Low.Nov 22 2014, 12:19 AM
bzimport added a project: AbuseFilter.
bzimport set Reference to bz34912.
bzimport added a subscriber: Unknown Object (MLST).

Using Levenshtein distance is almost impossible. Or, at least, it's impossible to implement it in a useful way. PHP provides a dedicated function, which however only accepts strings shorther than 255 characters. In any case, Levenshtein distance has a complexity of O(m*n) (about O(n^2) if added and removed lines have similar length) and, especially if used with large variables like new_wikitext (which is what you have to do when comparing versions), it may be seriously slow. Of course we may add it with the same limitation of 255 characters, but it would be almost useless. There might be some quicker and more efficient alternative, but I don't know any.

Huji subscribed.

I am with @Daimona on this one. There exists no string comparison function (that I am aware of) with a complexity that is not in quadratic/cubic range. With our current resources and run-time, this would be impractical to implement on WMF wikis (let alone other smaller wikis that use AbuseFilter).

Marking as Declined for now; perhaps one day when quantum computers become a commodity, we can reopen this task :)

Right, Levenshtein is a poor choice for this, but a similarity algorithm for recognizing full-page blanking/replacement vandalism is not particularly hard, e.g. measure the longest common substring.

That said, with ORES available these days, I'm not sure if there's still a use case fr this.

Hi, for a long time an LTA is creating pages on enwn with the tilte visually similar to "This is a problematic string". The create pages like "thís is a problemat1c string", "tnis iss a problemátiç strimg" and of such sorts. There is not much in the body. I really think dice coffecient, or some sort of string similarity algorithm, and then a hamming disctance calculating function can help preventing abuse.

Hi, for a long time an LTA is creating pages on enwn with the tilte visually similar to "This is a problematic string". The create pages like "thís is a problemat1c string", "tnis iss a problemátiç strimg" and of such sorts. There is not much in the body. I really think dice coffecient, or some sort of string similarity algorithm, and then a hamming disctance calculating function can help preventing abuse.

You can use norm or ccnorm for that:

ccnorm(added_lines) contains ccnorm( 'This is a problematic string' )

@Daimona as I metioned, that does not solve the problem when the LTA uses 'Tnis is a problematic string' or 'Th1s is a problematic strimg'. They are visually similar, but not something normalisation can solve on its own.

PHP's limited-to-255 Levenshtein would work fine for titles. There is probably no non-confusing way of preventing users from applying it to longer strings, though.

Maybe, a hacky way:

Lev(str) { if(str.length <= 255) {...algo...} }
In T36912#6809120, @Tgr wrote:

PHP's limited-to-255 Levenshtein would work fine for titles. There is probably no non-confusing way of preventing users from applying it to longer strings, though.

This.

Maybe, a hacky way:

Lev(str) { if(str.length <= 255) {...algo...} }

PHP's Levenshtein is already capped at 255 characters. If we keep this restrictions, I'd expect people to come here and start complaining because "This filter didn't work". If we work around the limitation by implementing our own levenshtein function, I'd expect myself to go at some wiki's village pump and start complaining because "This filter is delaying edit saves by 30 seconds".

@Daimona as I metioned, that does not solve the problem when the LTA uses 'Tnis is a problematic string' or 'Th1s is a problematic strimg'. They are visually similar, but not something normalisation can solve on its own.

Could you please point me to actual examples of this?

@Daimona if we specify in the docs the limit is 255 chars, maybe not enough people will complaint.

Re examples, they are hidden in logs. If you still wish to see it, please ping me on IRC, and I will make some arrangements to share it.

@Daimona if we specify in the docs the limit is 255 chars, maybe not enough people will complaint.

Re examples, they are hidden in logs. If you still wish to see it, please ping me on IRC, and I will make some arrangements to share it.

Yeah, I'd like to see them so I can better understand what might be needed. Where can I find you on IRC? Or you can ping me whenever you want :)