Page MenuHomePhabricator

Add string distance function to AbuseFilter
Closed, DeclinedPublic


For labelling potentially vandal changes (especially redirects), it would be useful if there would be an easy way to compare the old and the new text and see how similar they are. Something like the Levenshtein distance could be used to calculate similarity.



Event Timeline

bzimport raised the priority of this task from to Low.Nov 22 2014, 12:19 AM
bzimport added a project: AbuseFilter.
bzimport set Reference to bz34912.
bzimport added a subscriber: Unknown Object (MLST).
Tgr created this task.Mar 2 2012, 10:56 PM
Izno updated the task description. (Show Details)Jan 20 2018, 8:03 PM
Izno removed a subscriber: wikibugs-l-list.

Using Levenshtein distance is almost impossible. Or, at least, it's impossible to implement it in a useful way. PHP provides a dedicated function, which however only accepts strings shorther than 255 characters. In any case, Levenshtein distance has a complexity of O(m*n) (about O(n^2) if added and removed lines have similar length) and, especially if used with large variables like new_wikitext (which is what you have to do when comparing versions), it may be seriously slow. Of course we may add it with the same limitation of 255 characters, but it would be almost useless. There might be some quicker and more efficient alternative, but I don't know any.

Huji closed this task as Declined.Feb 27 2018, 3:13 PM
Huji added a subscriber: Huji.

I am with @Daimona on this one. There exists no string comparison function (that I am aware of) with a complexity that is not in quadratic/cubic range. With our current resources and run-time, this would be impractical to implement on WMF wikis (let alone other smaller wikis that use AbuseFilter).

Marking as Declined for now; perhaps one day when quantum computers become a commodity, we can reopen this task :)

Tgr added a comment.Feb 27 2018, 4:38 PM

Right, Levenshtein is a poor choice for this, but a similarity algorithm for recognizing full-page blanking/replacement vandalism is not particularly hard, e.g. measure the longest common substring.

That said, with ORES available these days, I'm not sure if there's still a use case fr this.

@Tgr There is still a use case for it. See T207446