For labelling potentially vandal changes (especially redirects), it would be useful if there would be an easy way to compare the old and the new text and see how similar they are. Something like the Levenshtein distance could be used to calculate similarity.

# Description

# Details

- Reference
- bz34912

# Related Objects

### Event Timeline

Using Levenshtein distance is almost impossible. Or, at least, it's impossible to implement it in a useful way. PHP provides a dedicated function, which however only accepts strings shorther than 255 characters. In any case, Levenshtein distance has a complexity of `O(m*n)` (about `O(n^2)` if added and removed lines have similar length) and, especially if used with large variables like `new_wikitext` (which is what you have to do when comparing versions), it may be seriously slow. Of course we may add it with the same limitation of 255 characters, but it would be almost useless. There might be some quicker and more efficient alternative, but I don't know any.

I am with @Daimona on this one. There exists no string comparison function (that I am aware of) with a complexity that is not in quadratic/cubic range. With our current resources and run-time, this would be impractical to implement on WMF wikis (let alone other smaller wikis that use AbuseFilter).

Marking as Declined for now; perhaps one day when quantum computers become a commodity, we can reopen this task :)

Right, Levenshtein is a poor choice for this, but a similarity algorithm for recognizing full-page blanking/replacement vandalism is not particularly hard, e.g. measure the longest common substring.

That said, with ORES available these days, I'm not sure if there's still a use case fr this.