Change Details

**Background & Problem to solve**

As discussed on T29987, the current practice of running `ccnorm` on a string and then comparing the result to the supposed canonical form of another string is not viable. The first problem is that users are often not comparing normalised strings to normalised strings, and such apples-to-oranges comparisons have unpredictable results. See T29987#324779 and T29987#324795.

**Proposed solution**

Tim (@tstarling) proposed something like this in T29987#324762:

> Well, how about
>
> `added_lines cclike "testing|vandalizing"`
>
> Where the regex would be tokenized and reassembled, with alphabetic parts normalised with equivset?

That's OK, but I (@Nemo_bis) think a more sensible syntax would be

`cclike(added_lines, testing) || cclike(added_lines, vandalizing)`

That is, a single function should take two strings and tell us whether, once canonicalised in whatever manner the code wants, they are the same thing, i.e. whether they are confusable. This is nothing special: it's the approach followed by the standard API to ICU data, see `uspoof_areConfusable` in <https://ssl.icu-project.org/apiref/icu4c/uspoof_8h.html#ac96fdf642bfd9efcd0d9956bd76cadaa>, found from the documents mentioned in T65217. Nikerabbit pointed me to UTS #36 and UTS #39; they were just drafts when AntiSpoof was created. Now we have better tools.

I'm marking this as blocked on T65217 because such a function seems trivial to implement with the ICU API. I'll comment there more in general.

**Requested deliverable**

* Build a function in AbuseFilter that allows the comparison of two canonicalised strings
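
To make the "two strings in, one answer out" idea concrete, here is a minimal Python sketch of a symmetric comparison. It is only an illustration, not the AbuseFilter or ICU implementation: `CONFUSABLE_MAP`, `skeleton` and `are_confusable` are hypothetical names, and the tiny hand-written table stands in for real data such as equivset or the UTS #39 confusables list.

```python
import unicodedata

# Hypothetical, deliberately tiny stand-in for a real confusables table
# (equivset or UTS #39 data would be used in practice).
CONFUSABLE_MAP = {
    "0": "o",
    "1": "l",
    "3": "e",
    "4": "a",
    "5": "s",
    "@": "a",
    "$": "s",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
}


def skeleton(text):
    """Reduce a string to a canonical form used only for comparison."""
    # Compatibility decomposition splits accented letters into base + marks.
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        # Drop combining marks so that accented and plain letters match.
        if unicodedata.combining(ch):
            continue
        ch = ch.lower()
        # Map look-alike characters to a single representative.
        out.append(CONFUSABLE_MAP.get(ch, ch))
    return "".join(out)


def are_confusable(a, b):
    """Return True if both strings canonicalise to the same skeleton."""
    # Both sides go through the same normalisation, so the caller can never
    # end up comparing a normalised string with a raw one.
    return skeleton(a) == skeleton(b)


if __name__ == "__main__":
    print(are_confusable("t3st1ng", "testlng"))
    print(are_confusable("v\u0430nd\u0430lizing", "vandalizing"))
    print(are_confusable("testing", "vandalizing"))
```

The point of the design is that the caller never sees or handles the canonical form: normalisation is applied to both arguments inside the function, which is the same shape as ICU's `uspoof_areConfusable`.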