
ccnorm revamp: add a more sensible interface for normalised comparison
Closed, Resolved | Public | 3 Estimated Story Points

Description

Background & Problem to solve

As discussed on T29987, the current practice of running ccnorm on a value and then comparing the result to the supposed canonical form of a string is not viable.

The first problem is that users are often not comparing normalised strings to normalised strings; apples-and-oranges comparisons have unpredictable results. See T29987#324779 and T29987#324795.


Proposed solution

Tim proposed something like:

Well, how about

added_lines cclike "testing|vandalizing"

Where the regex would be tokenized and reassembled, with alphabetic parts normalised with equivset?
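A minimal sketch of the tokenise-and-reassemble idea, in Python rather than AbuseFilter's own code. TOY_EQUIV, toy_norm, normalise_pattern and this cclike are hypothetical names; the table is a tiny stand-in for Equivset, not its real data. Escape sequences such as \b are kept intact so that normalisation does not change their meaning; inline flags and named groups are not handled.

```python
import re

# Hypothetical stand-in for Equivset; NOT the real table.
TOY_EQUIV = {"0": "O", "1": "I", "3": "E", "4": "A", "5": "S", "@": "A", "$": "S"}


def toy_norm(text):
    """Uppercase the text and map a few look-alike characters."""
    return "".join(TOY_EQUIV.get(ch, ch) for ch in text.upper())


def normalise_pattern(pattern):
    """Normalise only the alphabetic runs of a regex, leaving syntax alone."""
    # Tokens: an escape sequence (e.g. \b), a run of letters, or any single char.
    tokens = re.findall(r"\\.|[^\W\d_]+|.", pattern, flags=re.S)
    return "".join(toy_norm(t) if t[0].isalpha() else t for t in tokens)


def cclike(subject, pattern):
    """True if the normalised subject matches the normalised pattern."""
    return re.search(normalise_pattern(pattern), toy_norm(subject)) is not None
```

With this sketch, cclike("some T3ST1NG here", "testing|vandalizing") is true, and the "|" in the pattern is left untouched.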

That's ok but I (@Nemo_bis) think a more sensible syntax would be like

cclike(added_lines, "testing") | cclike(added_lines, "vandalizing")

That is, a single function should take two strings and tell us if, once canonicalised in whatever manner the code wants, they are the same thing, AKA if they are confusable.
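As an illustration of the two-argument shape, here is a minimal Python sketch. The names canonicalise and are_confusable and the TOY_EQUIV table are assumptions made for the example; the real canonicalisation would come from Equivset or ICU, not from this toy mapping.

```python
import unicodedata

# Hypothetical stand-in for a confusables table; NOT Equivset or ICU data.
TOY_EQUIV = {"0": "O", "1": "I", "3": "E", "4": "A", "5": "S", "@": "A", "$": "S"}


def canonicalise(text):
    """Strip combining marks, uppercase, then map look-alike characters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(TOY_EQUIV.get(ch, ch) for ch in stripped.upper())


def are_confusable(a, b):
    """Both sides are canonicalised, so the caller cannot mix forms."""
    return canonicalise(a) == canonicalise(b)
```

Because the function canonicalises both arguments itself, the apples-and-oranges comparison described above cannot happen by accident.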

This is nothing special: it's the approach followed by the standard API to ICU data; see uspoof_areConfusable in https://ssl.icu-project.org/apiref/icu4c/uspoof_8h.html#ac96fdf642bfd9efcd0d9956bd76cadaa, found from the documents mentioned in T65217. Nikerabbit pointed me to UTS #36 and UTS #39, which were just drafts when AntiSpoof was created. Now we have better tools.


Requested deliverable

  • Build a function in AbuseFilter that allows the comparison of two canonicalised strings

Event Timeline

bzimport raised the priority of this task from to Medium. (Nov 22 2014, 2:55 AM)
bzimport added projects: AntiSpoof, I18n.
bzimport set Reference to bz63242.
bzimport added a subscriber: Unknown Object (MLST).

I'm not sure if I like Tim's or Nemo's suggestion better. Tim's version will be more concise and simpler to use in most cases, but Nemo's version will be closer to the underlying implementation and possibly offer more flexibility. If we do use Nemo's suggestion, I would advise against calling the function cclike() as that will probably confuse people into thinking it's an operator (like like, rlike, and irlike). A better function name might be are_confusable() (similar to the PHP function) or areAlike().

After giving this some thought, I think I'm leaning towards Tim's suggestion of creating a new operator (like cclike). The main reason I'm leaning towards that solution is that I think it will be more intuitive for filter writers and the main problem we're trying to solve here is that the existing solution isn't intuitive enough.

Change 379159 had a related patch set uploaded (by Dmaza; owner: Dmaza):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add cclike operator to normalize and compare a string to a list

https://gerrit.wikimedia.org/r/379159

@kaldari, @Nemo_bis please check patch 379159 and let me know if that's what you envisioned. @MusikAnimal your opinion is very much appreciated.

Change 379159 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Add ccnorm_contains_any function

https://gerrit.wikimedia.org/r/379159

kaldari moved this task from Code Review to Done on the Anti-Harassment (AHT Sprint 6) board.

Did someone update the documentation on MediaWiki.org? If not, should I open a task for it?

@Huji I've created T177711. We'll work on it when it gets released. I believe it will be next Tuesday.

I'm having a hard time trying to see if/how this can be used to deal with the original use case mentioned at T29987#324746. Our regexes are less readable than ever:
https://pt.wikipedia.org/wiki/Special:AbuseFilter/18
Tim's proposal would still help (since the special character "|" of the regexes would not be normalized), but I'm not sure if we can use this new function to improve things a little...

@He7d3r: I can't parse what https://pt.wikipedia.org/wiki/Special:AbuseFilter/18 is trying to do. Can you give us a simplified use case here? The use case that ccnorm_contains_any() was written to address is the one in the description:
ccnorm_contains_any( added_lines, "testing", "vandalizing" )

We want to detect edits which add any expression from a given list of "words". We need to use regexes in order to specify this list, because:

  1. We need to check for word boundaries to avoid false positives (e.g.: "VIADO" vs "AVIADOR")
  2. Each word has many variations (e.g. BANDID(?:AO|ONA|INH[AO]) matches 4 specific variations of a single word while not matching variants which would increase the number of false positives)
  3. Some characters are not normalized as we want (e.g. ccnorm('ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ') == 'EEEEIIIOOOOOQUUUU' should be true), so we use character classes inside the regexes to deal with these cases.
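Points 1 and 2 can be sketched with Python's re module (not AbuseFilter's rlike; the pattern only reuses the examples from the list above, and the helper name hits is made up for the illustration):

```python
import re

# Word boundaries prevent matches inside longer words; the alternation
# lists only the wanted variants of the second word.
WORDS = re.compile(r"\b(?:VIADO|BANDID(?:AO|ONA|INH[AO]))\b")


def hits(text):
    """Return every listed word found in the uppercased text."""
    return WORDS.findall(text.upper())
```

Here hits("o aviador") returns an empty list because of the boundary check, while hits("bandidinha e bandidao") returns both variants; "bandido" is not matched because it is not one of the listed endings.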

One idea I plan to explore is to use str_replace to implement our own normalization, as in
https://pt.wikipedia.org/wiki/WP:Filtro_de_edições/Solicitações?diff=50091160
(but it would be more convenient if it accepted an array of replace pairs, to allow a single call to the function)

@He7d3r: Combining normalization and regexes is tricky, but it might be worth experimenting further. Regarding #3, I created T178010.

@He7d3r I don't think ccnorm_contains_any solves anything for https://pt.wikipedia.org/wiki/Special:AbuseFilter/18.
It literally translates to contains_any(ccnorm(param1), ccnorm(param2), ...) so you still have the boundaries issue.
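A rough Python model of that translation, to make the boundaries issue concrete. toy_ccnorm and its table are hypothetical stand-ins; the real ccnorm relies on Equivset.

```python
# Hypothetical stand-in for Equivset; NOT the real table.
TOY_EQUIV = {"0": "O", "1": "I", "3": "E", "4": "A", "5": "S", "@": "A", "$": "S"}


def toy_ccnorm(text):
    """Uppercase the text and map a few look-alike characters."""
    return "".join(TOY_EQUIV.get(ch, ch) for ch in text.upper())


def ccnorm_contains_any(haystack, *needles):
    """Plain substring search on normalised strings: no word boundaries."""
    normalised = toy_ccnorm(haystack)
    return any(toy_ccnorm(needle) in normalised for needle in needles)
```

In this model ccnorm_contains_any("O AVIADOR chegou", "viado") is true, which is exactly the false positive a word-boundary regex avoids.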

Regarding the character classes inside the regexes that's always gonna be necessary 'cause the Equivset doesn't cover all the possible cases (even tho some of those characters have already been added).

One idea I plan to explore is to use str_replace to implement our own normalization, as in
https://pt.wikipedia.org/wiki/WP:Filtro_de_edições/Solicitações?diff=50091160
(but it would be more convenient if it accepted an array of replace pairs, to allow a single call to the function)

Something maybe worth exploring is implementing strtr or a variant of it where it will allow you to set your own replacements if they are missing from Equivset or not allowed 'cause they only make sense in a particular language or context.
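For illustration, a Python sketch of what a strtr-style function taking replacement pairs could look like. It mimics the array form of PHP's strtr (longest keys tried first, replaced text not scanned again); it is not an existing AbuseFilter function, and the pairs shown in the usage note are only examples.

```python
import re


def strtr(text, pairs):
    """Apply all replacement pairs in a single pass over the text."""
    # Longest keys first, so "ab" wins over "a"; empty keys are ignored.
    keys = sorted((k for k in pairs if k), key=len, reverse=True)
    if not keys:
        return text
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is replaced once; the output is never re-scanned.
    return pattern.sub(lambda m: pairs[m.group(0)], text)
```

A wiki could then pass its own language-specific pairs in one call, for example strtr(added_lines, {"∅": "O", "Á": "A"}), before applying the regex.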