ccnorm revamp: add a more sensible interface for normalised comparison
Closed, Resolved · Public · 3 Estimated Story Points

Description

Background & Problem to solve

As discussed on T29987, the current practice of running ccnorm on some input and then comparing the result to the supposed canonical form of a string is not viable.

The first problem is that users are often not comparing normalised strings to normalised strings; such apples-and-oranges comparisons have unpredictable results. See T29987#324779 and T29987#324795.

Proposed solution

Tim proposed something like:

@tstarling from T29987#324762

Well, how about

added_lines cclike "testing|vandalizing"

Where the regex would be tokenized and reassembled, with alphabetic parts normalised with equivset?
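A rough Python sketch of the "tokenize and reassemble" idea may help picture it. Everything here is hypothetical: the CONFUSABLES table is a toy stand-in for equivset data, and canonicalise, normalise_pattern and this cclike are illustrative names, not AbuseFilter code. Backslash escapes such as \b are kept verbatim so that only alphabetic runs are normalised.

```python
import re

# Toy confusables table: an assumption for illustration, NOT real equivset data.
CONFUSABLES = {"0": "O", "1": "I", "3": "E", "4": "A", "5": "S", "@": "A", "$": "S"}


def canonicalise(text):
    """Upper-case the text and fold each character through the toy table."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text.upper())


# A token is either a backslash escape (kept as is) or a run of letters (normalised).
TOKEN = re.compile(r"\\.|[^\W\d_]+")


def normalise_pattern(pattern):
    """Normalise the alphabetic parts of a regex, leaving metacharacters alone."""
    def fold(match):
        token = match.group(0)
        return token if token.startswith("\\") else canonicalise(token)
    return TOKEN.sub(fold, pattern)


def cclike(subject, pattern):
    """Hypothetical operator: match the normalised pattern against the normalised subject."""
    return re.search(normalise_pattern(pattern), canonicalise(subject)) is not None


print(normalise_pattern(r"\btesting\b|vandalizing"))
print(cclike("stop v4nd4l1z1ng now", "testing|vandalizing"))
```

The "|" separator survives untouched because it is never part of an alphabetic run. This naive tokenizer does not try to handle letters inside character classes or named groups.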

That's ok but I (@Nemo_bis) think a more sensible syntax would be like

cclike(added_lines, testing) || cclike(added_lines, vandalizing)

That is, a single function should take two strings and tell us if, once canonicalised in whatever manner the code wants, they are the same thing, AKA if they are confusable.
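As a minimal sketch of that contract in Python: two strings go in, both are canonicalised the same way, and a boolean comes out. The CONFUSABLES table and the names canonicalise and are_confusable are assumptions made up for this illustration; real canonicalisation would use Equivset or ICU data.

```python
# Toy confusables table: an assumption for illustration, NOT real Equivset/ICU data.
CONFUSABLES = {"0": "O", "1": "I", "3": "E", "4": "A", "5": "S", "@": "A", "$": "S"}


def canonicalise(text: str) -> str:
    """Upper-case the text and fold each character through the toy table."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text.upper())


def are_confusable(a: str, b: str) -> bool:
    """True when both strings share the same canonical form."""
    return canonicalise(a) == canonicalise(b)


print(are_confusable("t3st1ng", "testing"))      # True
print(are_confusable("testing", "vandalizing"))  # False
```

Because both arguments pass through the same canonicalise step, the caller can never end up comparing a normalised string with a raw one.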

This is nothing special: it's the approach followed by the standard API to ICU data; see uspoof_areConfusable in https://ssl.icu-project.org/apiref/icu4c/uspoof_8h.html#ac96fdf642bfd9efcd0d9956bd76cadaa, found via the documents mentioned in T65217. Nikerabbit pointed me to UTS #36 and UTS #39, which were just drafts when AntiSpoof was created. Now we have better tools.

Requested deliverable

Build a function in AbuseFilter that allows the comparison of two canonicalised strings

Details

Reference: bz63242

	Subject	Repo	Branch	Lines +/-
	Add ccnorm_contains_any function	mediawiki/extensions/AbuseFilter	master	+73 -19


Related Objects

Status	Assigned	Task
Resolved	• TBolliger	T166816 Epic ⚡️ : Accuracy improvements to anti-spoof tools across multiple pertinent tools
Resolved	dmaza	T65242 ccnorm revamp: add a more sensible interface for normalised comparison
Open	None	T65217 Augment our AntiSpoof normalization data with Unicode/CLDR data
Resolved	dmaza	T177711 Notify users about ccnorm_contains_any and update abuse filter documentation

Event Timeline

bzimport raised the priority of this task from to Medium. · Nov 22 2014, 2:55 AM

bzimport added projects: AntiSpoof, I18n.

bzimport set Reference to bz63242.

bzimport added a subscriber: Unknown Object (MLST).

Nemo_bis created this task. · Mar 28 2014, 10:22 PM

He7d3r updated the task description. · Nov 20 2016, 3:17 PM

He7d3r removed a subscriber: wikibugs-l-list.

I'm not sure if I like Tim's or Nemo's suggestion better. Tim's version will be more concise and simpler to use in most cases, but Nemo's version will be closer to the underlying implementation and possibly offer more flexibility. If we do use Nemo's suggestion, I would advise against calling the function cclike(), as that will probably confuse people into thinking it's an operator (like like, rlike, and irlike). A better function name might be are_confusable() (similar to the PHP function) or areAlike().

kaldari added a project: AbuseFilter. · Jun 22 2017, 6:41 PM

TBolliger added a parent task: T166816: Epic ⚡️ : Accuracy improvements to anti-spoof tools across multiple pertinent tools. · Aug 25 2017, 6:47 PM

Restricted Application added a subscriber: jeblad. · Aug 25 2017, 6:47 PM

TBolliger added a project: Anti-Harassment. · Aug 25 2017, 6:47 PM

After giving this some thought, I think I'm leaning towards Tim's suggestion of creating a new operator (like cclike). The main reason is that I think it will be more intuitive for filter writers, and the main problem we're trying to solve here is that the existing solution isn't intuitive enough.

TBolliger moved this task from Untriaged to Snackbox on the Anti-Harassment board. · Aug 29 2017, 4:50 PM

TBolliger moved this task from Snackbox to Triage/To be Estimated on the Anti-Harassment board.

TBolliger updated the task description. · Sep 6 2017, 10:30 PM

TBolliger updated the task description. · Sep 6 2017, 10:37 PM

TBolliger set the point value for this task to 3. · Sep 8 2017, 7:51 PM

TBolliger moved this task from Triage/To be Estimated to Cards ready for development on the Anti-Harassment board.

TBolliger moved this task from Cards ready for development to AHT Sprint 5 on the Anti-Harassment board. · Sep 12 2017, 7:06 PM

TBolliger edited projects, added Anti-Harassment (AHT Sprint 5); removed Anti-Harassment.

dmaza claimed this task. · Sep 14 2017, 8:31 PM

dmaza moved this task from Ready to In progress on the Anti-Harassment (AHT Sprint 5) board.

Change 379159 had a related patch set uploaded (by Dmaza; owner: Dmaza):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add cclike operator to normalize and compare a string to a list

https://gerrit.wikimedia.org/r/379159

gerritbot added a project: Patch-For-Review. · Sep 20 2017, 3:31 AM

@kaldari, @Nemo_bis please check patch 379159 and let me know if that's what you envisioned. @MusikAnimal your opinion is very much appreciated.

dmaza moved this task from In progress to Code Review on the Anti-Harassment (AHT Sprint 5) board. · Sep 21 2017, 7:50 PM

TBolliger moved this task from AHT Sprint 5 to AHT Sprint 6 on the Anti-Harassment board. · Sep 27 2017, 6:24 PM

TBolliger edited projects, added Anti-Harassment (AHT Sprint 6); removed Anti-Harassment (AHT Sprint 5).

dmaza moved this task from Ready to Code Review on the Anti-Harassment (AHT Sprint 6) board. · Sep 27 2017, 6:35 PM

Change 379159 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Add ccnorm_contains_any function

https://gerrit.wikimedia.org/r/379159

ReleaseTaggerBot added a project: MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)). · Oct 6 2017, 9:00 PM

kaldari closed this task as Resolved. · Oct 6 2017, 9:03 PM

kaldari moved this task from Code Review to Done on the Anti-Harassment (AHT Sprint 6) board.

Did someone update the documentation on MediaWiki.org? If not, should I open a task for it?

dmaza created subtask T177711: Notify users about ccnorm_contains_any and update abuse filter documentation. · Oct 8 2017, 2:41 AM

@Huji I've created T177711. We'll work on it when it gets released. I believe it will be next Tuesday.

I'm having a hard time trying to see if/how this can be used to deal with the original use case mentioned at T29987#324746. Our regexes are less readable than ever:
https://pt.wikipedia.org/wiki/Special:AbuseFilter/18
Tim's proposal would still help (since the special character "|" of the regexes would not be normalized), but I'm not sure if we can use this new function to improve things a little...

He7d3r mentioned this in T177711: Notify users about ccnorm_contains_any and update abuse filter documentation. · Oct 9 2017, 4:58 PM

@He7d3r: I can't parse what https://pt.wikipedia.org/wiki/Special:AbuseFilter/18 is trying to do. Can you give us a simplified use case here? The use case that ccnorm_contains_any() was written to address is the one in the description:
ccnorm_contains_any( added_lines, "testing", "vandalizing" )

We want to detect edits which add any expression from a given list of "words". We need to use regexes in order to specify this list, because:

1. We need to check for word boundaries to avoid false positives (e.g.: "VIADO" vs "AVIADOR")
2. Each word has many variations (e.g. BANDID(?:AO|ONA|INH[AO]) matches 4 specific variations of a single word while not matching variants which would increase the number of false positives)
3. Some characters are not normalized as we want (e.g. ccnorm('ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ') == 'EEEEIIIOOOOOQUUUU' should be true), so we use character classes inside the regexes to deal with these cases.
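The three requirements above can be sketched together in Python: fold the text first, then apply a word-bounded regex to the folded text. This is only an illustration under stated assumptions. fold_accents, EXTRA_FOLDS and adds_listed_word are hypothetical names, the folding uses Unicode NFD decomposition rather than Equivset, and the "∅" to "O" mapping is added by hand because decomposition does not cover it.

```python
import re
import unicodedata

# Extra folds that Unicode decomposition does not cover (assumed for this sketch).
EXTRA_FOLDS = {"\u2205": "O"}  # EMPTY SET symbol used as a stand-in for "O"


def fold_accents(text):
    """Upper-case, strip combining marks via NFD, then apply the extra folds."""
    decomposed = unicodedata.normalize("NFD", text.upper())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in stripped)


# Word boundaries avoid the "AVIADOR" false positive; the alternation lists
# only the wanted variations of each word.
WORDS = re.compile(r"\b(?:VIADO|BANDID(?:AO|ONA|INH[AO]))\b")


def adds_listed_word(added_lines):
    """True when the folded text contains one of the listed words as a whole word."""
    return WORDS.search(fold_accents(added_lines)) is not None


print(fold_accents("ÈÉÊ"))                   # EEE
print(adds_listed_word("ele é um aviador"))  # False
print(adds_listed_word("que bandidão"))      # True
```

With the text folded up front, the regex itself needs no accent character classes, which is the readability gain being asked for.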

One idea I plan to explore is to use str_replace to implement our own normalization, as in
https://pt.wikipedia.org/wiki/WP:Filtro_de_edições/Solicitações?diff=50091160
(but it would be more convenient if it accepted an array of replace pairs, to allow a single call to the function)

@He7d3r: Combining normalization and regexes is tricky, but it might be worth experimenting further. Regarding #3, I created T178010.

@He7d3r I don't think ccnorm_contains_any solves anything for https://pt.wikipedia.org/wiki/Special:AbuseFilter/18.
It literally translates to contains_any(ccnorm(param1), ccnorm(param2), ...) so you still have the boundaries issue.
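In Python terms, that translation looks roughly like the sketch below. The ccnorm here is a toy stand-in (upper-casing plus a small made-up table), not the real Equivset normalisation; only the shape contains_any(ccnorm(...), ccnorm(...), ...) is taken from the comment above.

```python
# Toy confusables table: an assumption for illustration, NOT real Equivset data.
CONFUSABLES = {"0": "O", "1": "I", "3": "E", "4": "A", "5": "S", "@": "A", "$": "S"}


def ccnorm(text):
    """Toy normaliser standing in for AbuseFilter's ccnorm."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text.upper())


def contains_any(haystack, *needles):
    """True when any needle occurs anywhere in the haystack (plain substring test)."""
    return any(needle in haystack for needle in needles)


def ccnorm_contains_any(haystack, *needles):
    """Normalise every argument, then fall back to the plain substring test."""
    return contains_any(ccnorm(haystack), *(ccnorm(needle) for needle in needles))


print(ccnorm_contains_any("he was t3st1ng", "testing", "vandalizing"))  # True
print(ccnorm_contains_any("o aviador", "viado"))  # True: substring match ignores word boundaries
```

The second call shows the boundaries issue: a substring test has no notion of where a word starts or ends.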

Regarding the character classes inside the regexes that's always gonna be necessary 'cause the Equivset doesn't cover all the possible cases (even tho some of those characters have already been added).

One idea I plan to explore is to use str_replace to implement our own normalization, as in
https://pt.wikipedia.org/wiki/WP:Filtro_de_edições/Solicitações?diff=50091160
(but it would be more convenient if it accepted an array of replace pairs, to allow a single call to the function)

Something maybe worth exploring is implementing strtr, or a variant of it, that would let you set your own replacements when they are missing from Equivset or not allowed there 'cause they only make sense in a particular language or context.
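A rough Python sketch of such a strtr-like helper follows, modelled on the array form of PHP's strtr: all pairs are applied in a single pass, longer keys win over shorter ones, and replaced text is not scanned again. The function name and the sample pairs are illustrative only; no such function exists in AbuseFilter as far as this thread shows.

```python
import re


def strtr(text, pairs):
    """Replace every key of `pairs` found in `text`, longest key first, in one pass."""
    keys = sorted((key for key in pairs if key), key=len, reverse=True)
    if not keys:
        return text
    # Alternation tries keys in order, so sorting by length makes the longest win.
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: pairs[match.group(0)], text)


# Language-specific folds a wiki might want on top of the shared data (made-up sample).
custom_pairs = {"Õ": "O", "Á": "A", "\u2205": "O"}
print(strtr("ÕLÁ", custom_pairs))          # OLA
print(strtr("ab", {"a": "b", "b": "a"}))   # ba (replacements are not re-scanned)
```

Taking all pairs in one call addresses the wish for an array of replace pairs, and the single pass avoids the chained-replacement surprises of repeated str_replace calls.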

TBolliger closed subtask T177711: Notify users about ccnorm_contains_any and update abuse filter documentation as Resolved. · Oct 16 2017, 6:43 PM
