missing character equivalencies: ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ
Closed, Resolved · Public

Assigned To

Authored By

	kaldari
	Oct 11 2017, 10:20 PM

Description

ccnorm( 'ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ' ) should return EEEEIIIOOOOOQUUUU, but instead returns ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ. These equivalencies need to be added to the equivset.
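To make the expected behaviour concrete, here is a minimal Python sketch of table-based normalization. The `EQUIV` table and the `ccnorm_sketch` function are hypothetical names used only for illustration; this is not the Equivset data or the AbuseFilter implementation, and it only covers the single-code-point characters from this report (so Q̃ is left out).

```python
# Hypothetical equivalence table built from the characters in this report.
# Code points are written as escapes so the literals are unambiguous.
GROUPS = {
    "E": "\u00C8\u00C9\u00CA\u1EBC",        # È É Ê Ẽ
    "I": "\u00CC\u00CD\u00CF",              # Ì Í Ï
    "O": "\u00D3\u00D2\u00D4\u00D5\u2205",  # Ó Ò Ô Õ ∅
    "U": "\u00DA\u0170\u00DC\u0168",        # Ú Ű Ü Ũ
}
EQUIV = {ch: canonical for canonical, chars in GROUPS.items() for ch in chars}


def ccnorm_sketch(text: str) -> str:
    """Replace each code point by its canonical letter when one is known."""
    return "".join(EQUIV.get(ch, ch) for ch in text)


sample = "".join(GROUPS[key] for key in "EIOU")
print(ccnorm_sketch(sample))  # EEEEIIIOOOOOUUUU
```

Characters without an entry pass through unchanged, which is the behaviour reported above for the unmapped characters.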

Details

	Subject	Repo	Branch	Lines +/-
	Add ¡¥©ªµ×İĴĵŉŊŋ€₱∅	mediawiki/libs/Equivset	master	+43 -12
	Expand set for lower/upper case characters which are alone in the set	mediawiki/libs/Equivset	master	+549 -27


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Umherirrender	T27619 Add more characters to ccnorm
		Resolved		Umherirrender	T178010 missing character equivalencies: ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ

Event Timeline

kaldari created this task. Oct 11 2017, 10:20 PM

Restricted Application added a subscriber: Aklapper. Oct 11 2017, 10:20 PM

kaldari mentioned this in T65242: ccnorm revamp: add a more sensible interface for normalised comparison. Oct 11 2017, 10:52 PM

More chars to the list:

ccnorm( '∆д©ĈĆČĎĘĒĔĖĚ€ĜĢĤĨĪĬİĴĵĶÑŃŇNŉŊŋŌŎŐ₱π®ŨŪŮŰŲµŴ×ÝŶŸŹŻŽ' )

should return

AACCCCDEEEEEEGGHIIIIJJKNNNNNNNOOOPRRUUUUUUWXYYYZZZ

I also suspect that many characters in this list aren't normalized either.

Huji awarded a token. Oct 12 2017, 2:40 AM

Huji subscribed.

• dpatrick moved this task from Backlog / Other to Other WMF team on the acl*security board. Oct 12 2017, 4:15 PM

I don't agree with € -> G and π -> R, but the rest look reasonable (although ∆ is debatable).

Note that in the proposal € is normalized to E, not to G. However, I agree that pi shouldn't be normalized to R; that equivalence would be rather arbitrary. If anything, pi resembles an n, but I'm not even sure that this is needed.

I didn't suggest € -> G. I put ĘĒĔĖĚ€ = EEEEEE and ĜĢ = GG.

∆ -> A
π -> R

I suggest these because they have already been used in vandal attacks on ptwiki and are now handled in some filters (https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/18 and https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/70, for example).

"π" resembles a cursive lowercase "r" (https://commons.wikimedia.org/wiki/File:Cursive.svg).

Another suggestion:

ccnorm( '¡' )

should return

Silent awarded a token. Oct 13 2017, 11:45 AM

He7d3r awarded a token. Oct 28 2017, 3:02 PM

Any news on this task? This is a serious problem on ptwiki, and it makes the filters very complex.
On top of that, the normalization of the character "|" makes things even harder.

Take https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/141 as an example. We test both with and without the ccnorm function to catch something like "[[Teste|foda-se]]" (the equivalent of [[Test|fuck you]]), and this makes the filter very verbose.

dmaza subscribed. Oct 31 2017, 3:09 PM

What happened to the characters "ÁÀÂÃ@ê" etc.? They were already being normalized, so I removed them from the filters on ptwiki. Now they are no longer normalized, and things like "FILHA DA PUT@ DE QUATRO" (son of a bitch [...]) get through...

@Silent that's a good question. I can confirm that the mapping is no longer there.

I've created T179834 to look into it.

Maybe they were lost during the move to the new library.

Would that be avoided if we had regression tests for abuse filters (T42478)?

@He7d3r Not in this particular case since this problem is related to the normalization lib (now Equivset)
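A regression check at the library level could look roughly like the following Python sketch. Both tables are hypothetical snapshots invented for illustration (they are not taken from the old AbuseFilter code or from Equivset), and `lost_mappings` is an assumed helper name.

```python
# Hypothetical snapshots of a normalization table before and after a migration.
old_table = {
    "\u00C1": "A",  # Á
    "\u00C0": "A",  # À
    "\u00C2": "A",  # Â
    "\u00C3": "A",  # Ã
    "@": "A",
    "\u00EA": "E",  # ê
}
new_table = {"\u00EA": "E"}


def lost_mappings(old: dict, new: dict) -> dict:
    """Return entries of `old` that are absent from `new` or map elsewhere."""
    return {ch: target for ch, target in old.items() if new.get(ch) != target}


lost = lost_mappings(old_table, new_table)
print(sorted(lost))  # ['@', 'À', 'Á', 'Â', 'Ã']
```

Run against real before and after data, a non-empty result would flag mappings dropped by a move like the one discussed here.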

matej_suchanek added a project: Equivset. Nov 24 2017, 5:56 PM

• TBolliger removed a project: Anti-Harassment. Feb 20 2018, 5:52 PM

Is there any estimate for when this task will be solved?

• chasemp triaged this task as Low priority. Dec 9 2019, 5:15 PM

• chasemp added a project: Security. Feb 10 2020, 11:00 PM

• chasemp removed a project: acl*security. Feb 20 2020, 8:08 PM

STran mentioned this in T265390: Investigation into AntiSpoof maintenance [4H]. Nov 26 2020, 6:25 AM

matej_suchanek added a parent task: T27619: Add more characters to ccnorm. Apr 10 2022, 12:24 PM

Change 818287 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Expand set for lower/upper case characters which are alone in the set

https://gerrit.wikimedia.org/r/818287

gerritbot added a project: Patch-For-Review. Jul 29 2022, 3:06 AM

Change 818287 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Expand set for lower/upper case characters which are alone in the set

https://gerrit.wikimedia.org/r/818287

Maintenance_bot removed a project: Patch-For-Review. Mar 31 2023, 7:11 AM

Trying to structure the task a bit, to note what was done by the merged patch set.

From the task description:
ccnorm( 'ÈÉÊẼÌÍÏÓÒÔÕÚŰÜŨ' )

are now part of Equivset. Of the characters in the task description, ∅Q̃ are still missing from Equivset. A new release is needed to get them working in AbuseFilter on WMF wikis.

In T178010#3678505, @Silent wrote:

ccnorm( 'ĈĆČĎĘĒĔĖĚĜĢĤĨĪĬĶÑŃŇNŌŎŐŨŪŮŰŲŴÝŶŸŹŻŽ' )

are now part of Equivset. Of the characters in that comment, ∆д©€İĴĵŉŊŋ₱π®µ× are still missing from Equivset. A new release is needed to get them working in AbuseFilter on WMF wikis.

In T178010#3680739, @Silent wrote:

ccnorm( '¡' )

This character is missing from Equivset.
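A coverage check like the one summarised above can be sketched in Python as follows. The `table` contents and the `uncovered` helper are hypothetical and only illustrate the idea; the real data lives in the Equivset repository.

```python
def uncovered(requested: str, table: dict) -> str:
    """Return the characters of `requested` that have no entry in `table`."""
    return "".join(ch for ch in requested if ch not in table)


# Hypothetical table holding a few of the characters reported as added.
table = {
    "\u0108": "C",  # Ĉ
    "\u0106": "C",  # Ć
    "\u010C": "C",  # Č
    "\u010E": "D",  # Ď
}
requested = "\u00A9\u0108\u0106\u010C\u010E\u20AC"  # © Ĉ Ć Č Ď €
print(uncovered(requested, table))  # ©€
```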

Change 904831 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

https://gerrit.wikimedia.org/r/904831

gerritbot added a project: Patch-For-Review. Mar 31 2023, 5:20 PM

The Q̃ in the task description is a composite character made of Q and the combining mark ̃. Such characters are not handled by Equivset.
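The difference can be seen with Python's standard `unicodedata` module. The `strip_marks` helper below is a hypothetical preprocessing step shown only as a sketch; it is not something Equivset or AbuseFilter is known to do.

```python
import unicodedata

q_tilde = "Q\u0303"  # Q followed by U+0303 COMBINING TILDE
e_tilde = "E\u0303"  # E followed by the same combining mark

# NFC composes E + tilde into the single code point U+1EBC, but Unicode has
# no precomposed Q with tilde, so that sequence stays two code points.
composed_e = unicodedata.normalize("NFC", e_tilde)
composed_q = unicodedata.normalize("NFC", q_tilde)
print(len(composed_e), len(composed_q))  # 1 2


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_marks(q_tilde))  # Q
```

A lookup keyed on single code points therefore never sees Q̃ as one unit; it sees Q and then the combining tilde.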

I do not agree that ∆, д and π are visually similar to the proposed letters.
I have added the others in a new patch set for review.

Umherirrender merged a task: T334039: ccnorm() doesn't convert « É » character. Apr 6 2023, 4:37 PM

Umherirrender added subscribers: Od1n, Daimona.

Change 904831 merged by jenkins-bot:

https://gerrit.wikimedia.org/r/904831

A new release is needed to get them working in AbuseFilter on WMF wikis.

Please create a new task if the remaining characters, or further ones, should be added, to make the discussion easier.

Maintenance_bot removed a project: Patch-For-Review. Apr 6 2023, 7:10 PM