
missing character equivalencies: ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ
Closed, Resolved (Public)

Description

ccnorm( 'ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ' ) should return EEEEIIIOOOOOQUUUU, but instead returns ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ. These equivalencies need to be added to the equivset.
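To make the report concrete, here is a minimal Python sketch of how a per-character equivalence table behaves. It is not the Equivset implementation or its data: `TOY_EQUIVSET` and `toy_ccnorm` are hypothetical names, and the table holds only a handful of the characters from this report. It illustrates one point only: a character with no entry in the table passes through unchanged, which is the behaviour described above.

```python
# Toy illustration only: a hand-written table, NOT the real Equivset data.
TOY_EQUIVSET = {
    "\u00c8": "E",  # È
    "\u00c9": "E",  # É
    "\u00ca": "E",  # Ê
    "\u00cc": "I",  # Ì
    "\u00cd": "I",  # Í
    "\u00d2": "O",  # Ò
    "\u00d3": "O",  # Ó
}


def toy_ccnorm(text):
    """Replace each code point by its mapped character, if it has one."""
    return "".join(TOY_EQUIVSET.get(ch, ch) for ch in text)


# Characters present in the table are replaced.
print(toy_ccnorm("\u00c8\u00c9\u00ca"))
# Ú and Ű are deliberately absent from the toy table, so they come back as-is.
print(toy_ccnorm("\u00da\u0170"))
```

Under this model, fixing the report amounts to adding the missing entries to the table.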

Event Timeline

More chars to the list:

ccnorm( '∆д©ĈĆČĎĘĒĔĖĚ€ĜĢĤĨĪĬİĴĵĶÑŃŇNʼnŊŋŌŎŐ₱π®ŨŪŮŰŲµŴ×ÝŶŸŹŻŽ' )

should return

AACCCCDEEEEEEGGHIIIIJJKNNNNNNNOOOPRRUUUUUUWXYYYZZZ

And I suspect there are many more characters in this list that aren't normalized either.

I don't agree with € -> G and π -> R, but the rest look reasonable (although ∆ is debatable).

Note that in the proposal € is normalized to E, not to G. However, I agree that pi shouldn't be normalized to r; this equivalence would be rather arbitrary. If anything, pi resembles an n, but I'm not even sure that this is needed.

I didn't suggest € -> G. I put ĘĒĔĖĚ€ = EEEEEE and ĜĢ = GG.


∆ -> A
π -> R

I suggest these because they have already been used in vandal attacks on ptwiki and are now used in some filters (https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/18 and https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/70, for example).

"π" resembles a cursive lowercase "r" (https://commons.wikimedia.org/wiki/File:Cursive.svg).


Another suggestion:

ccnorm( '¡' )

should return

I

Any news on this task? This problem is very serious on ptwiki and makes the filters very complex.
Besides, the normalization of the character "|" makes things even harder.

Take https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/141 as an example. We test both with and without the ccnorm function in order to catch something like "[[Teste|foda-se]]" (the equivalent of [[Test|fuck you]]), and this makes the filter very verbose.

What happened to the characters "ÁÀÂÃ@ê" etc.? They were already being normalized, so I removed them from the filters on ptwiki. Now they are unnormalized again, and things like "FILHA DA PUT@ DE QUATRO" (son of a bitch [...]) get through...

@Silent that's a good question. I can confirm that the mapping is no longer there.

I've created T179834 to look into it.

Maybe they were lost during the move to the new library.

Would that be avoided if we had regression tests for abuse filters (T42478)?

@He7d3r Not in this particular case since this problem is related to the normalization lib (now Equivset)
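A regression check at the library level is the kind of test that could catch a lost mapping like this one. The Python sketch below only illustrates the idea. `EXPECTED`, `find_regressions`, and the two sample tables are hypothetical, the `@` to `A` mapping is assumed from the "PUT@" example above, and none of it reflects Equivset's actual test suite or data.

```python
# Hypothetical pinned expectations: characters that filters rely on.
EXPECTED = {
    "\u00c1": "A",  # Á
    "\u00c0": "A",  # À
    "@": "A",       # assumed from the "PUT@" example in this thread
}


def find_regressions(table):
    """Return the pinned characters whose mapping is missing or changed."""
    return sorted(ch for ch, target in EXPECTED.items() if table.get(ch) != target)


# Made-up tables standing in for the data before and after a library move.
old_table = {"\u00c1": "A", "\u00c0": "A", "@": "A"}
new_table = {"\u00c0": "A"}

print(find_regressions(old_table))  # nothing lost
print(find_regressions(new_table))  # lists the characters that were dropped
```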

Is there any estimate for when this task will be solved?

Change 818287 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Expand set for lower/upper case characters which are alone in the set

https://gerrit.wikimedia.org/r/818287

Change 818287 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Expand set for lower/upper case characters which are alone in the set

https://gerrit.wikimedia.org/r/818287

Trying to structure the task a bit by noting what was done by the merged patch set.

From the task description:
ccnorm( 'ÈÉÊẼÌÍÏÓÒÔÕÚŰÜŨ' )

are now part of Equivset. From the list in the task description, the characters ∅ and Q̃ are still missing in Equivset. A new release is needed to get them working in AbuseFilter on WMF wikis.

ccnorm( 'ĈĆČĎĘĒĔĖĚĜĢĤĨĪĬĶÑŃŇNŌŎŐŨŪŮŰŲŴÝŶŸŹŻŽ' )

are now part of Equivset. From the list in the comment, the following characters are still missing in Equivset: ∆д©€İĴĵʼnŊŋ₱π®µ×. A new release is needed to get them working in AbuseFilter on WMF wikis.

ccnorm( '¡' )

This character is missing in Equivset.

Change 904831 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add ¡¥©ªµ×İĴĵʼnŊŋ€₱∅

https://gerrit.wikimedia.org/r/904831

The Q̃ in the task description is a composite character made of Q and ̃ (a combining tilde). Such characters are not handled by Equivset.
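The difference between a composite sequence and a precomposed character can be checked with Python's standard `unicodedata` module. This sketch is only an illustration of the Unicode side of the problem; `strip_marks` is a hypothetical helper, not something Equivset is known to do.

```python
import unicodedata

q_tilde = "Q\u0303"  # Q followed by U+0303 COMBINING TILDE (two code points)
e_tilde = "\u1ebc"   # precomposed LATIN CAPITAL LETTER E WITH TILDE (one code point)

# NFC cannot merge Q + tilde, because Unicode has no precomposed Q with tilde.
print(len(unicodedata.normalize("NFC", q_tilde)))

# The precomposed E with tilde splits into a base letter plus a mark under NFD.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", e_tilde)])


def strip_marks(text):
    """Hypothetical pre-processing step: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_marks(q_tilde))
print(strip_marks(e_tilde))
```

A table keyed on single code points can hold an entry for the precomposed Ẽ, but it has no single key that matches the two-code-point sequence Q̃.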

I do not agree that ∆дπ are visually similar.
I have added the others in a new patch set for review.

Change 904831 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add ¡¥©ªµ×İĴĵʼnŊŋ€₱∅

https://gerrit.wikimedia.org/r/904831

It needs a new release to get them working in AbuseFilter on WMF wikis.

Please create a new task if the remaining characters, or additional ones, should be added; that makes the discussion easier.