ccnorm( 'ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ' ) should return EEEEIIIOOOOOQUUUU, but instead returns ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ. These equivalencies need to be added to the equivset.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Umherirrender | T27619 Add more characters to ccnorm
Resolved | | Umherirrender | T178010 missing character equivalencies: ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ
Event Timeline
More chars to the list:
ccnorm( '∆д©ĈĆČĎĘĒĔĖĚ€ĜĢĤĨĪĬİĴĵĶÑŃŇNʼnŊŋŌŎŐ₱π®ŨŪŮŰŲµŴ×ÝŶŸŹŻŽ' )
should return
AACCCCDEEEEEEGGHIIIIJJKNNNNNNNPRRUUUUUUWXYYYZZZ
And I suspect there are many characters in this list that aren't normalized either.
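The behaviour requested in this comment can be sketched as a plain lookup table that maps each look-alike character to a canonical letter. This is only an illustration: the table below is a hypothetical two-group excerpt built from characters in this task, not the real Equivset data, and `ccnorm_sketch` is a made-up name rather than the AbuseFilter implementation.

```python
# Hypothetical mini equivalence table; NOT the real Equivset data.
# Each canonical letter lists the look-alike characters that should map to it.
GROUPS = {
    "E": "\u0118\u0112\u0114\u0116\u011a\u20ac",  # Ę Ē Ĕ Ė Ě €
    "G": "\u011c\u0122",                          # Ĝ Ģ
}

# Invert the groups into a per-character lookup.
CHAR_TO_CANONICAL = {
    ch: canonical for canonical, members in GROUPS.items() for ch in members
}


def ccnorm_sketch(text: str) -> str:
    """Replace every character found in the table; leave all others unchanged."""
    return "".join(CHAR_TO_CANONICAL.get(ch, ch) for ch in text)


print(ccnorm_sketch("\u0118\u0112\u0114\u0116\u011a\u20ac"))  # EEEEEE
```

A character that is absent from the table passes through unchanged, which matches the symptom reported in the task description.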
I don't agree with € -> G and π -> R, but the rest look reasonable (although ∆ is debatable).
Note that in the proposal € is normalized to E, not to G. However, I agree that pi shouldn't be normalized to r; this equivalence would be rather arbitrary. If anything, pi resembles an n, but I'm not even sure that this is needed.
I didn't suggest € -> G. I put ĘĒĔĖĚ€ = EEEEEE and ĜĢ = GG.
∆ -> A
π -> R
I suggest these because they have already been used in vandal attacks on ptwiki and are now used in some filters (https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/18 and https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/70, for example).
"π" resembles a lowercase cursive "r" (https://commons.wikimedia.org/wiki/File:Cursive.svg).
Another suggestion:
ccnorm( '¡' )
should return
I
Is there any news on this task? This problem is very troublesome on ptwiki, because it makes the filters very complex.
Besides, the normalization of the character "|" makes things even more difficult.
Take https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/141 as an example. We have to test both with and without the ccnorm function to catch something like "[[Teste|foda-se]]" (the equivalent of [[Test|fuck you]]), and this makes the filter very verbose.
What happened to the chars "ÁÀÂÃ@ê" etc.? They were already being normalized, so I removed them from the filters on ptwiki. Now they are no longer normalized, and things like "FILHA DA PUT@ DE QUATRO" (son of a bitch [...]) got through...
Change 818287 had a related patch set uploaded (by Umherirrender; author: Umherirrender):
[mediawiki/libs/Equivset@master] Expand set for lower/upper case characters which are alone in the set
Change 818287 merged by jenkins-bot:
[mediawiki/libs/Equivset@master] Expand set for lower/upper case characters which are alone in the set
Trying to structure the task a bit to show what was done by the merged patch set.
From the task description:
ccnorm( 'ÈÉÊẼÌÍÏÓÒÔÕÚŰÜŨ' )
These are now part of Equivset (from the list in the task description, the following characters are still missing in Equivset: ∅Q̃). A new release is needed to get them working in AbuseFilter on WMF wikis.
The characters from the list in the comment are now part of Equivset, except for the following, which are still missing: ∆д©€İĴĵʼnŊŋ₱π®µ×. A new release is needed to get them working in AbuseFilter on WMF wikis.
the character is missing in equivset
Change 904831 had a related patch set uploaded (by Umherirrender; author: Umherirrender):
[mediawiki/libs/Equivset@master] Add ¡¥©ªµ×İĴĵʼnŊŋ€₱∅
The Q̃ in the task description is a composite character made of Q and the combining mark ̃. Such characters are not handled by Equivset.
I do not agree that ∆дπ are visually similar to the suggested letters.
I have added the others in a new patch set for review.
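The point about composite characters can be checked with Python's standard `unicodedata` module. The sketch below shows that Q̃ is two code points even after NFC normalization, while Ẽ is a single precomposed code point. The helper `strip_combining_marks` is a hypothetical pre-processing step shown only for illustration; nothing in this task says that Equivset or AbuseFilter does this.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Decompose to NFD, then drop every combining mark (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


q_tilde = "Q\u0303"  # Q followed by COMBINING TILDE
e_tilde = "\u1ebc"   # LATIN CAPITAL LETTER E WITH TILDE (precomposed)

print(len(q_tilde), len(e_tilde))                   # 2 1
print(len(unicodedata.normalize("NFC", q_tilde)))   # 2: no precomposed form exists
print(strip_combining_marks(q_tilde + e_tilde))     # QE
```

A per-character lookup only ever sees the Q and the tilde separately, which is why a sequence like this cannot be covered by adding a single entry to a character table.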
Change 904831 merged by jenkins-bot:
[mediawiki/libs/Equivset@master] Add ¡¥©ªµ×İĴĵʼnŊŋ€₱∅
A new release is needed to get them working in AbuseFilter on WMF wikis.
Please create a new task if the remaining characters or additional ones should be added, to make the discussion easier.