Add more characters to ccnorm
Open, Normal, Public

Description

Currently only some characters are normalized to a "canonical" form. For example, although ccnorm("α") results in "A", ccnorm("ά") doesn't change anything.

The function should support the conversion of more characters.

The following list is based on what was available at en:MediaWiki:Titleblacklist, but maybe it is better to have different sets of characters depending on the case of the letter. For example, for D, but for d.

a: aαąăãàāάạậảấầẩắằẵẳẫặḁǟǡȁᾳὰᾀἁᾁἄᾄἂᾂἆᾆἅᾅἃᾃἇᾇáâäæåǻ٩4
b: bßβбв฿
c: cċĉ¢сćĉçč
d: dďḍðⅆ
e: éèëeęěĕėẻẹếềễểȨȩḝēḗȅȇệḙḛ3عڠẽə
f: fғ₣
g: gĝģğġɠǥǧǵḡԌ
h: hήĥħȞʰʱḣḥḧḩḫнңӈӉηἠἡἢἣἤἥἦἧὴᾐћⱧԋњһ
i: iìíîïĩļǐīĭḷŀιїɨ!łľį
k: kķкќқҝҡҟӄ
l: l₤ĺľḷłŀλлљ
m: mɯḿṁṃмӍμ₥
n: n₦ńñņňṇν
o: oóòôöõǒōŏǫőœøəόοωὸὀὁὄὂὅὃоөӧӫδσʘǿọ
p: pƥṕṗǷ₧þρр
q: qɊʠ
r: rŕŗřȑȓƦʳʴʵʶṙṛṝṟя®
s: s$śŝşšṣσѕ
t: tţťṭτтŧ
u: uúùûüũůǔūǖǘǚǜŭųű
w: wŵẁẃẅẇẉ₩
x: xҳχ
y: yýÿŷƴȲʸẏỳỵỷỹʊύυϋὑὓὕὗὺῠῡуϓ
z: zźžż
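
As a rough illustration of how a list in this format could be turned into a lookup table, here is a minimal Python sketch. It is not AntiSpoof's actual implementation: the names `build_equiv_table` and `normalize` are hypothetical, and mapping each character to the upper-case letter is an assumption based on the ccnorm("α") example above. Only four lines of the list are used to keep it short.

```python
def build_equiv_table(spec):
    """Map every listed look-alike to its (upper-case) canonical letter."""
    table = {}
    for line in spec.splitlines():
        canonical, sep, lookalikes = line.partition(":")
        if not sep:
            continue  # skip lines that are not "letter: characters"
        canonical = canonical.strip().upper()
        for ch in lookalikes.strip():
            # Assumption: if a character is listed twice, the first entry wins.
            table.setdefault(ch, canonical)
    return table


def normalize(text, table):
    """Replace each character that has an entry; leave the rest untouched."""
    return "".join(table.get(ch, ch) for ch in text)


# Excerpt of the list above (hypothetical subset, for illustration only).
SPEC = """\
b: bßβбв฿
f: fғ₣
q: qɊʠ
x: xҳχ
"""

TABLE = build_equiv_table(SPEC)
print(normalize("βғʠχ", TABLE))  # BFQX
```

Characters without an entry pass through unchanged, which matches the behaviour described above for characters the function does not know about.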

Details

Reference
bz25619

Event Timeline

bzimport raised the priority of this task to Normal. Nov 21 2014, 11:12 PM
bzimport added a project: AntiSpoof.
bzimport set Reference to bz25619.
He7d3r created this task. Oct 22 2010, 8:48 PM

dodo.wikipedia wrote:

Proposed patch

I've attached a proposed patch that would add the characters to the AntiSpoof checks (which are also used by the AbuseFilter).

attachment AntiSpoof.patch ignored as obsolete

matthew.britton wrote:

Changed extension to AntiSpoof, since that's where the change would have to be made (unless AbuseFilter was fixed by an independent re-implementation of the normalization, which seems pointless).

dodo.wikipedia wrote:

Yes, they're the same ones that I added to mediawiki.org in the edits you linked.

They were committed in r76484, then.

dodo.wikipedia wrote:

Okay, thanks.

The function still doesn't work with all of the characters mentioned in comment 0 above.

Using ccnorm in the string "ìíîïĩļǐīĭḷĿї!ľį₤ĺľḷĿΛЛљóòôöõǒōŏǫőόὸὀὁὄὂὅὃọ$śŝşšṣσ" doesn't change any of its characters.

sumanah wrote:

EdoDodo, does your patch still apply?

I recommend that you get a developer access account https://www.mediawiki.org/wiki/Developer_access so that you can commit your patches directly into the source control system in the future -- in fact, you could update and submit this patch, and get it reviewed faster. I'm sorry for the delay.

demon added a comment. May 16 2012, 7:27 PM

(In reply to comment #8)

EdoDodo, does your patch still apply?

This was already applied, see comment 5.

(In reply to comment #7)

The function still doesn't work with all of the characters mentioned in comment 0 above.
Using ccnorm in the string "ìíîïĩļǐīĭḷĿї!ľį₤ĺľḷĿΛЛљóòôöõǒōŏǫőόὸὀὁὄὂὅὃọ$śŝşšṣσ" doesn't change any of its characters.

Still reproducible.

It looks like all of the equivalents were added except for the ones corresponding to the letters I, L, O, and S. Of course this makes sense since those 4 letters have never worked in AntiSpoof due to bug 27987.

I fixed bug 27987 in change I613f9917, so I'll do a follow-up commit to add the missing equivs.

Change 92154 had a related patch set uploaded by Kaldari:
Adding missing equivalents for I, L, O, and S.

https://gerrit.wikimedia.org/r/92154

I added all the missing equivalencies, except for 4 or 5 that either didn't make sense or would have conflicted with valid equivalencies for Greek. For example:
λ->L
л->L
љ->L
σ->S
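
One way a conflict of this kind could be spotted is to check whether a character is listed under more than one letter. The sketch below is only an illustration in Python, not the code from the patch; `find_conflicts` is a hypothetical helper, and the two input lines are a shortened excerpt of the o and s lines from the task description, where σ appears to be listed under both.

```python
from collections import defaultdict


def find_conflicts(spec):
    """Return characters that are listed under more than one canonical letter."""
    targets = defaultdict(set)
    for line in spec.splitlines():
        canonical, sep, lookalikes = line.partition(":")
        if not sep:
            continue  # not a "letter: characters" line
        for ch in lookalikes.strip():
            targets[ch].add(canonical.strip().upper())
    return {ch: sorted(letters) for ch, letters in targets.items() if len(letters) > 1}


# Shortened excerpt of the "o" and "s" lines (illustration only).
SPEC = """\
o: oοωδσʘ
s: s$σѕ
"""

print(find_conflicts(SPEC))  # {'σ': ['O', 'S']}
```

A character reported here can only be normalized to one of the candidate letters, so each such case needs a manual decision.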

Change 97304 had a related patch set uploaded by Kaldari:
Adding 2 new equivalencies (partial fix for bug 25619)

https://gerrit.wikimedia.org/r/97304

Since I haven't had any luck getting code review on https://gerrit.wikimedia.org/r/92154 I submitted https://gerrit.wikimedia.org/r/97304 as a simpler version. It only adds ! and $ and nothing else.

Both patches are still open. The first one got some reviews and now looks like it is waiting for a new upload from Kaldari. The second, simpler one got no reviews at all.

(In reply to Ryan Kaldari from comment #15)

Since I haven't had any luck getting code review on https://gerrit.wikimedia.org/r/92154 I submitted https://gerrit.wikimedia.org/r/97304 as a simpler version. It only adds ! and $ and nothing else.

I'm not sure whether sending a request to wikitech-l would help get these two patches reviewed, but pinging on the patches and here doesn't seem to be enough... Any ideas?

demon removed a subscriber: demon. Dec 16 2014, 7:57 PM

Change 184850 had a related patch set uploaded (by Whym):
Add Japanese normalization pairs

https://gerrit.wikimedia.org/r/184850

Patch-For-Review

whym added a subscriber: whym. Edited Jan 15 2015, 1:23 PM

I also see that the "editable" sets on mediawiki.org have some more changes that are included in none of the gerrit changes mentioned here.

EDIT: added link

Change 184850 merged by jenkins-bot:
Add Japanese normalization pairs

https://gerrit.wikimedia.org/r/184850

Restricted Application added a subscriber: Aklapper. Jan 16 2016, 4:49 AM

Change 311310 had a related patch set uploaded (by MusikAnimal):
Adding missing equivalents for I, L, O, and S.

https://gerrit.wikimedia.org/r/311310

Qgil removed a subscriber: Qgil. Sep 19 2016, 8:58 AM

Change 311310 merged by jenkins-bot:
Adding missing equivalents for I, L, O, and S.

https://gerrit.wikimedia.org/r/311310

He7d3r updated the task description. Nov 30 2016, 11:32 AM

Looks like this ticket can be closed, correct? Is this released to production?

Saw two more O look-alikes from the ntsamr spambot that this does not handle: & ߋ (the latter is hard to copy, but it's Unicode U+07CB)

@zhuyifei1999: Thanks for the report!

Tgr added a subscriber: Tgr. Feb 14 2018, 8:38 AM

Also 1 and I are in the same equivalence set but l is not. Those two sets should probably be merged.
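
To make the merge suggestion concrete, here is a small union-find sketch in Python. It is an illustration only: `merge_sets` is a hypothetical helper, the input sets are made up for the example (the real AntiSpoof sets are larger), and the assumption that l currently sits in a set with L is mine. Declaring I and l equivalent is what joins the two sets.

```python
def merge_sets(groups):
    """Merge all groups that are connected through a shared member (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for group in groups:
        if not group:
            continue
        find(group[0])  # register single-member groups too
        for member in group[1:]:
            union(group[0], member)

    merged = {}
    for member in list(parent):
        merged.setdefault(find(member), set()).add(member)
    return sorted(sorted(group) for group in merged.values())


# Hypothetical sets: {1, I} and {l, L}, plus the proposed new pair I ~ l.
print(merge_sets([["1", "I"], ["l", "L"], ["I", "l"]]))  # [['1', 'I', 'L', 'l']]
```

Without the extra pair the two sets stay separate, which is the situation described above.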