Add more characters to ccnorm
Open, Medium, Public

Description

Currently only some characters are normalized to a "canonical" form. For example, although ccnorm("α") results in "A", ccnorm("ά") doesn't change anything.

The function should support the conversion of more characters.
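To illustrate the gap, here is a minimal Python sketch of a table lookup with a fallback that strips combining marks before retrying the lookup. This is not the AntiSpoof implementation: the `EQUIV` table, the function names, and the fallback itself are assumptions made for illustration, and the table holds only a few sample pairs.

```python
import unicodedata

# Hypothetical sample table: a few lookalikes mapped to a canonical letter.
EQUIV = {"a": "A", "A": "A", "\u03b1": "A", "4": "A"}  # \u03b1 is Greek alpha


def strip_marks(ch: str) -> str:
    """Decompose a character (NFD) and drop its combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize(text: str, table: dict = EQUIV) -> str:
    """Map each character through the table, retrying without accents."""
    out = []
    for ch in text:
        if ch in table:
            out.append(table[ch])
        else:
            # Fall back to the unaccented base; keep the character if unknown.
            out.append(table.get(strip_marks(ch), ch))
    return "".join(out)


print(normalize("\u03b1"))  # plain alpha, found directly in the table
print(normalize("\u03ac"))  # alpha with tonos, found via the fallback
```

With such a fallback, accented forms would not each need their own table entry, although characters that are not canonical compositions (for example "æ" or "ł") would still have to be listed explicitly.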

The following list is based on what was available at en:MediaWiki:Titleblacklist, but maybe it is better to have different sets of characters depending on the case of the letter. For example, for D, but for d.

a: aαąăãàāάạậảấầẩắằẵẳẫặḁǟǡȁᾳὰᾀἁᾁἄᾄἂᾂἆᾆἅᾅἃᾃἇᾇáâäæåǻ٩4
b: bßβбв฿
c: cċĉ¢сćĉçč
d: dďḍðⅆ
e: éèëeęěĕėẻẹếềễểȨȩḝēḗȅȇệḙḛ3عڠẽə
f: fғ₣
g: gĝģğġɠǥǧǵḡԌ
h: hήĥħȞʰʱḣḥḧḩḫнңӈӉηἠἡἢἣἤἥἦἧὴᾐћⱧԋњһ
i: iìíîïĩļǐīĭḷŀιїɨ!łľį
k: kķкќқҝҡҟӄ
l: l₤ĺľḷłŀλлљ
m: mɯḿṁṃмӍμ₥
n: n₦ńñņňṇν
o: oóòôöõǒōŏǫőœøəόοωὸὀὁὄὂὅὃоөӧӫδσʘǿọ
p: pƥṕṗǷ₧þρр
q: qɊʠ
r: rŕŗřȑȓƦʳʴʵʶṙṛṝṟя®
s: s$śŝşšṣσѕ
t: tţťṭτтŧ
u: uúùûüũůǔūǖǘǚǜŭųű
w: wŵẁẃẅẇẉ₩
x: xҳχ
y: yýÿŷƴȲʸẏỳỵỷỹʊύυϋὑὓὕὗὺῠῡуϓ
z: zźžż

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:12 PM
bzimport added a project: AntiSpoof.
bzimport set Reference to bz25619.

dodo.wikipedia wrote:

Proposed patch

I've attached a proposed patch that would add the characters to the AntiSpoof checks (which are also used by the AbuseFilter).

attachment AntiSpoof.patch ignored as obsolete

matthew.britton wrote:

Changed extension to AntiSpoof, since that's where the change would have to be made (unless AbuseFilter was fixed by an independent re-implementation of the normalization, which seems pointless).

dodo.wikipedia wrote:

Yes, they're the same ones that I added to mediawiki.org in the edits you linked.

They were committed in r76484, then.

dodo.wikipedia wrote:

Okay, thanks.

The function still doesn't work with all characters mentioned in comment 0 above.

Using ccnorm on the string "ìíîïĩļǐīĭḷĿї!ľį₤ĺľḷĿΛЛљóòôöõǒōŏǫőόὸὀὁὄὂὅὃọ$śŝşšṣσ" doesn't change any of its characters.

sumanah wrote:

EdoDodo, does your patch still apply?

I recommend that you get a developer access account https://www.mediawiki.org/wiki/Developer_access so that you can commit your patches directly into the source control system in the future -- in fact, you could update and submit this patch, and get it reviewed faster. I'm sorry for the delay.

(In reply to comment #8)

EdoDodo, does your patch still apply?

This was already applied, see comment 5.

(In reply to comment #7)

The function still doesn't work with all characters mentioned in comment 0 above.

Using ccnorm on the string "ìíîïĩļǐīĭḷĿї!ľį₤ĺľḷĿΛЛљóòôöõǒōŏǫőόὸὀὁὄὂὅὃọ$śŝşšṣσ" doesn't change any of its characters.

Still reproducible.

It looks like all of the equivalents were added except for the ones corresponding to the letters I, L, O, and S. Of course this makes sense since those 4 letters have never worked in AntiSpoof due to bug 27987.

I fixed bug 27987 in change I613f9917, so I'll do a follow-up commit to add the missing equivs.

Change 92154 had a related patch set uploaded by Kaldari:
Adding missing equivalents for I, L, O, and S.

https://gerrit.wikimedia.org/r/92154

I added all the missing equivalencies, except for 4 or 5 that either didn't make sense or would have conflicted with valid equivalencies for Greek. For example:
λ->L
л->L
љ->L
σ->S

Change 97304 had a related patch set uploaded by Kaldari:
Adding 2 new equivalencies (partial fix for bug 25619)

https://gerrit.wikimedia.org/r/97304

Since I haven't had any luck getting code review on https://gerrit.wikimedia.org/r/92154 I submitted https://gerrit.wikimedia.org/r/97304 as a simpler version. It only adds ! and $ and nothing else.

Both patches are still open. The first one got some reviews and now looks like it is waiting for a new upload from Kaldari. The second one, the simpler version, got no reviews at all.

(In reply to Ryan Kaldari from comment #15)

Since I haven't had any luck getting code review on
https://gerrit.wikimedia.org/r/92154 I submitted
https://gerrit.wikimedia.org/r/97304 as a simpler version. It only adds !
and $ and nothing else.

I'm not sure whether sending a request to wikitech-l could help get these two patches reviewed, but pinging on the patches and here doesn't seem to be enough... Any ideas?

Change 184850 had a related patch set uploaded (by Whym):
Add Japanese normalization pairs

https://gerrit.wikimedia.org/r/184850

Patch-For-Review

I also see that the "editable" sets on mediawiki.org have some more changes that are included in none of the gerrit changes mentioned here.

EDIT: added link

Change 184850 merged by jenkins-bot:
Add Japanese normalization pairs

https://gerrit.wikimedia.org/r/184850

Change 311310 had a related patch set uploaded (by MusikAnimal):
Adding missing equivalents for I, L, O, and S.

https://gerrit.wikimedia.org/r/311310

Change 311310 merged by jenkins-bot:
Adding missing equivalents for I, L, O, and S.

https://gerrit.wikimedia.org/r/311310

Looks like this ticket can be closed, correct? Is this released to production?

Saw two more O's from the ntsamr spambot that this does not handle: & ߋ (the latter is hard to copy, but it's Unicode U+07CB)

Also 1 and I are in the same equivalence set but l is not. Those two sets should probably be merged.

If you take a look at the code I maintain on this edit filter on en-wiki, you'll find many more variants of each letter of the alphabet that aren't listed here and that should also be added to the ccnorm function. We definitely need to expand this function and get the additional variants added. I have LTA users who are using different font text in their usernames and abusive edits in order to bypass edit filters. Keeping the filter I created up to date is becoming a long and daunting task, and it's a cat-and-mouse game in which we'll always be one step behind if we don't do this...

This webpage is what the LTA user is using to quickly generate text in different fonts, which they then use to create accounts, get around edit filters, and make abusive edits to pages...

In T27619#3970962, @Tgr wrote:

Also 1 and I are in the same equivalence set but l is not. Those two sets should probably be merged.

This is a result of the 2 competing use cases of Equivset:

  • To prevent spoofing of usernames (AntiSpoof)
  • To create "bad word" filters in AbuseFilter

I and L being in the same equivalence set makes sense for AntiSpoof, but not for AbuseFilter (as it would make construction of the filters unintuitive). The ultimate solution to this problem is probably to adopt confusables.txt (or a derivative) for AntiSpoof, and tailor Equivset for AbuseFilter.
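The tension between the two use cases can be sketched in Python with two separate tables. The sets below are hypothetical and far smaller than the real Equivset data; they only show why a merged I/L set helps username spoof detection while hurting word filters.

```python
def build_table(sets: dict) -> dict:
    """Flatten {canonical: members} into a per-character lookup table."""
    return {ch: canonical for canonical, members in sets.items() for ch in members}


def normalize(text: str, table: dict) -> str:
    """Replace every character that has a table entry; keep the rest."""
    return "".join(table.get(ch, ch) for ch in text)


# Hypothetical sets, for illustration only.
SPOOF_TABLE = build_table({"I": "Ii1lL"})            # I, L and 1 merged
FILTER_TABLE = build_table({"I": "Ii1", "L": "lL"})  # I and L kept apart

for name in ("Bill", "BiII"):
    print(name, normalize(name, SPOOF_TABLE), normalize(name, FILTER_TABLE))
```

With the merged table both names collapse to the same string, which is what a spoofing check wants. With the separate table a filter author can still write a pattern containing "L" and have it behave as expected.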

The characters "ª" (A), "º" (O) and "°" (O) should also be included.

Billinghurst added a subscriber: Billinghurst.

@Proc AbuseFilter leverages the character set from the AntiSpoof extension, so this task belongs under that project, not under AbuseFilter.

Due to recent abuse that can be seen for example at https://simple.wikipedia.org/wiki/Special:Contributions/46.134.191.203, please also add the mathematical alphanumeric characters (see https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols for character tables).
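As a point of comparison, the mathematical alphanumeric characters carry Unicode compatibility decompositions, so standard NFKC normalization already folds them to plain letters. The Python sketch below shows this; `fold_compat` is a hypothetical helper, and uppercasing the result only mimics the uppercase canonical form mentioned in the task description. It is not how ccnorm is implemented.

```python
import unicodedata


def fold_compat(text: str) -> str:
    """Apply NFKC compatibility normalization, then uppercase the result."""
    return unicodedata.normalize("NFKC", text).upper()


# Bold script A, bold script b, double-struck c (escaped for readability).
styled = "\U0001D4D0\U0001D4EB\U0001D554"
print(fold_compat(styled))

# NFKC does not touch cross-script lookalikes such as Cyrillic "а" (U+0430),
# so an equivalence table is still needed for those.
print(fold_compat("\u0430") == "A")
```

NFKC could therefore cover the whole block without listing each styled letter, but it would complement the equivalence sets rather than replace them.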

You all should check out edit filters 51 and 53 on the English Wikipedia... I've been working to stay on top of abusive usernames and edits for years now, and I've compiled a large list of letters that ccnorm doesn't catch. I have appended these letters to the table from the task description. See below:

a: aαąăãàāάạậảấầẩắằẵẳẫặḁǟǡȁᾳὰᾀἁᾁἄᾄἂᾂἆᾆἅᾅἃᾃἇᾇáâäæåǻ٩4AÅaà🇦Ꭿ4ㅂ月A͜͡𝓐𝓪𝒜𝒶𝔸𝕒Aa𝘈𝘢Ꭺαᴀ∀ɐᴬᵃₐ
b: bßβбв฿Bbß𝓑𝓫𝐵𝒷𝔹𝕓Bb𝘉𝘣ʙᙠqᴮᵇ
c: cċĉ¢сćĉçčCcㄷс𝓒𝓬𝒞𝒸ℂ𝕔Cc𝘊𝘤ᴄɔᶜ
d: dďḍðⅆDd🇩𝓓𝓭𝒟𝒹𝔻𝕕Dd𝘋𝘥შᗡᴅpᴰᵈ
e: éèëeęěĕėẻẹếềễểȨȩḝēḗȅȇệḙḛ3عڠẽəEe3🇪Ꮛㅌ€三𝓔𝓮𝐸𝑒𝔼𝕖Ee𝘌𝘦ƎǝᴇєӛᎬᴱᵉₑ
f: fғ₣Ff🇫ㅋ𝓕𝓯𝐹𝒻𝔽𝕗Ff𝘍𝘧ƒҒℲꜰɟᶠ
g: gĝģğġɠǥǧǵḡԌGg巨ġ𝓖𝓰𝒢𝑔𝔾𝕘Gg𝘎𝘨𝔤⅁ɢɓᴳᵍ
h: hήĥħȞʰʱḣḥḧḩḫнңӈӉηἠἡἢἣἤἥἦἧὴᾐћⱧԋњһHh𝓗𝓱𝐻𝒽ℍ𝕙Hh𝘏𝘩ʜɥᴴʰₕ🇭н
i: iìíîïĩļǐīĭḷŀιїɨ!łľįIi工1í🇮!𝘐𝘪ㅣ|𝕝l𝒾𝓁𝓘𝓲𝐼𝕀𝕚IiᏆɪıЇ𝘭ᴵⁱᵢ
j: Jj𝓙𝓳𝒥𝒿𝕁𝕛Jj𝘑𝘫ᴊɾᴶʲⱼ
k: kķкќқҝҡҟӄKkкㅈᏦ𝓚𝓴𝒦𝓀𝕂𝕜KkϏ𝘒𝘬⋊ᴋκʞᴷᵏₖ
l: l₤ĺľḷłŀλлљLlㄴㅣ|1𝕀𝕚𝒾𝓛𝓵𝐿𝓁𝕃𝕝Ll𝘓𝘭Ꮮ˥ʟí🇮!IiЇ𝘐𝘪ᴸˡₗ
m: mɯḿṁṃмӍμ₥Mmм𝓜𝓶𝑀𝓂𝕄𝕞Mm𝘔𝘮რლWᴍɯᴹᵐₘ
n: n₦ńñņňṇνNnпŊ冂Ꮑ𝓝𝓷𝒩𝓃ℕ𝕟Nn𝘕𝘯ɴῃИuᴺⁿₙ
o: oóòôöõǒōŏǫőœøəόοωὸὀὁὄὂὅὃоөӧӫδσʘǿọOo0ÒÔÓóQㅇøᎾ𝓞𝓸𝒪𝑜𝕆𝕠Oo𝘖οი𝘰ᴏσᴼᵒₒ
p: pƥṕṗǷ₧þρрPp尸𝓟𝓹𝒫𝓅ℙ𝕡Pp𝘗𝘱Ԁᴘρᴾᵖₚ
q: qɊʠQq𝓠𝓺𝒬𝓆ℚ𝕢Qqϙ𝐐𝐪𝘘𝘲
r: rŕŗřȑȓƦʳʴʵʶṙṛṝṟя®Rrг🇷ㄱᏒ民®ʁ𝓡𝓻𝑅𝓇ℝ𝕣Rr𝘙𝘳ᴚяʀɹᎡᴿʳᵣ
s: s$śŝşšṣσѕSsşŠš$Ꭶ𝓢𝓼𝒮𝓈𝕊𝕤Ss𝘚𝘴ꜱ🇸Տˢₛ
t: tţťṭτтŧTt七ㅜ𝓣𝓽𝒯𝓉𝕋𝕥ㄒTt𝘛𝘵⊥ᴛτʇᵀᵗₜ
u: uúùûüũůǔūǖǘǚǜŭųű[Uuü∪🇺니心𝓤𝓾𝒰𝓊𝕌𝕦Uu𝘜𝘶∩ᴜnʉɄᵁᵘᵤ
v: Vv𝓥𝓿𝒱𝓋𝕍𝕧Vv𝘝𝘷ᴠΛʌνⱽᵛᵥ
w: wŵẁẃẅẇẉ₩WwᏔ𝓦𝔀𝒲𝓌𝕎𝕨Wwᴡ𝘞𝘸ᴡMʍᵂʷᵚ
x: xҳχXx𝓧𝔁𝒳𝓍𝕏𝕩Xx𝘟𝘹ˣₓ
y: yýÿŷƴȲʸẏỳỵỷỹʊύυϋὑὓὕὗὺῠῡуϓYy𝓨𝔂𝒴𝓎𝕐𝕪Yy𝘠𝘺უყʏ⅄ʎʸ
z: zźžżZzㄹ리𝓩𝔃𝒵𝓏ℤ𝕫Zz𝘡𝘻ᴢ弓ᶻ

Some of the characters I added might duplicate ones that are already in the table; I simply appended the letters that I've been keeping track of in the edit filters I maintain (51 and 53). This should prove helpful, and I really hope that the table I supply here is used when ccnorm is updated.

I find it quite weird that you often only have parts of quite specific character sets. For example, you have the regional indicator symbols for A D E F H I L R S U (🇦 🇩 🇪 🇫 🇭 🇮 🇮 🇷 🇸 🇺) but not for the rest of the alphabet (🇧 🇨 🇬 🇯 🇰 🇲 🇳 🇴 🇵 🇶 🇹 🇻 🇼 🇽 🇾 🇿). I also suggest adding other symbols from the Enclosed Alphanumerics and Enclosed Alphanumeric Supplement Unicode blocks, namely the following, where each row forms an alphabet:

ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ //first one is in fact already covered for some reason
⒜⒝⒞⒟⒠⒡⒢⒣⒤⒥⒦⒧⒨⒩⒪⒫⒬⒭⒮⒯⒰⒱⒲⒳⒴⒵
🄐🄑🄒🄓🄔🄕🄖🄗🄘🄙🄚🄛🄜🄝🄞🄟🄠🄡🄢🄣🄤🄥🄦🄧🄨🄩
🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉
🅐🅑🅒🅓🅔🅕🅖🅗🅘🅙🅚🅛🅜🅝🅞🅟🅠🅡🅢🅣🅤🅥🅦🅧🅨🅩
🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉

From the same blocks, we should also add 🄪 for S, 🄫 for C, 🄬 for R, and 🆭 for M. Adding the enclosed numbers might also be helpful.
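Because each of these alphabets, and the regional indicator symbols, occupies a contiguous run of 26 code points, the pairs could be generated rather than typed by hand. The Python sketch below does this; the dictionary and function names are hypothetical, and the output is a plain character-to-letter mapping, not the actual Equivset file format.

```python
import string

# First code point of each 26-letter run (letters are in alphabetical order).
ALPHABET_STARTS = {
    "parenthesized small": 0x249C,
    "circled capital": 0x24B6,
    "circled small": 0x24D0,
    "parenthesized capital": 0x1F110,
    "squared capital": 0x1F130,
    "negative circled capital": 0x1F150,
    "negative squared capital": 0x1F170,
    "regional indicator": 0x1F1E6,
}


def build_enclosed_table() -> dict:
    """Map every character in the runs above to its plain capital letter."""
    table = {}
    for start in ALPHABET_STARTS.values():
        for offset, letter in enumerate(string.ascii_uppercase):
            table[chr(start + offset)] = letter
    return table


TABLE = build_enclosed_table()
print(len(TABLE))  # number of generated pairs
print(TABLE["\u24e9"], TABLE["\U0001F1FF"])  # circled small z, regional indicator Z
```

One-off symbols such as the four mentioned above do not sit in these runs and would still need individual entries.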

Inserting the lines from your list into ccnorm() at Special:AbuseFilter/tools shows that the majority of the symbols are already implemented, though a few of them, as well as my findings, are not.
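A check like this can also be scripted. The Python sketch below reports, per letter, which candidate characters a normalizer fails to map to that letter. `uncovered` is a hypothetical helper and `toy_normalize` is a stand-in with three entries; the real ccnorm is only reachable through AbuseFilter.

```python
def uncovered(rows: dict, normalize) -> dict:
    """Return, per letter, the candidates that `normalize` does not map to it."""
    missing = {}
    for letter, candidates in rows.items():
        expected = letter.upper()
        bad = [ch for ch in candidates if normalize(ch) != expected]
        if bad:
            missing[letter] = bad
    return missing


def toy_normalize(text: str) -> str:
    """Stand-in normalizer for demonstration only; not the real ccnorm."""
    table = {"a": "A", "\u03b1": "A", "b": "B"}
    return "".join(table.get(ch, ch) for ch in text)


# Two short sample rows: a, Greek alpha, a with ogonek; b, Cyrillic be.
rows = {"a": "a\u03b1\u0105", "b": "b\u0431"}
print(uncovered(rows, toy_normalize))
```

Note that the rows are iterated one code point at a time, so entries built from combining sequences would be split into their parts and need separate handling.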

But even among the already covered ones, you only seem to have the bold mathematical characters for one letter (𝐐𝐪) but not the rest (𝐀𝐚𝐁𝐛𝐂𝐜𝐃𝐝𝐄𝐞𝐅𝐟𝐆𝐠𝐇𝐡𝐈𝐢𝐉𝐣𝐊𝐤𝐋𝐥𝐌𝐦𝐍𝐧𝐎𝐨𝐏𝐩𝐑𝐫𝐒𝐬𝐓𝐭𝐔𝐮𝐕𝐯𝐖𝐰𝐗𝐱𝐘𝐲𝐙𝐳), and the italic ones only for a few letters (𝐵𝐶𝐸𝑒𝐹𝑔𝐻𝐼𝐿𝑀𝑜𝑅) but not the rest (𝐴𝑎𝑏𝑐𝐷𝑑𝑓𝐺𝑕ℎ𝑖𝐽𝑗𝐾𝑘𝑙𝑚𝑁𝑛𝑂𝑃𝑝𝑄𝑞𝑟𝑆𝑠𝑇𝑡𝑈𝑢𝑉𝑣𝑊𝑤𝑋𝑥𝑌𝑦𝑍𝑧; by the way, ℎ is still to be implemented). Lots of other mathematical letter variants are missing as well.

Hi @1234qwer1234qwer4! I added the characters that I personally found LTA accounts abusing in order to get around the edit filters. I acknowledge that they don't cover the complete alphabet. I appreciate you taking the time to locate the missing ones and add them to the list. :-)