
Enhance Equivset with regard to Persian/Arabic characters
Open, Low, Public

Description

Please add these two characters to Equivset:

ۍ U+06CD --> ی
ݸ U+0778 --> و

code

6CD ۍ => 649 ى

778 ݸ => 648 و

Details

Related Gerrit Patches:
mediawiki/libs/Equivset : masterAdd two more characters to Equivset

Event Timeline

Yamaha5 created this task. Dec 16 2018, 1:06 AM
Huji claimed this task. Dec 16 2018, 2:38 AM
Huji triaged this task as Low priority.
Huji renamed this task from "add character to Equivset" to "Add two more characters to Equivset". Dec 16 2018, 3:14 AM

Change 479970 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/libs/Equivset@master] Add two more characters to Equivset

https://gerrit.wikimedia.org/r/479970

I checked the original file and it has many problems. For example, it maps foo ==> bar and later maps bar ==> blabla.
I have listed the parts of the code that should be changed:

64A ي => 649 ى  ==change to==>  649 ى => 64A ي
6AA ڪ => 6A9 ک  ==change to==>  6AA ڪ => 643 ك
6CC ی => 649 ى  ==change to==>  6CC ی => 64A ي
6D0 ې => 67B ٻ  ==change to==>  6D0 ې => 64A ي
FED9 ﻙ => 6A9 ک  ==change to==>  FED9 ﻙ => 643 ك
FEDA ﻚ => 6A9 ک  ==change to==>  FEDA ﻚ => 643 ك
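The "foo ==> bar, then bar ==> blabla" problem described above can be checked mechanically. The following is only a rough Python sketch, not part of Equivset: `find_chains` is a hypothetical helper, the table is a plain dict rather than the real data file, and the last entry is invented purely to show what a chain looks like.

```python
def find_chains(mapping):
    """Return (source, target, final) triples where the target is itself remapped."""
    chains = []
    for src, dst in mapping.items():
        # A chain exists when the replacement character has its own replacement.
        if dst in mapping and mapping[dst] != dst:
            chains.append((src, dst, mapping[dst]))
    return chains


sample = {
    "\u06CC": "\u0649",  # 6CC => 649, quoted in this task
    "\u064A": "\u0649",  # 64A => 649, quoted in this task
    "\u06D0": "\u067B",  # 6D0 => 67B, quoted in this task
    "\u067B": "\u0628",  # HYPOTHETICAL entry (67B => 628), added only to create a chain
}

for src, dst, final in find_chains(sample):
    print(f"U+{ord(src):04X} => U+{ord(dst):04X} => U+{ord(final):04X}")
```

Running a check like this over the full table would list every entry whose target is remapped again, which is the kind of inconsistency reported here.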

---- should be removed ----
639 ع => 45 E   ==remove==> because it is a main letter in many languages and should not be sacrificed for Latin-script languages!
669 ٩ => 41 A   ==remove==> because it is a main digit in many languages and should not be sacrificed for Latin-script languages!

----------- all Arabic and Persian digits should be converted to Latin digits ------------------
6F0 ۰ => 30 0
660 ٠ => 30 0
6F1 ۱ => 31 1
661 ١ => 31 1
6F2 ۲ => 32 2
662 ٢ => 32 2
6F3 ۳ => 33 3
663 ٣ => 33 3

6F7 ۷ => 37 7
667 ٧ => 37 7

6F8 ۸ => 38 8
668 ٨ => 38 8

6F9 ۹ => 39 9
669 ٩ => 39 9
9ED ৭ => 39 9
A6A ੪ => 38 8
Huji renamed this task from "Add two more characters to Equivset" to "Enhance Equivset with regard to Persian/Arabic characters". Dec 16 2018, 7:40 PM
Restricted Application added a subscriber: alanajjar. Dec 16 2018, 7:40 PM
Huji added a comment. Dec 16 2018, 7:43 PM

Regarding the two you want removed: I don't think it matters. If you always compare ccnorm(this) like ccnorm(that), the characters will be replaced similarly on both sides of the equation and things will work even if you were replacing ع with E, for instance. Removing them, however, will break AbuseFilters that depend on them in English Wikipedia and similar wikis.

You mean that to match the word معین we should write مEین!
And for ۰۹۱۲۱۱۲۳۴۳۴ we should write ۰A۱۲۱۱۲۳۴۳۴! That is so strange.

Huji added a comment. Edited Dec 16 2018, 9:52 PM

No. I mean that if 'معین' is a word you intend to look for in new_wikitext, then you should not write ccnorm(new_wikitext) rlike 'معین'; you should write ccnorm(new_wikitext) rlike ccnorm('معین').
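To make the difference between the two forms concrete, here is a small Python sketch. It is an illustration under stated assumptions: `toy_ccnorm` and `TOY_EQUIVSET` are hypothetical stand-ins that contain only three mappings quoted in this task, not the real Equivset table or the real ccnorm implementation, and `re.search` stands in for rlike.

```python
import re

# Toy table: only entries quoted in this task, NOT the real Equivset data.
TOY_EQUIVSET = {
    "\u0639": "E",       # 639 => 45 E
    "\u06CC": "\u0649",  # 6CC => 649
    "\u064A": "\u0649",  # 64A => 649
}


def toy_ccnorm(text):
    """Replace each character by its toy equivalent; leave other characters unchanged."""
    return "".join(TOY_EQUIVSET.get(ch, ch) for ch in text)


def one_sided(text, word):
    # Mirrors: ccnorm(new_wikitext) rlike 'word'
    return re.search(re.escape(word), toy_ccnorm(text)) is not None


def two_sided(text, word):
    # Mirrors: ccnorm(new_wikitext) rlike ccnorm('word')
    return re.search(re.escape(toy_ccnorm(word)), toy_ccnorm(text)) is not None


wikitext = "\u0645\u0639\u06CC\u0646"        # the word from the comment, with Persian yeh (6CC)
arabic_variant = "\u0645\u0639\u064A\u0646"  # same word spelled with Arabic yeh (64A)

print(one_sided(wikitext, wikitext))        # raw word no longer appears in normalized text
print(two_sided(wikitext, wikitext))        # both sides normalized the same way
print(two_sided(wikitext, arabic_variant))  # variant spelling also matches
```

In this toy setup the one-sided form fails because the normalized text no longer contains the raw letters, while the two-sided form matches both spellings without the filter author ever typing the intermediate "E" form by hand.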

Huji added a comment. Edited Dec 16 2018, 9:57 PM

Essentially, the ideal way for writing complex patterns while using ccnorm is something like this:

pattern := ccnorm('this') + '|' + ccnorm('that');

ccnorm(new_wikitext) rlike pattern

Besides the ability to use ccnorm properly, it also has two additional advantages: (1) you can break your pattern into several lines and make your code more readable; and (2) because you can break your code into multiple lines, BIDI issues will be less frequent in the editor and much easier to handle.

Here is a tidier version that you can test in Special:AbuseFilter/test:

pattern := ccnorm('this')
  + '|'
  + ccnorm('that');

ccnorm('i have that only') rlike pattern
Yamaha5 added a comment. Edited Dec 17 2018, 12:55 AM


We have many abuse filters that contain more than 100 words and combinations. Do you mean we should repeat ccnorm 100 more times?
Every language, like English, has a huge list of offensive words like this, so this idea doesn't work.
Also, your idea produces false positives. For example:
we want to catch "۹ام": --> pattern := ccnorm("۹ام")
If the text contains "Aام", your approach will catch it too, even though we only wanted "۹ام".
Likewise, pattern := ccnorm("Eام") will catch عام.
Why do you want to keep the wrong code and cause a lot of spaghetti code on other wikis (ar, fa, ur, zab, glk, ...)? Why should we write ccnorm many more times than needed? (Because some English abuse filters might break? If there even are many.)
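The false-positive concern above can be reproduced with a toy model. This Python sketch makes several assumptions: `toy_ccnorm`, `CURRENT_TOY` and `PROPOSED_TOY` are hypothetical names, the first table holds only the two mappings this task asks to remove, the second holds only the digit mapping proposed in the list earlier in the task, and the digit used is U+0669 because that is the code point the quoted mapping names.

```python
# Toy tables built only from mappings quoted in this task, NOT the real Equivset data.
CURRENT_TOY = {
    "\u0639": "E",  # 639 => 45 E (asked to be removed)
    "\u0669": "A",  # 669 => 41 A (asked to be removed)
}
PROPOSED_TOY = {
    "\u0669": "9",  # 669 => 39 9 (proposed replacement)
}


def toy_ccnorm(text, table):
    """Replace each character by its equivalent in the given toy table."""
    return "".join(table.get(ch, ch) for ch in text)


def collide(a, b, table):
    """True when two different strings become identical after normalization."""
    return a != b and toy_ccnorm(a, table) == toy_ccnorm(b, table)


digit_word = "\u0669\u0627\u0645"   # Arabic-Indic nine followed by two letters
latin_a_word = "A\u0627\u0645"      # Latin A followed by the same two letters
ain_word = "\u0639\u0627\u0645"     # ain followed by the same two letters
latin_e_word = "E\u0627\u0645"      # Latin E followed by the same two letters

print(collide(digit_word, latin_a_word, CURRENT_TOY))
print(collide(ain_word, latin_e_word, CURRENT_TOY))
print(collide(digit_word, latin_a_word, PROPOSED_TOY))
```

Under the current-style toy table the digit and the Latin letter collapse to the same normalized string, so a two-sided ccnorm comparison cannot tell them apart; under the proposed digit mapping they stay distinct.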

Change 479970 merged by jenkins-bot:
[mediawiki/libs/Equivset@master] Add two more characters to Equivset

https://gerrit.wikimedia.org/r/479970