
Enhance Equivset with regard to Persian/Arabic characters
Closed, ResolvedPublic

Description

Please add these two characters:

ۍ U+06CD --> ی
ݸ U+0778 --> و

Code:

6CD ۍ => 649 ى

778 ݸ => 648 و
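
For illustration, once these two mappings are merged, a comparison on normalized text should treat the characters as identical. A minimal sketch in AbuseFilter syntax (hypothetical, assuming ی already normalizes to ى and و stays the canonical form):

/* ۍ (U+06CD) and ی would both normalize to ى; ݸ (U+0778) would normalize to و */
ccnorm('ۍ') == ccnorm('ی')
ccnorm('ݸ') == ccnorm('و')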

Event Timeline

Huji triaged this task as Low priority.
Huji renamed this task from add character to Equivset to Add two more characters to Equivset. Dec 16 2018, 3:14 AM

Change 479970 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/libs/Equivset@master] Add two more characters to Equivset

https://gerrit.wikimedia.org/r/479970

I checked the original file; it has many problems. For example, it maps foo ==> bar and later maps bar ==> blabla.
I have listed the parts of the code that should be changed:

64A ي => 649 ى   == change to ==>   649 ى => 64A ي
6AA ڪ => 6A9 ک   == change to ==>   6AA ڪ => 643 ك
6CC ی => 649 ى   == change to ==>   6CC ی => 64A ي
6D0 ې => 67B ٻ   == change to ==>   6D0 ې => 64A ي
FED9 ﻙ => 6A9 ک   == change to ==>   FED9 ﻙ => 643 ك
FEDA ﻚ => 6A9 ک   == change to ==>   FEDA ﻚ => 643 ك

---- Should remove ----
639 ع => 45 E   == remove ==> because it is a core character of many languages and you cannot sacrifice it for Latin-script languages!
669 ٩ => 41 A   == remove ==> because it is a core digit of many languages and you cannot sacrifice it for Latin-script languages!

---- All the Arabic and Persian digits should be converted to Latin digits ----
6F0 ۰ => 30 0
660 ٠ => 30 0
6F1 ۱ => 31 1
661 ١ => 31 1
6F2 ۲ => 32 2
662 ٢ => 32 2
6F3 ۳ => 33 3
663 ٣ => 33 3

6F7 ۷ => 37 7
667 ٧ => 37 7

6F8 ۸ => 38 8
668 ٨ => 38 8

6F9 ۹ => 39 9
669 ٩ => 39 9
9ED ৭ => 39 9
A6A ੪ => 38 8
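
If these digit mappings were adopted, one normalized pattern would cover both digit scripts. A minimal sketch in AbuseFilter syntax (hypothetical; the mappings above are a proposal here, not necessarily current Equivset behavior):

/* Assuming all Persian and Arabic digits normalize to the ASCII digits 0-9 */
ccnorm('۰۹۱۲۱۱۲۳۴۳۴') == ccnorm('09121123434')
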
Huji renamed this task from Add two more characters to Equivset to Enhance Equivset with regard to Persian/Arabic characters. Dec 16 2018, 7:40 PM

Regarding the two you want removed: I don't think it matters. If you always compare ccnorm(this) like ccnorm(that), the characters will be replaced similarly on both sides of the equation and things will work even if you were replacing ع with E, for instance. Removing them, however, will break AbuseFilters that depend on them in English Wikipedia and similar wikis.

You mean that to match the word معین we should write مEین?!
And for ۰۹۱۲۱۱۲۳۴۳۴ we should write ۰A۱۲۱۱۲۳۴۳۴?! That is so strange.

No. I mean that if 'معین' is a word you intend to look for in new_wikitext, then you should not write it as ccnorm(new_wikitext) rlike 'معین'; you should write it as ccnorm(new_wikitext) rlike ccnorm('معین').

Essentially, the ideal way for writing complex patterns while using ccnorm is something like this:

pattern := ccnorm('this') + '|' + ccnorm('that');

ccnorm(new_wikitext) rlike pattern

Besides the ability to use ccnorm properly, it also has two additional advantages: (1) you can break your pattern into several lines and make your code more readable; and (2) because you can break your code into multiple lines, BIDI issues will be less frequent in the editor and much easier to handle.

Here is a tidier version that you can test in Special:AbuseFilter/test

pattern := ccnorm('this')
  + '|'
  + ccnorm('that');

ccnorm('i have that only') rlike pattern

We have many abuse filters that contain more than 100 words and combinations. You mean we should repeat ccnorm 100 more times?
Every language, English included, has a huge list of offensive words like this, so this idea doesn't work.
Also, your idea produces false positives. For example:
we want to catch "۹ام" --> pattern := ccnorm("۹ام")
if the text contains "Aام", your pattern will catch it instead of "۹ام".
Likewise, pattern := ccnorm("Eام") will catch عام.
Why do you want to keep the wrong mappings and force spaghetti code onto other wikis (ar, fa, ur, zab, glk, ...)? Why should we write ccnorm many more times than needed, just because some English abuse filters might break (if there even are many)?
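
A sketch of the collision being described, in AbuseFilter syntax (hypothetical, assuming the existing 639 ع => 45 E mapping):

/* Both sides are normalized with the same table, so ع and E become indistinguishable */
ccnorm('عام') == ccnorm('Eام')   /* true under that mapping: the pattern cannot tell the words apart */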

Change 479970 merged by jenkins-bot:
[mediawiki/libs/Equivset@master] Add two more characters to Equivset

https://gerrit.wikimedia.org/r/479970

Aklapper removed Huji as the assignee of this task.Jul 2 2021, 5:24 AM
Aklapper added subscribers: Huji, Aklapper.

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)

Umherirrender subscribed.

The reported characters from the task description were added with the mentioned patch set.

There is T231973 to discuss the mapping for Arabic and Persian numbers.