|Resolved||Umherirrender||T27619 Add more characters to ccnorm|
|Resolved||None||T212061 Enhance Equivset with regard to Persian/Arabic characters|
I checked the original file and it has many problems. For example, it maps foo ==> bar and later maps bar ==> blabla.
I have listed the parts of the code that should be changed:
Mappings that should be changed (current, then proposed):
64A ي => 649 ى == change to ==> 649 ى => 64A ي
6AA ڪ => 6A9 ک == change to ==> 6AA ڪ => 643 ك
6CC ی => 649 ى == change to ==> 6CC ی => 64A ي
6D0 ې => 67B ٻ == change to ==> 6D0 ې => 64A ي
FED9 ﻙ => 6A9 ک == change to ==> FED9 ﻙ => 643 ك
FEDA ﻚ => 6A9 ک == change to ==> FEDA ﻚ => 643 ك

Mappings that should be removed:
639 ع => 45 E == remove ==> because it is a main character of many languages, and you cannot sacrifice it for Latin-script languages!
669 ٩ => 41 A == remove ==> because it is a main digit of many languages, and you cannot sacrifice it for Latin-script languages!

All the Arabic and Persian digits should convert to Latin digits:
6F0 ۰ => 30 0
660 ٠ => 30 0
6F1 ۱ => 31 1
661 ١ => 31 1
6F2 ۲ => 32 2
662 ٢ => 32 2
6F3 ۳ => 33 3
663 ٣ => 33 3
6F7 ۷ => 37 7
667 ٧ => 37 7
6F8 ۸ => 38 8
668 ٨ => 38 8
6F9 ۹ => 39 9
669 ٩ => 39 9
9ED ৭ => 39 9
A6A ੪ => 38 8
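The "foo ==> bar and later bar ==> blabla" problem can be checked mechanically. The following is only a rough Python sketch, assuming the mapping table has already been loaded into a plain dict; the function names `find_chained` and `flatten` and the sample data are hypothetical and are not part of Equivset.

```python
def find_chained(mapping):
    """Return (source, target, next_target) for every target that is itself remapped."""
    return sorted(
        (src, dst, mapping[dst])
        for src, dst in mapping.items()
        if dst in mapping and dst != src
    )


def flatten(mapping):
    """Follow each chain to its final target, stopping if a cycle is detected."""
    flat = {}
    for src in mapping:
        seen = {src}
        dst = mapping[src]
        while dst in mapping and dst not in seen:
            seen.add(dst)
            dst = mapping[dst]
        flat[src] = dst
    return flat


# Hypothetical sample mirroring the example in the comment above.
sample = {"foo": "bar", "bar": "blabla", "x": "y"}
chains = find_chained(sample)
flat = flatten(sample)
```

With this sample, `find_chained` reports the single chain through "bar", and `flatten` sends both "foo" and "bar" directly to "blabla", so the order of entries in the file would no longer matter.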
Regarding the two you want removed: I don't think it matters. If you always compare ccnorm(this) like ccnorm(that), the characters will be replaced similarly on both sides of the equation and things will work even if you were replacing ع with E, for instance. Removing them, however, will break AbuseFilters that depend on them in English Wikipedia and similar wikis.
No. I mean that if 'معین' is a word you intend to look for in new_wikitext, then you should not write ccnorm(new_wikitext) rlike 'معین'; you should write ccnorm(new_wikitext) rlike ccnorm('معین').
Essentially, the ideal way for writing complex patterns while using ccnorm is something like this:
pattern := ccnorm('this') + '|' + ccnorm('that'); ccnorm(new_wikitext) rlike pattern
Besides the ability to use ccnorm properly, it also has two additional advantages: (1) you can break your pattern into several lines and make your code more readable; and (2) because you can break your code into multiple lines, BIDI issues will be less frequent in the editor and much easier to handle.
Here is a tidier version that you can test in Special:AbuseFilter/test
pattern := ccnorm('this') + '|' + ccnorm('that'); ccnorm('i have that only') rlike pattern
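To see outside AbuseFilter why normalizing both sides works, here is a small Python simulation. It is only a sketch: `toy_ccnorm` is a hypothetical stand-in that applies just the two mappings quoted in this thread (ع => E and ٩ => A) plus upper-casing, not the real Equivset table, and `build_pattern` and `matches` are illustrative helper names. The Arabic letters are written as escapes to avoid BIDI trouble in the editor.

```python
import re

# Toy table: only the two mappings quoted in this thread, NOT the real Equivset data.
TOY_TABLE = str.maketrans({"\u0639": "E", "\u0669": "A"})


def toy_ccnorm(text):
    """Hypothetical stand-in for ccnorm: map confusable characters, then upper-case."""
    return text.translate(TOY_TABLE).upper()


def build_pattern(words):
    """Normalize every word and join the results into one alternation."""
    return "|".join(re.escape(toy_ccnorm(word)) for word in words)


def matches(words, text):
    """Equivalent in spirit to: ccnorm(text) rlike pattern."""
    return re.search(build_pattern(words), toy_ccnorm(text)) is not None


word = "\u0645\u0639\u06cc\u0646"  # the word 'معین' from the comment above
text = "abc " + word + " xyz"

both_sides = matches([word], text)
# Normalizing only the text removes the letter ain, so the raw word no longer matches.
text_only = re.search(word, toy_ccnorm(text)) is not None
```

In this toy setup `both_sides` is true and `text_only` is false, which is the point made above: the pattern has to pass through the same normalization as the text.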
We have many abuse filters that contain more than 100 words and combinations. Do you mean we should repeat ccnorm 100 more times?
Every language, like English, has a huge list of offensive words like this, so this idea doesn't work.
Also, your idea produces false positives. For example:
we want to catch "۹ام": --> pattern := ccnorm("۹ام")
If the text contains "Aام", your idea will catch it too, even though we only wanted "۹ام".
Likewise, pattern := ccnorm("Eام") will catch عام.
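The false positives described here are collisions: two different strings that normalize to the same form. A rough Python sketch that lists them, assuming a toy table with only the two mappings quoted in this thread (639 ع => E and 669 ٩ => A); `toy_ccnorm` and `collisions` are hypothetical names, and the real ccnorm table is much larger.

```python
from collections import defaultdict

# Toy table: only the two mappings quoted in this thread, NOT the real Equivset data.
TOY_TABLE = str.maketrans({"\u0639": "E", "\u0669": "A"})


def toy_ccnorm(text):
    """Hypothetical stand-in for ccnorm: map confusable characters, then upper-case."""
    return text.translate(TOY_TABLE).upper()


def collisions(candidates):
    """Group candidate strings by normalized form and keep groups with more than one member."""
    groups = defaultdict(list)
    for candidate in candidates:
        groups[toy_ccnorm(candidate)].append(candidate)
    return {norm: members for norm, members in groups.items() if len(members) > 1}


suffix = "\u0627\u0645"  # the letters alef and meem, as in the examples above
candidates = [
    "\u0639" + suffix,  # starts with the letter ain (U+0639)
    "E" + suffix,       # starts with Latin E
    "\u0669" + suffix,  # starts with the Arabic-Indic digit nine (U+0669)
    "A" + suffix,       # starts with Latin A
    "\u0645" + suffix,  # control string that no toy mapping touches
]
found = collisions(candidates)
```

Here `found` has two groups, one pairing the ain string with the Latin E string and one pairing the digit string with the Latin A string, which is exactly the kind of overlap this comment objects to.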
Why do you want to keep the wrong code and cause a lot of spaghetti code at other wikis (ar, fa, ur, zab, glk, ...)? Why should we write ccnorm many more times than needed? (Because some English abuse filters might break? If there are even many of them.)
Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)