mediawiki/libs/Equivset : master | Add two more characters to Equivset
I checked the original file and it has many problems. For example, it maps foo ==> bar and later maps bar ==> blabla.
I have listed the parts of the code that should be changed:
64A ي => 649 ى ==change to==> 649 ى => 64A ي
6AA ڪ => 6A9 ک ==change to==> 6AA ڪ => 643 ك
6CC ی => 649 ى ==change to==> 6CC ی => 64A ي
6D0 ې => 67B ٻ ==change to==> 6D0 ې => 64A ي
FED9 ﻙ => 6A9 ک ==change to==> FED9 ﻙ => 643 ك
FEDA ﻚ => 6A9 ک ==change to==> FEDA ﻚ => 643 ك
---- should remove ----
639 ع => 45 E ==remove==> because it is the main character of many languages and you can not sacrifice them for Latin languages!
669 ٩ => 41 A ==remove==> because it is the main number of many languages and you can not sacrifice them for Latin languages!
----------- all the arabic and persian numbers should convert to Latin numbers -----------
6F0 ۰ => 30 0
660 ٠ => 30 0
6F1 ۱ => 31 1
661 ١ => 31 1
6F2 ۲ => 32 2
662 ٢ => 32 2
6F3 ۳ => 33 3
663 ٣ => 33 3
6F7 ۷ => 37 7
667 ٧ => 37 7
6F8 ۸ => 38 8
668 ٨ => 38 8
6F9 ۹ => 39 9
669 ٩ => 39 9
9ED ৭ => 39 9
A6A ੪ => 38 8
Regarding the two you want removed: I don't think it matters. If you always compare ccnorm(this) like ccnorm(that), the characters will be replaced similarly on both sides of the equation and things will work even if you were replacing ع with E, for instance. Removing them, however, will break AbuseFilters that depend on them in English Wikipedia and similar wikis.
No. I mean that if 'معین' is a word you intend to look for in new_wikitext, then you should not write ccnorm(new_wikitext) rlike 'معین'; you should write ccnorm(new_wikitext) rlike ccnorm('معین').
Essentially, the ideal way for writing complex patterns while using ccnorm is something like this:
pattern := ccnorm('this') + '|' + ccnorm('that');
ccnorm(new_wikitext) rlike pattern
Besides the ability to use ccnorm properly, it also has two additional advantages: (1) you can break your pattern into several lines and make your code more readable; and (2) because you can break your code into multiple lines, BIDI issues will be less frequent in the editor and much easier to handle.
Here is a tidier version that you can test in Special:AbuseFilter/test
pattern := ccnorm('this') + '|' + ccnorm('that');
ccnorm('i have that only') rlike pattern
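To see why normalizing both sides matters, here is a minimal Python sketch that can be run outside of AbuseFilter. The names `toy_ccnorm`, `TOY_EQUIVSET`, and `matches_any` are hypothetical, and the one-entry table is only a stand-in for the ع => E mapping quoted earlier; it is not the real Equivset data or the real ccnorm implementation.

```python
import re

# Toy stand-in for one Equivset entry quoted in this thread:
# U+0639 (Arabic letter ain) => "E". This is NOT the real data file.
TOY_EQUIVSET = {"\u0639": "E"}


def toy_ccnorm(text):
    """Replace each character that has a table entry; leave the rest alone."""
    return "".join(TOY_EQUIVSET.get(ch, ch) for ch in text)


def matches_any(words, text):
    """Normalize every word AND the text, then search for any of the words."""
    pattern = "|".join(re.escape(toy_ccnorm(word)) for word in words)
    return re.search(pattern, toy_ccnorm(text)) is not None


word = "\u0645\u0639\u06cc\u0646"  # an Arabic-script word containing ain
text = "abc " + word + " xyz"

# Raw pattern against normalized text: ain became "E" on one side only.
raw_hit = re.search(re.escape(word), toy_ccnorm(text)) is not None
# Both sides normalized: the substitution is applied to word and text alike.
normalized_hit = matches_any([word, "that"], text)

print(raw_hit, normalized_hit)
```

In this toy model the raw pattern misses the word while the normalized pattern finds it, which is the behavior the two-line filter above relies on.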
We have many abuse filters that contain more than 100 words and combinations. Do you mean we should repeat ccnorm 100 more times?
Every language, like English, has a huge list of offensive words like this, so this idea doesn't work.
Also, your idea produces false positives. For example:
We want to catch "۹ام", so: pattern := ccnorm("۹ام")
But if the text contains "Aام", your approach will match it as if it were "۹ام".
Likewise, pattern := ccnorm("Eام") will match عام.
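The false positive described above can be reproduced with a small Python sketch. It assumes a toy table holding only the two mappings this thread asks to remove (U+0669 => A and U+0639 => E); `toy_ccnorm`, `TOY_EQUIVSET`, and `normalized_match` are hypothetical names, not part of Equivset or AbuseFilter.

```python
import re

# Toy stand-ins for the two mappings this thread asks to remove:
# U+0669 (Arabic-Indic digit nine) => "A", U+0639 (Arabic letter ain) => "E".
TOY_EQUIVSET = {"\u0669": "A", "\u0639": "E"}


def toy_ccnorm(text):
    """Replace each character that has a table entry; leave the rest alone."""
    return "".join(TOY_EQUIVSET.get(ch, ch) for ch in text)


def normalized_match(word, text):
    """Search for the normalized word inside the normalized text."""
    return re.search(re.escape(toy_ccnorm(word)), toy_ccnorm(text)) is not None


ALEF_MEEM = "\u0627\u0645"  # alef followed by meem

wanted = "\u0669" + ALEF_MEEM   # digit nine followed by alef, meem
lookalike = "A" + ALEF_MEEM     # Latin "A" followed by the same letters

hits_wanted = normalized_match(wanted, wanted)
hits_lookalike = normalized_match(wanted, lookalike)  # the false positive

# Same effect in the other direction: a pattern with Latin "E" matches ain.
hits_ain = normalized_match("E" + ALEF_MEEM, "\u0639" + ALEF_MEEM)

print(hits_wanted, hits_lookalike, hits_ain)
```

Under these assumptions all three checks match, because after normalization the digit and the Latin letter are indistinguishable.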
Why do you want to keep the wrong code and force a lot of spaghetti code on other wikis (ar, fa, ur, zab, glk, ...)? Why should we write ccnorm many more times than is needed? (Because some English abuse filters might break? If there are even many of them.)