
Enhance Equivset with regard to Persian/Arabic characters
Closed, ResolvedPublic

Description

Please add these two characters:

ۍ U+06CD --> ی
ݸ U+0778 --> و

Code:

6CD ۍ => 649 ى

778 ݸ => 648 و
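
For illustration, once these two mappings are merged, a comparison on normalized text should treat the characters as identical. A minimal sketch in AbuseFilter syntax (hypothetical, assuming ی already normalizes to ى and و stays the canonical form):

/* ۍ (U+06CD) and ی would both normalize to ى; ݸ (U+0778) would normalize to و */
ccnorm('ۍ') == ccnorm('ی')
ccnorm('ݸ') == ccnorm('و')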

Event Timeline

Huji triaged this task as Low priority.
Huji renamed this task from add character to Equivset to Add two more characters to Equivset. Dec 16 2018, 3:14 AM

Change 479970 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/libs/Equivset@master] Add two more characters to Equivset

https://gerrit.wikimedia.org/r/479970

I checked the original file; it has many problems. For example, it maps foo ==> bar and later maps bar ==> blabla.
I have listed the parts of the code that should be changed:

64A ي => 649 ى   == change to ==>   649 ى => 64A ي
6AA ڪ => 6A9 ک   == change to ==>   6AA ڪ => 643 ك
6CC ی => 649 ى   == change to ==>   6CC ی => 64A ي
6D0 ې => 67B ٻ   == change to ==>   6D0 ې => 64A ي
FED9 ﻙ => 6A9 ک   == change to ==>   FED9 ﻙ => 643 ك
FEDA ﻚ => 6A9 ک   == change to ==>   FEDA ﻚ => 643 ك

---- Should remove ----
639 ع => 45 E   == remove ==> because it is a core character of many languages and you cannot sacrifice it for Latin-script languages!
669 ٩ => 41 A   == remove ==> because it is a core digit of many languages and you cannot sacrifice it for Latin-script languages!

---- All the Arabic and Persian digits should be converted to Latin digits ----
6F0 ۰ => 30 0
660 ٠ => 30 0
6F1 ۱ => 31 1
661 ١ => 31 1
6F2 ۲ => 32 2
662 ٢ => 32 2
6F3 ۳ => 33 3
663 ٣ => 33 3

6F7 ۷ => 37 7
667 ٧ => 37 7

6F8 ۸ => 38 8
668 ٨ => 38 8

6F9 ۹ => 39 9
669 ٩ => 39 9
9ED ৭ => 39 9
A6A ੪ => 38 8
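
If these digit mappings were adopted, one normalized pattern would cover both digit scripts. A minimal sketch in AbuseFilter syntax (hypothetical; the mappings above are a proposal here, not necessarily current Equivset behavior):

/* Assuming all Persian and Arabic digits normalize to the ASCII digits 0-9 */
ccnorm('۰۹۱۲۱۱۲۳۴۳۴') == ccnorm('09121123434')
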
Huji renamed this task from Add two more characters to Equivset to Enhance Equivset with regard to Persian/Arabic characters. Dec 16 2018, 7:40 PM

Regarding the two you want removed: I don't think it matters. If you always compare ccnorm(this) like ccnorm(that), the characters will be replaced similarly on both sides of the equation and things will work even if you were replacing ع with E, for instance. Removing them, however, will break AbuseFilters that depend on them in English Wikipedia and similar wikis.

You mean that to match the word معین we should write مEین?!
And for ۰۹۱۲۱۱۲۳۴۳۴ we should write ۰A۱۲۱۱۲۳۴۳۴?! That is so strange.

No. I mean that if 'معین' is a word you intend to look for in new_wikitext, then you should not write it as ccnorm(new_wikitext) rlike 'معین'; you should write it as ccnorm(new_wikitext) rlike ccnorm('معین').

Essentially, the ideal way for writing complex patterns while using ccnorm is something like this:

pattern := ccnorm('this') + '|' + ccnorm('that');

ccnorm(new_wikitext) rlike pattern

Besides the ability to use ccnorm properly, it also has two additional advantages: (1) you can break your pattern into several lines and make your code more readable; and (2) because you can break your code into multiple lines, BIDI issues will be less frequent in the editor and much easier to handle.

Here is a tidier version that you can test in Special:AbuseFilter/test

pattern := ccnorm('this')
  + '|'
  + ccnorm('that');

ccnorm('i have that only') rlike pattern

We have many abuse filters that contain more than 100 words and combinations. You mean we should repeat ccnorm 100 more times?
Every language, English included, has a huge list of offensive words like this, so this idea doesn't work.
Also, your idea produces false positives. For example:
we want to catch "۹ام" --> pattern := ccnorm("۹ام")
if the text contains "Aام", your pattern will catch it instead of "۹ام".
Likewise, pattern := ccnorm("Eام") will catch عام.
Why do you want to keep the wrong mappings and force spaghetti code onto other wikis (ar, fa, ur, zab, glk, ...)? Why should we write ccnorm many more times than needed, just because some English abuse filters might break (if there even are many)?
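
A sketch of the collision being described, in AbuseFilter syntax (hypothetical, assuming the existing 639 ع => 45 E mapping):

/* Both sides are normalized with the same table, so ع and E become indistinguishable */
ccnorm('عام') == ccnorm('Eام')   /* true under that mapping: the pattern cannot tell the words apart */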

Change 479970 merged by jenkins-bot:
[mediawiki/libs/Equivset@master] Add two more characters to Equivset

https://gerrit.wikimedia.org/r/479970

Aklapper removed Huji as the assignee of this task.Jul 2 2021, 5:24 AM
Aklapper added subscribers: Huji, Aklapper.

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)

Umherirrender subscribed.

The reported characters from the task description were added with the mentioned patch set.

There is T231973 to discuss the mapping for Arabic and Persian numbers.