Maniphest T231973

Implement recognition of Persian integers in int function
Open, Low, Public

Assigned To

None

Authored By

	MohammadtheEditor
	Sep 4 2019, 9:13 AM

Description

Hello,
We've been running into issues on Persian Wikipedia where comparisons of two equal numbers ("1111" === "۱۱۱۱") fail, and I believe that int should resolve that.
Checking further, I found that when a Persian number such as "۱۱۱۱" is passed to int, it just returns 0, which is not correct.
Please let me know if this can be implemented. For reference, here's a list of Persian digits and their English equivalents:

Persian Num	English Num
۰	0
۱	1
۲	2
۳	3
۴	4
۵	5
۶	6
۷	7
۸	8
۹	9
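
The table above amounts to a one-to-one character translation. The following is a minimal Python sketch of that mapping, not AbuseFilter code (AbuseFilter itself is written in PHP). The names `PERSIAN_TO_ASCII` and `normalize_persian_digits` are hypothetical and exist only for illustration.

```python
# Hypothetical helper illustrating the table above; not part of AbuseFilter.
# The Persian digits occupy U+06F0..U+06F9, so they are built from code points
# here to avoid any right-to-left display confusion in the source.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
ASCII_DIGITS = "0123456789"

# Translation table: Persian digit -> ASCII digit.
PERSIAN_TO_ASCII = str.maketrans(PERSIAN_DIGITS, ASCII_DIGITS)


def normalize_persian_digits(text: str) -> str:
    """Replace Persian digits with ASCII digits; leave other characters alone."""
    return text.translate(PERSIAN_TO_ASCII)
```

With such a step in front of an integer conversion, `int(normalize_persian_digits("۱۱۱۱"))` and `int("1111")` would compare equal.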

Details

	Subject	Repo	Branch	Lines +/-
	Add missing digits from Bengali/Devanagari/Lao/Thai/Tibetan etc.	mediawiki/libs/Equivset	master	+303 -49
	Map all Arabic-Indic digits to their European number equivalent	mediawiki/libs/Equivset	master	+58 -28


Related Objects

Mentioned In: T212061: Enhance Equivset with regard to Persian/Arabic characters
T255089: Equivset addition CR
Mentioned Here: T173699: AntiSpoof should use language-specific mappings

Event Timeline

MohammadtheEditor created this task. Sep 4 2019, 9:13 AM

Restricted Application added a subscriber: Aklapper. Sep 4 2019, 9:13 AM

Well, I'm unsure about that... "1111" === "۱۱۱۱" should fail, IMHO. I'm not aware of any programming language that considers locale-specific numbers identical (or even equal) to integers. For instance, in PHP, "1111" == "۱۱۱۱" is false as well. I'm also not aware of any function that can handle locale-specific numbers, unless we count things like keeping a huge map of digits for each alphabet.

It's probably just easier to directly use Persian numbers in places where you would expect them to appear.

I think the only solution that we could implement is a separate function to translate numbers, but I'm still unsure.

@Huji thoughts?

@Daimona I think my example was unclear. In Python 3.x, if you pass "۱۱۱۱" to the int function, it yields 1111. Therefore int("1111") == int("۱۱۱۱") is true. I think the int function in AbuseFilter should also be capable of this.
Edit: I double-checked and the effect can be reproduced in Python 3.6.8.
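
For reference, the Python behaviour described above can be checked with a short snippet. It only illustrates what Python's built-in `int` does with Unicode decimal digits; it is not a statement about how AbuseFilter's int is or should be implemented.

```python
# "\u06f1" is the Persian digit one; four of them form the string from this task.
persian = "\u06f1" * 4
latin = "1111"

# Compared as plain strings, the two values differ.
strings_equal = persian == latin

# Python's int() accepts any Unicode decimal digit, so both parse to 1111.
ints_equal = int(persian) == int(latin)

print(strings_equal, ints_equal)  # prints: False True
```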

In T231973#5464374, @MohammadtheEditor wrote:

@Daimona I think my example was unclear. In Python 3.x, if you pass "۱۱۱۱" to the int function, it yields 1111. Therefore int("1111") == int("۱۱۱۱") is true. I think the int function in AbuseFilter should also be capable of this.
Edit: I double-checked and the effect can be reproduced in Python 3.6.8.

Ah well, that could be doable. Although I still don't know of any PHP function which would do that out of the box.

Wouldn't ccnorm() get the job done? As far as I recall, ccnorm('1') == ccnorm('۱') and so forth.

I was wrong. It appears we don't do a good job of mapping Arabic and Persian digits to those used in Latin-based languages like English.

The correct solution is to modify the equivsets such that this mapping is done correctly, and for all digits. (Right now, for instance, we don't have any mapping for ۵ in the equivsets). I can take this on myself.

Change 534969 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/libs/Equivset@master] Map all Arabic-Indic digits to their European number equivalent

https://gerrit.wikimedia.org/r/534969

gerritbot added a project: Patch-For-Review. Sep 8 2019, 6:40 PM

Reedy mentioned this in T255089: Equivset addition CR. Jun 11 2020, 2:05 AM

@MohammadtheEditor can you remind me which filter this was for? Per my comment at T173699#6262543 I think we should just use a set of nested str_replace calls before using ccnorm.

The equivset is meant to handle "visually similar characters". The Persian digits are semantically identical, and MediaWiki already supports that with $digitTransformTable in the language file.

I would suggest that AbuseFilter use Language::parseFormattedNumber or similar in its numeric functions to get the same behaviour (which would be based on the content language of the wiki where AbuseFilter is running), and that this task not be handled in Equivset (the task's tags would need changing if the idea is accepted).
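
To make the content-language idea concrete, here is a rough Python sketch of a per-language digit table that is reversed before parsing. It is only loosely modelled on the idea of $digitTransformTable. The table contents, the language codes used as keys, and the name `parse_localized_int` are all assumptions made for illustration, and the sketch does not reproduce what Language::parseFormattedNumber actually does (for example, it ignores separators and signs).

```python
# Hypothetical per-language tables (ASCII digit -> localized digit).
# These entries are illustrative assumptions, not MediaWiki's real data.
DIGIT_TRANSFORM_TABLES = {
    "fa": {str(i): chr(0x06F0 + i) for i in range(10)},  # Persian digits
    "ar": {str(i): chr(0x0660 + i) for i in range(10)},  # Arabic-Indic digits
}


def parse_localized_int(text: str, lang: str) -> int:
    """Parse a non-negative integer written with the digits of `lang`."""
    table = DIGIT_TRANSFORM_TABLES.get(lang, {})
    # Reverse the table: localized digit -> ASCII digit.
    reverse = str.maketrans({local: ascii_d for ascii_d, local in table.items()})
    normalized = text.translate(reverse)
    # Only digits known to this language (plus ASCII digits) are accepted.
    if not (normalized.isascii() and normalized.isdigit()):
        raise ValueError(f"not an integer in language {lang!r}: {text!r}")
    return int(normalized)
```

Under this design, "۱۱۱۱" parses to 1111 on a wiki whose content language is "fa" but is rejected for a language without a matching table, which is the trade-off of tying the behaviour to the content language.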

Umherirrender mentioned this in T212061: Enhance Equivset with regard to Persian/Arabic characters. Apr 7 2023, 10:33 PM

A couple of people asked for my opinion on this, probably because I'm "the languages guy", so I'll reply, but it will likely be disappointing :)

I don't have a strong opinion here. It's probably more about security than about language, and security and abuse filters are really not my expertise. From what I can understand, there should probably be equivalence between the Arabic (٤) and the Persian (۴) variants of the digits. Should there also be equivalence with the Western European digits (4)? Maybe, less sure. But again, it's more about security than about language. Sorry, I don't have much more to say.
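
As a small illustration of the Arabic versus Persian point, the Unicode character database assigns the same digit value to both variants even though they are distinct code points. The Python sketch below shows this; `same_digit_value` is a hypothetical helper, and whether a filter should treat these characters as equal remains the policy question raised above.

```python
import unicodedata

arabic_four = "\u0664"   # ARABIC-INDIC DIGIT FOUR
persian_four = "\u06f4"  # EXTENDED ARABIC-INDIC DIGIT FOUR
western_four = "4"       # DIGIT FOUR


def same_digit_value(a: str, b: str) -> bool:
    """Hypothetical helper: True if two single characters share a digit value.

    unicodedata.digit raises ValueError for characters without a digit value.
    """
    return unicodedata.digit(a) == unicodedata.digit(b)


# Distinct characters, identical numeric meaning.
print(arabic_four == persian_four)                  # False
print(same_digit_value(arabic_four, persian_four))  # True
print(same_digit_value(persian_four, western_four)) # True
```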

Change 917893 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/libs/Equivset@master] Add missing digits from Bengali/Devanagari/Lao/Thai/Tibetan etc.

https://gerrit.wikimedia.org/r/917893

Change 534969 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Map all Arabic-Indic digits to their European number equivalent

https://gerrit.wikimedia.org/r/534969

Umherirrender unsubscribed. Jul 12 2023, 8:15 PM

Aklapper edited projects, added Patch-Needs-Improvement; removed Patch-For-Review. Sep 17 2023, 4:04 PM

@Huji: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action... → Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!
