Implement recognition of persian integers in int function
Open, Low, Public

Description

Hello,
We've been running into issues on Persian Wikipedia where comparisons of two equal numbers ("1111" === "۱۱۱۱") fail, and I believe that int should resolve that.
Checking further, I found that when a Persian number such as "۱۱۱۱" is passed to int, it just returns 0, which is not correct.
Please let me know if this can be implemented. For reference, here's a list of Persian digits and their English equivalents:

Persian Num    English Num
۰              0
۱              1
۲              2
۳              3
۴              4
۵              5
۶              6
۷              7
۸              8
۹              9

Event Timeline

Well, I'm unsure about that... "1111" === "۱۱۱۱" should fail, IMHO. I'm not aware of any programming language that considers locale-specific numbers identical (or even equal) to integers. For instance, in PHP, "1111" == "۱۱۱۱" is false as well. I'm also not aware of any function that can handle locale-specific numbers, unless we do things like keeping a huge map of numbers for each alphabet.

It's probably just easier to directly use Persian numbers in places where you would expect them to appear.

I think the only solution that we could implement is a separate function to translate numbers, but I'm still unsure.

@Huji thoughts?
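
For illustration, the "separate function to translate numbers" idea could look roughly like the sketch below. It is written in Python rather than AbuseFilter's PHP, and the name translate_persian_digits is hypothetical; nothing like it is claimed to exist in AbuseFilter or MediaWiki. It only covers the ten Persian digits listed in the description.

```python
# Hypothetical helper for illustration only; not an AbuseFilter or MediaWiki API.

# Persian (Extended Arabic-Indic) digits occupy U+06F0..U+06F9, in order 0..9.
# Building the string from code points avoids bidi display confusion in editors.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
ASCII_DIGITS = "0123456789"

# One translation table, built once and reused for every call.
_PERSIAN_TO_ASCII = str.maketrans(PERSIAN_DIGITS, ASCII_DIGITS)


def translate_persian_digits(text: str) -> str:
    """Replace each Persian digit with its ASCII equivalent; leave the rest alone."""
    return text.translate(_PERSIAN_TO_ASCII)


if __name__ == "__main__":
    persian_1111 = "\u06f1" * 4  # the string "۱۱۱۱"
    print(translate_persian_digits(persian_1111) == "1111")
```

A single lookup table like this is the "map of numbers" mentioned above, restricted to one script; supporting more scripts would mean more entries in the table.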

@Daimona I think my example was unclear. In Python 3.x, if you pass "۱۱۱۱" to the int function, it yields 1111; therefore int("1111") == int("۱۱۱۱") is true. I think the int function in AbuseFilter should also be capable of this.
Edit: I double-checked and the effect can be reproduced in Python 3.6.8.
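
A minimal way to reproduce the Python behaviour described in this comment (the variable names are just for the example):

```python
# Python 3: int() accepts Unicode decimal digits, not only ASCII 0-9.
persian = "\u06f1" * 4  # the string "۱۱۱۱" (EXTENDED ARABIC-INDIC DIGIT ONE, four times)
latin = "1111"

# As plain strings the two values differ, because the code points differ.
strings_equal = persian == latin

# After conversion with int(), both are the same integer.
ints_equal = int(persian) == int(latin)

print(strings_equal)  # False
print(ints_equal)     # True
```

This only shows what Python does; whether AbuseFilter's int should behave the same way is the open question in this task.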

Ah well, that could be doable. Although I still don't know of any PHP function which would do that out of the box.

Wouldn't ccnorm() get the job done? As far as I recall, ccnorm('1') == ccnorm('۱') and so forth.

Huji triaged this task as Low priority.
Huji edited projects, added Equivset; removed AbuseFilter.

I was wrong. It appears we don't do a good job of mapping Arabic and Persian digits to those used in Latin-based languages like English.

The correct solution is to modify the equivsets such that this mapping is done correctly, and for all digits. (Right now, for instance, we don't have any mapping for ۵ in the equivsets). I can take this on myself.

Change 534969 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/libs/Equivset@master] Map all Arabic-Indic digits to their European number equivalent

https://gerrit.wikimedia.org/r/534969

@MohammadtheEditor can you remind me which filter this was for? Per my comment at T173699#6262543 I think we should just use a set of nested str_replace calls before using ccnorm.

The equivset is meant to handle "visually similar characters". The Persian digits are semantically identical, and MediaWiki already supports that with $digitTransformTable in the language file.

I would suggest that AbuseFilter use Language::parseFormattedNumber or similar in the numeric AbuseFilter functions to get the same behaviour (based on the content language of the wiki where AbuseFilter is running), and that this task not be handled in Equivset (this would need a tag change on this task, if the idea gets accepted).

A couple of people asked for my opinion on this, probably because I'm "the languages guy", so I'll reply, but it will likely be disappointing :)

I don't have a strong opinion here. It's probably more about security than about language, and security and abuse filters are really not my expertise. From what I can understand, there should probably be equivalence between the Arabic (٤) and the Persian (۴) variants of the digits. Should there also be equivalence with the Western European digits (4)? Maybe, less sure. But again, it's more about security than about language. Sorry, I don't have much more to say.
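
To make the Arabic (٤) versus Persian (۴) distinction concrete, here is a small Python sketch using the standard unicodedata module. It shows that the two characters are different code points with the same decimal value in the Unicode database. The helper normalize_digits is a hypothetical name for this example only, and it says nothing about what AbuseFilter or Equivset should treat as equivalent.

```python
import unicodedata

arabic_four = "\u0664"   # ARABIC-INDIC DIGIT FOUR
persian_four = "\u06f4"  # EXTENDED ARABIC-INDIC DIGIT FOUR


def normalize_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII digit (hypothetical helper)."""
    out = []
    for ch in text:
        # unicodedata.decimal returns the default (None) for non-decimal characters.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)


if __name__ == "__main__":
    # Different characters...
    print(arabic_four == persian_four)
    # ...with the same decimal value in the Unicode database.
    print(unicodedata.decimal(arabic_four), unicodedata.decimal(persian_four))
    # All three variants of "4" collapse to the same ASCII digit.
    print(normalize_digits(arabic_four + persian_four + "4"))
```

Unlike a per-script lookup table, this relies on the Unicode character database, so it also covers digits from other scripts without listing them by hand.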

Change 917893 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/libs/Equivset@master] Add missing digits from Bengali/Devanagari/Lao/Thai/Tibetan etc.

https://gerrit.wikimedia.org/r/917893

Change 534969 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Map all Arabic-Indic digits to their European number equivalent

https://gerrit.wikimedia.org/r/534969

@Huji: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!