Figure out improved matching of monuments for Iran
Closed, Resolved (Public)

Description

In T138377 we added Iran in Farsi to the monuments database. While adding it and generating https://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:LilyOfTheWest/Unused_images we ran into an interesting problem: numbers on the Farsi Wikipedia are written in a different script. Most languages write numbers with the digits 0123456789, but Farsi uses its own digits (for example ۰ for 0). The matching for the unused images page is string-based: the id in the list is matched against the sort key of the images in https://commons.wikimedia.org/wiki/Category:Cultural_heritage_monuments_in_Iran_with_known_IDs , so the same number written in two different scripts does not match.

Probably best solution:

The unused images page is still huge so no real rush to fix this. When it gets smaller this becomes more interesting because it might turn up more hits.

Event Timeline

Multichill moved this task from Backlog to Nice to have on the Wiki-Loves-Monuments-Database board.

If someone could provide a link to a list using the Farsi numbers (or even better a mixture) then I'll take a look.

The same goes for an example of a Commons image with Farsi numbers.

@mahmoud @Multichill @LilyOfTheWest
Are the ids guaranteed to be integers (in whichever script) or can it sometimes include other characters (or start with a 0 or ۰)?

Wondering since:

>>> n = u'۱۲۳۴۵۶۷۸۹۰'
>>> int(n)
1234567890

Which would then make fixing the import part easy. Sadly Lua is not as clever so fixing the Commons part would require some string substitution.
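To make "string substitution" concrete, here is a minimal sketch of a digit-by-digit replacement, written in Python for illustration even though the Commons side would need it in Lua. All names (`to_ascii_digits`, `DIGIT_TABLE`, and so on) are hypothetical and not taken from any existing module. It assumes that only Persian (U+06F0 to U+06F9) and Arabic-Indic (U+0660 to U+0669) digits need handling.

```python
# Illustrative only: names below are hypothetical, not taken from any existing module.
# Digits are built from code points to avoid bidi copy-paste problems.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))  # U+06F0..U+06F9
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))   # U+0660..U+0669
ASCII_DIGITS = "0123456789"

# One translation table covering both scripts.
DIGIT_TABLE = str.maketrans(PERSIAN_DIGITS + ARABIC_DIGITS, ASCII_DIGITS * 2)


def to_ascii_digits(text):
    """Replace Persian and Arabic-Indic digits with 0-9; leave everything else alone."""
    return text.translate(DIGIT_TABLE)
```

Unlike `int()`, this never fails on non-numeric input and keeps leading zeros, which matters if the value is later compared as a string.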

They are guaranteed to be integers, and they cannot start with 0 or ۰.

Change 301333 had a related patch set uploaded (by Lokal Profil):
Add integer script converter and integer checker

https://gerrit.wikimedia.org/r/301333

This one "normalizes" integers on the list side of things.
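For readers without access to the patch, the idea could be sketched roughly as below. This is not the code from change 301333. The function names `is_int` and `int_to_ascii` are made up for illustration, and the sketch assumes ids are plain integers with no leading zero, as confirmed above.

```python
def is_int(value):
    """Return True if value consists only of decimal digits, in any script."""
    text = str(value).strip()
    # str.isdecimal() is True for Unicode decimal digits (e.g. Persian ones)
    # and False for the empty string or anything containing other characters.
    return text.isdecimal()


def int_to_ascii(value):
    """Return integer ids written with 0-9; return other values unchanged."""
    text = str(value).strip()
    if not text.isdecimal():
        return text  # hypothetical policy: leave non-integer ids alone
    # int() accepts decimal digits from any script; str() writes them as 0-9.
    return str(int(text))
```

Note that going through `int()` also drops leading zeros, which is harmless only because the ids are said never to start with 0 or ۰.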

I made a stab at a Lua module to be used in {{Cultural Heritage Iran}} over at Module:NumberScripts.

My test cases should work, but I'd be happy for an extra pair of eyes on it before I implement it in the template.

P.S. I hate Lua so much right now.

Change 301333 merged by jenkins-bot:
Add integer script converter and integer checker

https://gerrit.wikimedia.org/r/301333

Reviewed the Python, looks good! I'll take a stab at reviewing that Lua script too, but I doubt I'll come out unscathed.

The Lua looks pretty good, too, but @Lokal_Profil is there a reason the Arabic numbers are written rtl while Persian is ltr? For all intents and purposes, to the best of my knowledge, both languages have identical number semantics.

Thanks!

The rtl/ltr discrepancy is entirely due to my browser switching direction as soon as I copy-pasted in an Arabic character. I would have preferred the list order to be 1-0 and large in both cases but well...

If you want to try and fix it you should be able to run the tests (i.e. preview changes on the discussion page belonging to the test sub-page) to see if it worked. Otherwise I'll try to activate it next time I'm at a computer.

Lua part deployed. It will take a day or two before the bot picks this up though.

If it is still causing trouble then we should look into the zero-padding of the sort key, which might also mess with things.
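If zero-padding does turn out to be a problem, one possible workaround is to compare the two sides by numeric value rather than as strings. The sketch below is only an assumption about how that could look. `ids_match` is a hypothetical helper, not part of the existing matching code, and it assumes the sort key holds nothing but the (possibly padded) id.

```python
def ids_match(list_id, sort_key):
    """Compare a list id with an image sort key, ignoring zero-padding and digit script."""
    a, b = list_id.strip(), sort_key.strip()
    if a.isdecimal() and b.isdecimal():
        # int() ignores leading zeros and accepts digits from any script.
        return int(a) == int(b)
    # Fall back to plain string comparison for non-integer values.
    return a == b
```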

Mentioned in SAL [2016-08-01T07:53:01Z] <Lokal_Profil> Deployed latest from Git, 5fe42fe (5fe42fe), 1ec3530, 9a630b5 (T139258)

Mentioned in SAL [2016-08-01T07:54:26Z] <Lokal_Profil> (correction to last line) Deployed latest from Git, 5fe42fe (T111618), 1ec3530, 9a630b5 (T139258)

Looks good to me. We expected the empty list. Thank you! :)

Perfect