
Chillu letters in Wikidata API
Open, LowPublic

Description

The Wikibase extension API behaves differently from the general MediaWiki API when titles contain invisible characters such as \u200d. The general MediaWiki API removes invisible characters from titles in the query and returns results for the titles without them. For example, see [1].

The Wikibase extension API does not remove these characters, so it returns different results for queries with and without them. For example, see [2] and [3].

I think it would be useful for the general API and its extension to follow a single policy on these characters.

[1] http://ml.wikipedia.org/w/api.php?action=query&format=json&prop=langlinks&lllimit=500&titles=%E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B4%B0%E0%B5%8D%E2%80%8D
[2] http://www.wikidata.org/w/api.php?action=wbgetentities&format=json&props=sitelinks&sites=mlwiki&titles=%E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B4%B0%E0%B5%8D%E2%80%8D
[3] http://www.wikidata.org/w/api.php?action=wbgetentities&format=json&props=sitelinks&sites=mlwiki&titles=%E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B5%BC
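
To make the difference between the titles in [2] and [3] visible, here is a minimal sketch that decodes both percent-encoded titles and lists their code points. It only inspects the strings from the links above and makes no request to either API; the helper name `code_points` is made up for this illustration.

```python
import unicodedata
from urllib.parse import unquote

# Percent-encoded titles copied from links [2] and [3] above.
TITLE_WITH_ZWJ = "%E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B4%B0%E0%B5%8D%E2%80%8D"
TITLE_ATOMIC = "%E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B5%BC"


def code_points(encoded_title):
    """Decode a percent-encoded UTF-8 title into a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in unquote(encoded_title)]


with_zwj = code_points(TITLE_WITH_ZWJ)
atomic = code_points(TITLE_ATOMIC)

# [2] ends in RA + VIRAMA + ZERO WIDTH JOINER (U+0D30 U+0D4D U+200D).
print(with_zwj)
# [3] ends in a single code point, U+0D7C.
print(atomic)

# Standard Unicode normalization does not make the two spellings equal,
# so any unification has to be done explicitly.
same_after_nfc = (
    unicodedata.normalize("NFC", unquote(TITLE_WITH_ZWJ))
    == unicodedata.normalize("NFC", unquote(TITLE_ATOMIC))
)
print(same_after_nfc)
```

The two titles share their first four code points and differ only in how the final letter is encoded.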


Version: unspecified
Severity: normal
Whiteboard: aklapper-moreinfo

Details

Reference
bz51326

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:54 AM
bzimport set Reference to bz51326.
bzimport added a subscriber: Unknown Object (MLST).

Correction of bug description.

The problem is caused not by invisible symbols, but by a special type of letter in the Malayalam language called "chillu". See [1].

The Wikipedia API can convert them to a normal form, and the Wikidata API cannot.

[1] https://en.wikipedia.org/wiki/Malayalam_alphabet#Chillus_in_Unicode

I wonder if this is in fact ZWJ and ZWNJ... They are used in a really crappy way at the end of strings, and there they are stripped, if I remember correctly. They should probably be left there, at least for Malayalam.
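
As a rough illustration of why stripping is lossy here, the sketch below removes trailing ZWJ/ZWNJ from the legacy spelling of the title in this report. Whether either API actually strips titles this way is not confirmed in this thread; `strip_trailing_joiners` is a hypothetical helper, not MediaWiki or Wikibase code.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def strip_trailing_joiners(title):
    """Hypothetical helper: drop any ZWJ/ZWNJ at the end of a title."""
    return title.rstrip(ZWJ + ZWNJ)


# The two spellings of the title from this report.
legacy = "\u0d28\u0d35\u0d02\u0d2c\u0d30\u0d4d\u200d"  # ends in RA + VIRAMA + ZWJ
atomic = "\u0d28\u0d35\u0d02\u0d2c\u0d7c"              # ends in U+0D7C

stripped = strip_trailing_joiners(legacy)

# The stripped title ends in RA + VIRAMA, which is a third spelling:
# it matches neither the legacy form nor the atomic chillu form.
print(stripped == legacy, stripped == atomic)
```

Under this assumption, plain stripping yields a string that matches neither stored form, which fits the suggestion to leave the joiner in place for Malayalam.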

The new encoding in Unicode does not have this problem; it affects only the old, faultily encoded strings from before 5.1 (and legacy systems... and possibly legacy fingers).

To clarify: it is not the new chillu letters but the ZWJ/ZWNJ used to encode those letters from before Unicode 5.1.
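
A minimal sketch of the kind of conversion discussed here: rewriting the pre-5.1 sequences (base consonant + virama + ZWJ) to the atomic chillu code points added in Unicode 5.1. The table covers only five common chillus and is an assumption of this example; `to_atomic_chillu` is a hypothetical name and this is not the code either API uses.

```python
VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER

# Legacy sequence -> atomic chillu letter (subset, assumed for illustration).
LEGACY_TO_ATOMIC = {
    "\u0d23" + VIRAMA + ZWJ: "\u0d7a",  # NNA -> CHILLU NN
    "\u0d28" + VIRAMA + ZWJ: "\u0d7b",  # NA  -> CHILLU N
    "\u0d30" + VIRAMA + ZWJ: "\u0d7c",  # RA  -> CHILLU RR
    "\u0d32" + VIRAMA + ZWJ: "\u0d7d",  # LA  -> CHILLU L
    "\u0d33" + VIRAMA + ZWJ: "\u0d7e",  # LLA -> CHILLU LL
}


def to_atomic_chillu(text):
    """Replace each legacy chillu sequence with its atomic code point."""
    for legacy_seq, atomic_char in LEGACY_TO_ATOMIC.items():
        text = text.replace(legacy_seq, atomic_char)
    return text


legacy_title = "\u0d28\u0d35\u0d02\u0d2c\u0d30\u0d4d\u200d"
atomic_title = "\u0d28\u0d35\u0d02\u0d2c\u0d7c"

# After conversion both spellings compare equal.
print(to_atomic_chillu(legacy_title) == atomic_title)
```

Applying such a mapping to a title before lookup would let both spellings resolve to the same stored form; text that already uses atomic chillus passes through unchanged.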

[replacing wikidata keyword by adding CC - see bug 56417]

Is this still a problem? If so can you please provide links and cases where this causes problems?

emaus: Is this still a problem? If so can you please provide links and cases where this causes problems?

Andre Klapper: the second link in my initial post still does not work. The Wikipedia API processes both representations of chillu letters, while the Wikidata API processes only the one that does not contain ZWJ. In the examples above, the Wikidata API does not process the title %E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B4%B0%E0%B5%8D%E2%80%8D but does process %E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B5%BC, even though they represent the same word.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

Even though this seems to be the same word, the last (visible) letter seems to be a completely different one. One version links correctly to the right item; is the other version something like a disambiguation? I guess this should be investigated further by someone who knows the topic.

Lucie set Security to None.