
Unicode characters in API output
Closed, Declined (Public)

Description

Author: nejuje6tpztluvolq

Description:
Not sure if this is a bug but for example:

wget -q -O- "http://en.wikipedia.org/w/api.php?action=parse&page=abbé_Prévost&prop=links&format=json"

produces:

{"parse":{"title":"Abb\u00e9 Pr\u00e9vost","links":[{"ns":0,"*":"Antoine Fran\u00e7ois Pr\u00e9vost","exists":""}]}}

According to http://www.fileformat.info/info/unicode/char/e9/index.htm, "\u00e9" is the C/C++/Java source-code escape for the Unicode character é. For an API consumer this means I need to translate the escapes, and I don't have an easy way to do that. Should the API return the character é itself rather than the Java/C++/Python-style escape?

Regards,
GreenC


Version: unspecified
Severity: normal

Details

Reference
bz72734

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:48 AM
bzimport set Reference to bz72734.
bzimport added a subscriber: Unknown Object (MLST).

"\u00e9" is also produced by JavaScript and other ECMAScript implementations. Your JSON decoder should be handling it for you; if you're writing your own JSON decoder, it will need to handle such escapes.

That said, if you supply the utf8 option to format=json,[1] most characters will be returned unescaped. You will still see escapes for certain characters, though, such as double-quote and newline.

[1]: http://en.wikipedia.org/w/api.php?action=parse&page=abbé_Prévost&prop=links&format=json&utf8=1
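
To a decoder the escaped and unescaped outputs are interchangeable. The sketch below uses the ensure_ascii flag of Python's json.dumps purely as an analogy for the API's utf8 option (it says nothing about how MediaWiki implements it) to show that both forms decode to the same string.

```python
import json

title = "Abbé Prévost"

escaped = json.dumps(title)                        # default: non-ASCII becomes \uXXXX
unescaped = json.dumps(title, ensure_ascii=False)  # non-ASCII is emitted as-is

print(escaped)  # "Abb\u00e9 Pr\u00e9vost"
print(json.loads(escaped) == json.loads(unescaped) == title)  # True
```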

(In reply to Brad Jorsch from comment #1)

> "\u00e9" is also produced by JavaScript and other ECMAScript
> implementations.

Brad, I think you meant "correctly parsed" instead of "produced". JavaScript's JSON.stringify() won't escape that character.

> Your JSON decoder should be handling it for you; if you're
> writing your own JSON decoder, it will need to handle such escapes.

Yes, see http://tools.ietf.org/html/rfc7159#section-7.
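
One detail from that section matters for anyone writing their own decoder: a character outside the Basic Multilingual Plane is escaped as a UTF-16 surrogate pair, that is, two consecutive \uXXXX sequences that must be combined into one code point. A small sketch using the RFC's own G clef example, with Python's json module standing in for a conforming decoder:

```python
import json

# G clef (U+1D11E), the example given in RFC 7159 section 7.
decoded = json.loads(r'"\uD834\uDD1E"')

print(len(decoded))       # 1
print(hex(ord(decoded)))  # 0x1d11e
```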

nejuje6tpztluvolq wrote:

I'm using a language (awk) with no native Unicode or JSON support, so I found I needed to pipe the output through the Unix utility iconv, e.g.:

echo '\u00E9' | iconv -f java
é

However, &utf8=1 is awesome. It saves me from calling the external program above.

The link to ietf.org is helpful. I tried &utf8=1 with the Wikipedia article titled "300" (the title includes the quotes):

wget -q -O- "http://en.wikipedia.org/w/api.php?action=parse&page="'"'"300"'"'"&prop=links&format=json&utf8=1"

produces

{"parse":{"title":"\"300\"","links": etc..

So the double quote is escaped with a plain backslash (\") rather than in the \uXXXX form. That should make life easier.
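
For comparison, a JSON decoder treats the backslash form and the \uXXXX form of a double quote as the same thing. A short sketch (Python's json module, chosen only for illustration) decoding just the title value shown above:

```python
import json

# The title value exactly as it appears in the response above.
title = json.loads(r'"\"300\""')
print(title)       # "300"
print(len(title))  # 5

# The quote may also be written as a \u escape; both decode identically.
print(json.loads(r'"\u0022300\u0022"') == title)  # True
```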

GreenC
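
The nested shell quoting in the wget command for "300" can be avoided by letting a library percent-encode the title. A minimal sketch, assuming Python's urllib.parse is available; the parameter names are the ones used in the requests above, and nothing here is specific to MediaWiki:

```python
from urllib.parse import urlencode

params = {
    "action": "parse",
    "page": '"300"',  # the title includes the double quotes
    "prop": "links",
    "format": "json",
    "utf8": 1,
}
url = "http://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
# http://en.wikipedia.org/w/api.php?action=parse&page=%22300%22&prop=links&format=json&utf8=1
```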