
SVG image wikisyntax can't use "lang=zh-hant"
Open, Needs Triage, Public

Description

SVG image wikisyntax can't use "lang=zh-hant". For example:
https://commons.wikimedia.org/wiki/File:First_Ionization_Energy.svg

In the SVG, the systemLanguage values include en, fr, de, zh-hans, zh-hant, etc.

[[File:First Ionization Energy.svg|thumb|lang=en]] displayed in English;
[[File:First Ionization Energy.svg|thumb|lang=fr]] displayed in French;
[[File:First Ionization Energy.svg|thumb|lang=zh-hans]] displayed in Simplified Chinese;
but [[File:First Ionization Energy.svg|thumb|lang=zh-hant]] is incorrectly displayed in Simplified Chinese; it should be displayed in Traditional Chinese.

see test case:

PS: As in T154132, "lang=zh-Hant" doesn't work either:

[[File:First Ionization Energy.svg|thumb|lang=zh-Hant]] is incorrectly displayed in English

Event Timeline

Shizhao created this task. Dec 28 2016, 2:34 PM
Restricted Application added subscribers: Cosine02, Aklapper. Dec 28 2016, 2:34 PM
Shizhao updated the task description. Dec 28 2016, 2:36 PM
Shizhao updated the task description. Dec 28 2016, 2:50 PM
Base added a subscriber: Base. Dec 28 2016, 2:54 PM

This is the same problem as in T125710. The language code from the SVG is in BCP 47 format, and this value is set as the environment variable LANG. However, LANG must be in the POSIX format, which has a different syntax. rsvg tries to convert this value back to BCP 47 format and compares it with the languages in the SVG. As far as I know, it uses the first matching value, and therefore it uses zh-hans. If the order in the SVG were different, another language might be selected.

The only way to solve this problem is to separate the language variables. The LANG variable must be in POSIX format and should be used only for system messages in rsvg. rsvg should get a new variable or parameter for selecting the language for the SVG.

Glrx added subscribers: PhiLiP, Glrx. Edited Dec 28 2016, 10:31 PM

There is a problem, but it is not with wikisyntax but rather with librsvg. See T154132, which shows that different PNG filenames are generated for zh-hans and zh-hant, so the implication is that librsvg mishandled the lang argument. @Fomafix is correct that this is also discussed at T125710, but I don't know enough about librsvg to assess the details.

IIRC, SVG.php notices there is a lang param and sets the LANG environment variable before calling librsvg.

That should be a typecheck violation because SVG.php is putting an IETF langtag string (which uses hyphens) into something that should be a Unix locale string (which uses underscores and may have other dirt at the end).

But there's also a typecheck violation inside librsvg because it, from what I've seen, plays fast and loose with langtag strings and locale strings.

The two may balance out.

Sometime back, @PhiLiP submitted a patch for librsvg:
https://bugzilla.gnome.org/attachment.cgi?id=320316&action=diff

That patch is ill-advised, but it also shows that the unpatched librsvg is improperly matching langtags. Look at the left side of the diff around rsvg-cond.c lines 105 to 111. The code does an improper substring match. Let a = "zh-Hans" and b = "zh-Hant". The strings should not match, but the code sees a hyphen at position 2 in b, so at line 111 it does a case-insensitive comparison only up to the hyphen, sees that "zh" equals "zh", and erroneously declares that the two match. The matching algorithm does not follow the rules laid out in the SVG 1.1 specification. I gave an example of what the code should be, but I can barely use Phabricator and don't know Bugzilla or even have an account there.
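To make the difference concrete, here is a rough Python sketch (not librsvg's C code). `buggy_match` is a hypothetical reconstruction of the comparison as described above, not a copy of the librsvg source. `svg11_match` follows the SVG 1.1 systemLanguage rule as I understand it: a user language matches an entry if it equals the entry, or if it equals a prefix of the entry that is immediately followed by "-".

```python
def buggy_match(a: str, b: str) -> bool:
    """Hypothetical reconstruction of the flawed comparison described above."""
    hyphen = b.find("-")
    if hyphen != -1:
        # Only the part before the first hyphen is compared, so the
        # script subtag (Hans vs. Hant) is never looked at.
        return a[:hyphen].lower() == b[:hyphen].lower()
    return a.lower() == b.lower()


def svg11_match(user_lang: str, system_language: str) -> bool:
    """Match one user language against a comma-separated systemLanguage value."""
    user = user_lang.lower()
    for entry in system_language.split(","):
        entry = entry.strip().lower()
        # Exact match, or the user language is a prefix followed by "-".
        if entry == user or entry.startswith(user + "-"):
            return True
    return False


print(buggy_match("zh-Hans", "zh-Hant"))   # True (the erroneous match)
print(svg11_match("zh-Hant", "zh-Hans"))   # False
print(svg11_match("zh", "zh-Hant"))        # True
```

Under the prefix rule a bare "zh" preference still matches "zh-Hant", but "zh-Hant" never matches "zh-Hans".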

librsvg doing an improper hyphen match explains why @Shizhao only saw zh-Hans when he asked for zh-Hant.

librsvg may do what we want it to do with the LANG environment variable. I suspect it converts the underscores to hyphens and may even strip off the Unicode charset locale info, but the wiki feeds it a string that already has hyphens, so librsvg does not have to do any character manipulation. The matching bug could also be explained with a="zh_Hans" and b="zh-Hant", but that would mean librsvg took the input langtag and changed its hyphen to an underscore. That just seems implausible, but I have not examined the librsvg code beyond looking at PhiLiP's patch diff.

Right now, I think this issue is just an upstream librsvg bug. It's dirty, but I doubt that MediaWiki needs to convert the langtag to a UNIX/POSIX locale string for librsvg even though that would make the LANG environment variable the proper type according to the operating system specification. Moreover, feeding librsvg a locale string would not fix librsvg's langtag matching bug.

Ideally, librsvg should take a command line argument that is an acceptLanguages preference string. The SVG systemLanguage matching algorithm is more sophisticated than a single user agent language. With SVG 2.0 allowReorder-style processing, we could have language-dependent preferences. If the image does not have uk, then it might be better to fall back to ru rather than en.
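As an illustration of what an ordered preference list buys us, here is a minimal Python sketch. `pick_language` and the preference list are hypothetical; librsvg has no such interface as far as this thread shows. It simply walks the preferences in order and returns the first available language that matches under the SVG 1.1 prefix rule.

```python
def pick_language(preferences, available):
    """Return the first available tag matched by the ordered preferences, else None."""
    for pref in preferences:
        pref = pref.lower()
        for tag in available:
            candidate = tag.lower()
            # Exact match, or the preference is a prefix followed by "-".
            if candidate == pref or candidate.startswith(pref + "-"):
                return tag
    return None


# Hypothetical preferences for a Ukrainian reader: uk first, then ru, then en.
print(pick_language(["uk", "ru", "en"], ["en", "ru", "de"]))  # ru
```

With a single language there is no way to express "ru before en"; with a list the fallback order is up to the caller.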

Aklapper moved this task from Backlog to Reported Upstream on the Upstream board. Jul 23 2018, 8:31 AM
Glrx added a comment. Oct 24 2018, 10:20 PM

The good news: Gnome #131 and Gnome #256 have been closed.

https://gitlab.gnome.org/GNOME/librsvg/blob/master/rsvg_internals/src/cond.rs

The possibly bad news: the fix uses a library that may want locale strings and may throw exceptions for invalid locale strings or invalid langtags (e.g., "ru-1"). That will have to be checked.

Previously, librsvg figured out the agent's language by editing the LANG environment variable to make it look like a langtag. Now it will use a locale library to do the langtag matching against a locale.

In the past (before Thumbor), MW setenv LANG to $lang and then exec'd librsvg. See https://doc.wikimedia.org/mediawiki-core/master/php/SvgHandler_8php_source.html at 319. I suspect Thumbor does the same thing.

We will need to check if that will still work. Will from_unix("en-US") work? Otherwise, MW will need to convert the $lang langtag (e.g., "en-US") into a locale string (e.g., "en_US").

Or we wait until librsvg accepts the preferred languages as a list of langtags:

https://gitlab.gnome.org/GNOME/librsvg/issues/356 "Provide a way to specify the user's preferred languages"
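If MW does end up converting, the conversion could look roughly like the Python sketch below. `langtag_to_locale` is a hypothetical helper, not existing MediaWiki or librsvg code. It assumes only the plain "language" and "language-REGION" shapes have an obvious locale spelling and returns None for everything else (script subtags, malformed tags such as "ru-1"), so the caller has to decide what to do in those cases.

```python
import re

# Only "language" or "language-REGION", e.g. "fr" or "en-US".
_SIMPLE_TAG = re.compile(r"([A-Za-z]{2,3})(?:-([A-Za-z]{2}|[0-9]{3}))?")


def langtag_to_locale(langtag):
    """Convert a simple langtag to a locale-style string, or return None."""
    m = _SIMPLE_TAG.fullmatch(langtag)
    if m is None:
        # No obvious locale equivalent (script subtag, variant, invalid tag).
        return None
    language, region = m.group(1).lower(), m.group(2)
    return language if region is None else f"{language}_{region.upper()}"


print(langtag_to_locale("en-US"))    # en_US
print(langtag_to_locale("zh-Hant"))  # None
print(langtag_to_locale("ru-1"))     # None
```

Note that zh-Hant, the case this task is about, falls outside the simple shapes.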

Shizhao updated the task description. Oct 25 2018, 7:24 AM

SVG uses language codes from BCP 47, and MediaWiki uses language codes from BCP 47 as well (at least there is a mapping from the internal MediaWiki language code to a BCP 47-conformant language code). The Unix system environment variable LANG contains a language code following a different standard. It is not possible to convert a BCP 47 language code to a Unix language code and back to a BCP 47 language code without losing information. Therefore the language code must not be transferred from MediaWiki to librsvg via the environment variable LANG. librsvg needs a separate parameter with a BCP 47 language code.
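A small Python sketch of the information loss, under the assumption of a naive language_REGION mapping. Both helpers are hypothetical and only illustrate the round trip; a smarter mapping could encode the script in a locale modifier, but there is no single agreed spelling for that.

```python
def naive_to_posix(langtag):
    """Keep the language and the first 2-letter region subtag; drop the rest."""
    subtags = langtag.split("-")
    language = subtags[0].lower()
    region = next(
        (s.upper() for s in subtags[1:] if len(s) == 2 and s.isalpha()), None
    )
    return language if region is None else f"{language}_{region}"


def naive_from_posix(locale):
    """Strip codeset and modifier, then turn the underscore back into a hyphen."""
    base = locale.split(".")[0].split("@")[0]
    return base.replace("_", "-")


for tag in ["en-US", "zh-Hans", "zh-Hant"]:
    print(tag, "->", naive_from_posix(naive_to_posix(tag)))
```

Here en-US survives the round trip, while zh-Hans and zh-Hant both come back as plain zh, so the two scripts can no longer be told apart.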

For the record, https://gitlab.gnome.org/GNOME/librsvg/issues/256 is fixed in librsvg2-2.44.7.