
systemLanguage process issues in SVG.php
Open, LowPublic

Description

This is mainly an upstream bug in librsvg (Bug 761482); please see the description of the upstream bug. In a nutshell: this SVG image contains the language options en, en-US, en-AU and en-GB. Only en-US works (by luck); every other language option displays "English (United States)". According to validator.w3.org, the image "System language attribute bug demo.svg" has no errors or warnings.

There are two ways to fix this bug:

  • Wait for librsvg to fix the upstream bug.
  • Workaround: generate a new SVG in which the dash "-" is replaced by an underscore "_" in all BCP 47 language codes, so that librsvg does not fall back to the wrong language. This can be very tricky, since new elements may need to be added to strictly follow the W3C standard for language fallbacks.

There is also a small bug in MediaWiki: it uses the unprocessed values derived from systemLanguage attributes in <select id="mw-imglangselector"> on image pages (the query parameter "lang" expects all-lowercase values, but the SVG standard requires BCP 47 format). As a result, on this SVG image you cannot actually select any language except the all-lowercase "en". This should be a quick fix, and I may create a patch later.
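The lowercase normalization described above can be sketched as follows. MediaWiki itself is PHP; C is used here only to stay consistent with the librsvg snippets later in this thread. The helper name lang_param_from_tag is hypothetical and is not part of MediaWiki or librsvg.

```c
#include <stddef.h>

/* Hypothetical helper: copy a BCP 47 tag such as "en-GB" into `out`
 * (capacity `cap`), lowercasing ASCII letters so the result can be used
 * as an all-lowercase "lang" query parameter ("en-gb").
 * Returns 0 on success, -1 if `out` is too small. */
int lang_param_from_tag(const char *tag, char *out, size_t cap)
{
    size_t i;

    if (cap == 0)
        return -1;
    for (i = 0; tag[i] != '\0'; i++) {
        char c = tag[i];

        if (i + 1 >= cap)
            return -1;                      /* no room for char + NUL */
        if (c >= 'A' && c <= 'Z')
            c = (char)(c - 'A' + 'a');      /* ASCII-only lowercasing */
        out[i] = c;
    }
    out[i] = '\0';
    return 0;
}
```

Only ASCII letters are folded, which is sufficient because BCP 47 subtags are restricted to ASCII letters and digits.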

Event Timeline

PhiLiP raised the priority of this task from to Needs Triage.
PhiLiP updated the task description.
PhiLiP subscribed.
PhiLiP set Security to None.

Actually, I would like your opinions on the two ways to fix the bug. I am unsure whether we should introduce tricky code here, but I'm also afraid it will take a long time for librsvg to accept the patch and publish a new release.

Change 269108 had a related patch set uploaded (by PhiLiP):
Fix the uppercase issue of "lang" parameter in ImagePage (T125710)

https://gerrit.wikimedia.org/r/269108

Change 269206 had a related patch set uploaded (by PhiLiP):
Fix the uppercase issue of "lang" parameter in ImagePage

https://gerrit.wikimedia.org/r/269206

Change 269206 abandoned by PhiLiP:
Fix the uppercase issue of "lang" parameter in ImagePage

Reason:
wrongly created

https://gerrit.wikimedia.org/r/269206

Change 269108 had a related patch set uploaded (by PhiLiP):
Fix the uppercase issue of "lang" parameter in ImagePage

https://gerrit.wikimedia.org/r/269108

The LANG environment variable of POSIX systems does not follow BCP 47; it has its own standard. Therefore this cannot be fixed in librsvg. The conversion from the BCP 47 format (or the internal format) to the POSIX format must be done in SVG.php.

@Fomafix

No. There is still a bug in librsvg that causes it to ignore any region or script suffix on the language code. librsvg is actually already case-insensitive, because it uses g_ascii_strncasecmp to compare language codes. The real problem is that librsvg always returns the first entry whose language code has the prefix "en", whether your language code is en-au, en-AU or even en-Au.

This is not exactly the same problem as the one in the Gerrit patch. There are actually two problems, and they are closely related:

  • MediaWiki only has a tiny frontend issue that prevents users from switching language if the target language code has a suffix like "-AU". That is what I want to address in the Gerrit patch.
  • librsvg always falls back from the system language to the first language it meets with the same primary language tag, without considering whether the system language is actually supported by the SVG.

Both problems lead to the same result: users cannot see the SVG rendered correctly in their language variant.

Ok, I see. librsvg has some deficits. This must be fixed there.

My comment focuses on the line

$env['LANG'] = $lang;

in SVG.php. Here the BCP 47 format must be converted to the POSIX format.
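A minimal sketch of the conversion suggested here, written in C to match the librsvg snippets elsewhere in this thread (SVG.php itself is PHP). The function name bcp47_to_posix is hypothetical, and the sketch deliberately handles only "language" and "language-REGION" tags; tags with script or variant subtags are rejected because POSIX locale names have no standard slot for them.

```c
#include <stddef.h>
#include <string.h>

static int ascii_is_alpha(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

static char ascii_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
}

static char ascii_upper(char c)
{
    return (c >= 'a' && c <= 'z') ? (char)(c - 'a' + 'A') : c;
}

/* Hypothetical helper: "en-gb" -> "en_GB", "en" -> "en".
 * Returns 0 on success, -1 for unsupported tags or a too-small buffer. */
int bcp47_to_posix(const char *tag, char *out, size_t cap)
{
    size_t len = strlen(tag);
    const char *dash = strchr(tag, '-');
    size_t lang_len = dash ? (size_t)(dash - tag) : len;
    size_t i;

    /* Primary language subtag: two or three letters. */
    if (lang_len < 2 || lang_len > 3 || len + 1 > cap)
        return -1;
    for (i = 0; i < lang_len; i++) {
        if (!ascii_is_alpha(tag[i]))
            return -1;
        out[i] = ascii_lower(tag[i]);
    }
    if (dash) {
        /* Expect exactly one two-letter region subtag after the hyphen. */
        if (len != lang_len + 3 ||
            !ascii_is_alpha(dash[1]) || !ascii_is_alpha(dash[2]))
            return -1;
        out[i++] = '_';
        out[i++] = ascii_upper(dash[1]);
        out[i++] = ascii_upper(dash[2]);
    }
    out[i] = '\0';
    return 0;
}
```

A caller would typically append a codeset such as ".UTF-8" before exporting the value as LANG; whether the consuming program copes with that suffix is a separate question.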

@Fomafix

I couldn't agree with you more about the format of the LANG environment variable. Sadly, librsvg currently follows the POSIX format badly. Until the patch I submitted, librsvg could not even handle the underscore format, such as "zh_CN". librsvg's language matching also fails if LANG has a charset suffix, as in "zh_CN.UTF-8", which is common on most modern Linux distributions.
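To illustrate what handling such values would involve, here is a hedged sketch that turns a POSIX-style LANG value into a hyphenated langtag by dropping the codeset and modifier. The function name posix_locale_to_langtag is hypothetical; this is not librsvg's code, only one way the normalization could look.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: "zh_CN.UTF-8" -> "zh-CN", "sr_RS@latin" -> "sr-RS".
 * Returns 0 on success, -1 if the value carries no language ("C", "POSIX",
 * empty) or `out` (capacity `cap`) is too small. */
int posix_locale_to_langtag(const char *locale, char *out, size_t cap)
{
    /* Length of the part before any ".codeset" or "@modifier". */
    size_t n = strcspn(locale, ".@");
    size_t i;

    if (n == 0 || n + 1 > cap)
        return -1;
    if ((n == 1 && locale[0] == 'C') ||
        (n == 5 && strncmp(locale, "POSIX", 5) == 0))
        return -1;                  /* no language information */

    for (i = 0; i < n; i++)
        out[i] = (locale[i] == '_') ? '-' : locale[i];
    out[n] = '\0';
    return 0;
}
```

The modifier is simply discarded here, so a script distinction expressed as "@latin" is lost; mapping modifiers to BCP 47 script subtags would need a lookup table.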

The POSIX variable LANG has its own specification: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html
I think it is problematic for librsvg to support both the BCP 47 format and the POSIX format in the same variable.

SVG uses the BCP 47 format internally, so it is stupid to convert a BCP 47 tag to the POSIX format and then back to BCP 47. The best way would be a separate librsvg parameter for passing the language code in BCP 47 format. The environment variable LANG would remain responsible for librsvg's own messages, such as error messages or help messages.

I'm pretty sure this bug is a duplicate of another bug, but I can't find it.

Basically, librsvg handles dashes in language names incorrectly, in violation of the SVG spec.

Forgive me, but I am but an ignorant newbie.

There are locale strings (with underscores) and langtags (with hyphens). RSVG seems to confuse the two notions. The patch that took locale strings and edited them into langtags seems ill advised. Using the name "locale" in RSVG is also ill advised when the comparisons are done on langtags.

RSVG should be using a library that already does langtag matching ala HTTP services. Java has such a library; I assume there is one for C++. (BTW, the Java library routines do convert langtags to locales.)

PhiLiP seems to have confused fallback matching, the SVG spec, and desired behavior. SVG 1.1 wants switch to use the first compatible systemLanguage clause. A fallback match is immediately in play; a fallback match is as good as an exact match; it should not be saved for later.

I'm basing the above comments on this patch: https://bugzilla.gnome.org/show_bug.cgi?id=761482

I would deprecate that patch, but right now I'm struggling with phabricator. I searched for this bug report yesterday, but did not find it. Bawolff had to point it out to me.

The RSVG fallback matching algorithm is just wrong. That's immediately apparent because it only looks at one hyphen; a langtag can have several hyphens (another reason why using a library for the matching would be appropriate). But there's a more insidious bug.

The SVG matching algorithm requires that the user agent's langtag (userLang) be compared to the systemLanguage langtags; there's a match if userLang equals a systemLanguage langtag (e.g. userLang "en-GB" matches systemLanguage "en-GB") or if userLang equals a systemLanguage langtag broken at a hyphen (e.g. userLang "en" will fallback-match systemLanguage "en-GB").

The SVG 1.1 matching algorithm never uses a subset of the user's langtag. If the user's langtag is "en-GB", then that user langtag does not fallback match the systemLanguage "en". A user demanding British English is not served a default English. See the "Implementation Note" at

http://www.w3.org/TR/SVG/struct.html#ConditionalProcessingSystemLanguageAttribute

which states: 'Evaluates to "true" if one of the languages indicated by user preferences exactly equals one of the languages given in the value of this parameter, or if one of the languages indicated by user preferences exactly equals a prefix of one of the languages given in the value of this parameter such that the first tag character following the prefix is "-".'

That's not the behavior most people expect, but that is the behavior that SVG 1.1 demands. RSVG is not supposed to decide that if the user asks for en-GB and en-US is available, then it will give him en-US. SVG 1.1 does not guess and does not weigh options.

(Note the spec uses the plural "user preferences". User agent allows multiple preferences ala AcceptLanguages.)

The insidious bug in RSVG's matching is that it will chop the user's langtag. Say the user's langtag is "en-GB" and systemLanguage is "en-US". RSVG compares the two strings, finds they are not equal, and then looks for a fallback match. It notices that "en-US" has a hyphen at position 2, so it compares the first two characters of the user's langtag and the systemLanguage:

g_ascii_strncasecmp(userLang, systemLang, 2)

The result is the two are erroneously declared a fallback match. All characters of the user's langtag must match, but match never looks at the "-GB".
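The flaw described above can be reproduced with a simplified reconstruction. This is not librsvg's actual source; it only mimics the behavior as described in this comment, and a plain ASCII comparison helper stands in for g_ascii_strncasecmp so the sketch compiles without GLib. The names flawed_fallback_match and ascii_ncasecmp are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for g_ascii_strncasecmp: ASCII case-insensitive compare of
 * at most n characters, stopping at the first NUL. */
static int ascii_ncasecmp(const char *a, const char *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int ca = (unsigned char)a[i];
        int cb = (unsigned char)b[i];

        if (ca >= 'A' && ca <= 'Z') ca += 'a' - 'A';
        if (cb >= 'A' && cb <= 'Z') cb += 'a' - 'A';
        if (ca != cb)
            return ca - cb;
        if (ca == 0)
            return 0;
    }
    return 0;
}

/* Reconstruction of the flawed fallback: only the characters before the
 * first hyphen of system_lang are compared, so the rest of user_lang
 * (for example "-GB") is never examined. */
int flawed_fallback_match(const char *user_lang, const char *system_lang)
{
    const char *dash = strchr(system_lang, '-');

    if (dash == NULL)
        return 0;
    return ascii_ncasecmp(user_lang, system_lang,
                          (size_t)(dash - system_lang)) == 0;
}
```

With this logic, flawed_fallback_match("en-GB", "en-US") reports a match, because only "en" is compared against "en".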

An appropriate test is (where systemLang is a langtag from systemLanguage)

if (g_ascii_strcasecmp(userLang, systemLang) == 0) {
    /* exact match */
} else if (strlen(userLang) < strlen(systemLang) &&
           g_ascii_strncasecmp(userLang, systemLang, strlen(userLang)) == 0 &&
           systemLang[strlen(userLang)] == '-') {
    /* subcomponent/fallback match */
} else {
    /* no match */
}

The RSVG fallback algorithm will even match user langtag "enx-VN" to "en-US".
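For reference, the SVG 1.1 rule quoted earlier can be written as a self-contained sketch that also walks the comma-separated systemLanguage list. It is an illustration under stated assumptions, not librsvg's implementation: the function names are hypothetical, a local helper replaces g_ascii_strncasecmp so no GLib is needed, and whitespace around list items is assumed to be insignificant.

```c
#include <stddef.h>
#include <string.h>

/* ASCII case-insensitive compare of exactly n characters; the caller
 * guarantees that both buffers hold at least n characters. */
static int ascii_neq_nocase(const char *a, const char *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int ca = (unsigned char)a[i];
        int cb = (unsigned char)b[i];

        if (ca >= 'A' && ca <= 'Z') ca += 'a' - 'A';
        if (cb >= 'A' && cb <= 'Z') cb += 'a' - 'A';
        if (ca != cb)
            return 1;
    }
    return 0;
}

/* True if user_lang equals the tag sys[0..sys_len), or equals a prefix
 * of it that is immediately followed by '-'. */
static int tag_matches(const char *user_lang, const char *sys, size_t sys_len)
{
    size_t user_len = strlen(user_lang);

    if (user_len == 0 || user_len > sys_len)
        return 0;
    if (ascii_neq_nocase(user_lang, sys, user_len))
        return 0;
    return user_len == sys_len || sys[user_len] == '-';
}

/* Evaluate a systemLanguage attribute value against one user langtag. */
int system_language_matches(const char *user_lang, const char *attr)
{
    const char *p = attr;

    while (*p != '\0') {
        size_t len;

        p += strspn(p, ", \t\r\n");     /* skip separators */
        len = strcspn(p, ", \t\r\n");   /* length of the next tag */
        if (len > 0 && tag_matches(user_lang, p, len))
            return 1;
        p += len;
    }
    return 0;
}
```

Because the whole user langtag must match, user langtag "en-GB" matches neither "en-US" nor plain "en", while user langtag "en" does fallback-match "en-GB". Supporting several user preferences would amount to calling system_language_matches once per preferred langtag.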

An apparent goal of SVG 2.0 is to use an acceptLanguages string ala SMIL allowReorder. An HTTP library would be even more appropriate for that exercise because weights must be calculated.

Change 269108 abandoned by Cparle:
Fix the uppercase issue of "lang" parameter in ImagePage

Reason:
Superseded by https://gerrit.wikimedia.org/r/#/c/384052/

https://gerrit.wikimedia.org/r/269108

Aklapper moved this task from Patch proposed upstream to Patch merged upstream on the Upstream board.

For the record, the last related downstream patch was https://gerrit.wikimedia.org/r/#/c/385352/ and it got merged.

Also note that upstream comments in https://gitlab.gnome.org/GNOME/librsvg/issues/131 imply that this has been fixed in librsvg 2.42.3.

Also note that librsvg 2.52.x has a new --accept-language parameter, which allows specifying the user's preferred languages by passing the HTTP Accept-Language header to librsvg: https://gitlab.gnome.org/GNOME/librsvg/-/issues/356 (Not sure if it will get backported to the 2.50.x series)