
Hyphenated langtags in Thumbor/7.3.2 and librsvg 2.44.10 do not show any text
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

*: https://commons.wikimedia.org/w/index.php?lang=az-latn&title=File%3AIPv6_header-en.svg

What happens?:
az-latn shows no text

What should have happened instead?:
az-latn should show the az-latn text

Software version (skip for WMF-hosted wikis like Wikipedia):
Thumbor/7.3.2
librsvg 2.44.10

Other information (browser name/version, screenshots, etc.):
Thumbor URLs

The az-latn Thumbor URL shows no text:

The az Thumbor URL shows az-latn text

T261192
T335361

librsvg 2.40 only matched langtags up to the first hyphen.
librsvg 2.44 does not even match the default.

The problem may be that az-latn is not a Unix locale string.

A possible workaround that restores the old librsvg 2.40 behavior is to truncate hyphenated langtags.

Event Timeline

Change 923368 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] svg: attempt to build valid locales from hyphenated languages

https://gerrit.wikimedia.org/r/923368

Thanks for the report and the test cases. This change attempts to build valid locales to fix both of the issues. I am curious as to whether our approach of using these languages in Thumbor when we used LANG rather than LC_ALL would have ever worked for these language tags.

More broadly I wonder whether we should return an error when we can't get a valid locale and a language is requested rather than just serve an empty generated image. The language-tags library exists for managing RFC-compliant language strings, but doesn't seem to offer us anything particularly useful for mapping between languages

> Thanks for the report and the test cases. This change attempts to build valid locales to fix both of the issues. I am curious as to whether our approach of using these languages in Thumbor when we used LANG rather than LC_ALL would have ever worked for these language tags.

Generally, there is not a one-to-one mapping between IETF langtags and locale strings. Consequently, I believe it is not a good idea to map a langtag to a Unix locale string and then hope that the Rust crate will map that locale string back to the original langtag. The screwiness is too complicated to go into right now. IIRC, converting the langtag en-us to the locale string en_US will match the langtag en-GB in an SVG file. The right thing to do is avoid locales entirely and pass the langtag through --accept-languages, but that must wait for next version of the operating system.

Some background.

Early versions of librsvg just read $LANG and (mis)treated it as an IETF langtag. Environment variables are just strings. That would have worked well except that Gnome got the IETF langtag string matching routine wrong. The matching bug meant that Gnome only matched the first subtag and ignored the rest. That is why zh-Hans matches zh-Hant in librsvg 2.40.
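To make the matching difference concrete, here is a small illustrative sketch. It is not librsvg's code, and both function names are made up for this example. The first function mimics the first-subtag comparison described above; the second follows the hyphen-bounded prefix rule as I understand it from the SVG 1.1 definition of systemLanguage.

```python
def first_subtag_match(user_lang: str, svg_lang: str) -> bool:
    """Compare only the first subtag (the 2.40 misbehavior described above)."""
    return user_lang.lower().split("-")[0] == svg_lang.lower().split("-")[0]


def prefix_match(user_lang: str, svg_lang: str) -> bool:
    """True if user_lang equals svg_lang or is a prefix of it ending at a hyphen."""
    user, svg = user_lang.lower(), svg_lang.lower()
    return svg == user or svg.startswith(user + "-")


print(first_subtag_match("zh-Hans", "zh-Hant"))  # True: only "zh" is compared
print(prefix_match("zh-Hans", "zh-Hant"))        # False: the second subtags differ
print(prefix_match("zh", "zh-Hant"))             # True: "zh" is a hyphen-bounded prefix
```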

Gnome then decided to use a Rust crate to do the IETF matching. That had some nice features for guessing a reasonable IETF langtag from the user's environment, but it ran into trouble because users could no longer set a particular langtag. For example, Unix and the Rust crate may understand the locale string zh_CN but not zh_Hans, so the crate could generate the langtag zh-CN but not zh-Hans. Even more troubling: what if the user wanted zh-CN-Hans? And, of course, WMF would be toast with its non-compliant sr-ec (which IETF would interpret as Serbian as spoken in Ecuador). The mapping between Unix locale strings and IETF langtags is not invertible.

Gnome fixed that problem with addition of --accept-languages, but that change comes after librsvg 2.44.10.

> More broadly I wonder whether we should return an error when we can't get a valid locale and a language is requested rather than just serve an empty generated image. The language-tags library exists for managing RFC-compliant language strings, but doesn't seem to offer us anything particularly useful for mapping between languages

Do not return an error. SVG wants IETF language tags, but it does not require them to be valid. Avoid thinking about converting between langtags and locales.

For right now, I would suggest the following approach.

If lang is not specified, then set it to en. I'm hoping that will fix the non-English default case. It would enforce the current WMF semantics that English is the default.

If lang contains a hyphen, then truncate lang to the first subtag. For example, modify zh-hans to zh. This step is not ideal, but it should match the (mis)behavior of librsvg 2.40.

Set the environment variable $LC_ALL to the modified lang. (I might also set $LANG.)

If librsvg 2.44.10 does not complain when given --accept-languages (it probably will complain), then I would consider passing it with the original value of lang. That way we are ready for the next upgrade.

(Side thought: can lang be used for script injection?)

Add unit tests to check whether lang works for non-English default and hyphenated langtags.
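The first three steps above could be sketched roughly as follows. This is only an illustration of the suggestion, not the Thumbor plugin code; the function name build_rsvg_env is hypothetical, and passing the --accept-languages flag is left out because it is not known to work on this librsvg version.

```python
import os


def build_rsvg_env(lang=None):
    """Return a copy of the environment with LC_ALL and LANG set for the rasterizer."""
    # Step 1: default to English when no language was requested.
    effective = lang or "en"
    # Step 2: keep only the first subtag (zh-hans -> zh), mimicking librsvg 2.40.
    effective = effective.split("-", 1)[0]
    # Step 3: export the truncated value; setting LANG as well is optional.
    env = dict(os.environ)
    env["LC_ALL"] = effective
    env["LANG"] = effective
    return env


print(build_rsvg_env("zh-hans")["LC_ALL"])  # zh
print(build_rsvg_env()["LC_ALL"])           # en
```

The returned dictionary would then be handed to whatever launches the rasterizer, for example as the env argument of subprocess.run.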

> The right thing to do is avoid locales entirely and pass the langtag through --accept-languages, but that must wait for next version of the operating system.

Packaging our own librsvg version or backporting from bullseye should be more possible than it was on stretch, since most (all?) of the rust buildchain dependency issues should be solved. I don't know if anyone's looked at the feasibility of doing that recently though.

+5 for ACN.

librsvg 2.44.10 is old and broken, so the right thing is to use a much more recent version. Clearly 2.44.10 is broken for hyphenated language tags and other issues. Upgrade to a modern version of librsvg and pass the langtag through accept-languages.

> Gnome fixed that problem with addition of --accept-languages, but that change comes after librsvg 2.44.10.

> More broadly I wonder whether we should return an error when we can't get a valid locale and a language is requested rather than just serve an empty generated image. The language-tags library exists for managing RFC-compliant language strings, but doesn't seem to offer us anything particularly useful for mapping between languages

> For right now, I would suggest the following approach.

> If lang is not specified, then set it to en. I'm hoping that will fix the non-English default case. It would enforce the current WMF semantics that English is the default.

> If lang contains a hyphen, then truncate lang to the first subtag. For example, modify zh-hans to zh. This step is not ideal, but it should match the (mis)behavior of librsvg 2.40.

> Set the environment variable $LC_ALL to the modified lang. (I might also set $LANG.)

> If librsvg 2.44.10 does not complain when given --accept-languages (it probably will complain), then I would consider passing it with the original value of lang. That way we are ready for the next upgrade.

librsvg 2.44.10 does not accept this flag. Version 2.50.3 (which is packaged with bullseye) doesn't either, so we will have to package our own version. I'll start the work on that.

While the solution in the patch is imperfect, it will get the rendering of images somewhat unblocked for the time being and will address issues like the ones mentioned in the cited images while also adding some support for distinctions between language variations like zh_hk and zh_tw.

> (Side thought: can lang be used for script injection?)

The change above will only allow the environment variable to be set if it is a valid locale and otherwise will default to en.
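As a rough illustration of guarding the environment variable against odd input, a shape check along these lines could be used. This is not the check in the merged change: the pattern and the name safe_lang are assumptions made for this sketch, and it only verifies that the string looks like a plain language tag, not that a matching locale exists on the system.

```python
import re

# Hypothetical, conservative pattern: ASCII letters for the first subtag,
# then optional hyphen-separated alphanumeric subtags.
_LANG_RE = re.compile(r"[A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*")


def safe_lang(lang, default="en"):
    """Return lang lowercased if it looks like a plain language tag, else default."""
    if isinstance(lang, str) and _LANG_RE.fullmatch(lang):
        return lang.lower()
    return default


print(safe_lang("az-latn"))       # az-latn
print(safe_lang("en; rm -rf /"))  # en
print(safe_lang(None))            # en
```

Anything containing spaces, shell metacharacters, or path separators falls back to the default, so only letters, digits, and hyphens can reach the environment.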

> Add unit tests to check whether lang works for non-English default and hyphenated langtags.

I've added a test for File:IPv6_header-en.svg to the linked CR to ensure that the correct image is generated when lang is set

I'm not sure if it helps if I provide simple SVG examples with systemLanguage=

The hyphen problem can be seen in e.g. https://commons.wikimedia.org/wiki/File:SystemLanguage.svg

Sometimes artificial lang-tags are used for including several images into one SVG: e.g. https://commons.wikimedia.org/wiki/File:Unicode_Geschlechtersymbole.svg

Looks like it's not just hyphenated codes; the "simple" code has the same problem.

"simple" should be mapped to "en-simple". There are more cases like this at https://meta.wikimedia.org/wiki/Special_language_codes, which shows what each code should be mapped to.
This is WMF-specific, so I doubt bullseye is going to cover this aspect.

Change 923368 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] svg: attempt to build valid locales from hyphenated languages

https://gerrit.wikimedia.org/r/923368

Change 930641 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: attempt to render hypenated svg languages better

https://gerrit.wikimedia.org/r/930641

Change 930641 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: attempt to render hypenated svg languages better

https://gerrit.wikimedia.org/r/930641

The temporary fix for hyphenated languages has mitigated some of the issues highlighted in this ticket. However, the correct solution is to build and deploy a more modern version of rsvg-convert, which will be done in the coming weeks.

@hnowlan
There is a related problem at T337199.

Consider the file

It should display "en" (the systemLanguage="en" English translation clause), but instead it displays "other" (the default clause).

That means that rsvg-convert does not know it is supposed to render English.

The relevant wiki URL is

That URL does not have an explicit language parameter, so I presume

will be entered without a lang parameter: That is

if hasattr(self.context.request, 'lang'):

will not be true. The consequence is that LC_ALL is never added to env, so the rasterizer uses some language that does not match "en".

MediaWiki semantics want a default URL (one with no lang parameter) to default to English.

When Thumbor processes a URL that has not set self.context.request.lang, then it should either force lang to "en" before further processing or explicitly set

env = {'LC_ALL': 'en'}

so that rsvg-convert knows to use the preferred language "en".
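A minimal sketch of that fallback, assuming the request is an object that may or may not carry a lang attribute. The function name rsvg_env_for is made up here, and SimpleNamespace merely stands in for the real request object.

```python
from types import SimpleNamespace


def rsvg_env_for(request):
    """Build the rasterizer environment, defaulting to English."""
    # getattr covers a missing attribute; "or" also covers None and "".
    lang = getattr(request, "lang", None) or "en"
    return {"LC_ALL": lang}


print(rsvg_env_for(SimpleNamespace()))           # {'LC_ALL': 'en'}
print(rsvg_env_for(SimpleNamespace(lang="de")))  # {'LC_ALL': 'de'}
```

Unlike a bare hasattr test, this also handles a request where lang is present but empty.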

Change 962563 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] svg: default to "en" when a language is not specified

https://gerrit.wikimedia.org/r/962563

Change 962563 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] svg: default to "en" when a language is not specified

https://gerrit.wikimedia.org/r/962563

The latest change appears to have improved the default on many of the supplied cases.

Closing this task for now, please reopen if needed or if this was done in error.