SVG language tag als reported as Swiss German; gsw is Swiss German.
Open, Needs Triage | Public | BUG REPORT

Description

The problem:

  • Go to https://commons.wikimedia.org/wiki/File:Abdomal_organs_body.svg
  • User Mrmw added a new SVG switch translation specifying systemLanguage="als". (Mrmw intended Swiss German, but that is beside the point.)
  • At the Commons Information page, the als langtag is reported as "Swiss German (gsw)" in the "Render this image in" dropdown box
  • I added to the SVG file another switch translation using systemLanguage="gsw".
  • The information page now reports two dropdown entries as "Swiss German (gsw)"; one for HTTP uselang=als and another for HTTP uselang=gsw.
  • See https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
  • gsw is the IETF langtag for Swiss German/Alemannic/Alsatian
  • als is the IETF langtag for Tosk Albanian

Expected Results (what should have happened?):

  • The SVG systemLanguage attribute uses IETF langtags to specify the language.
  • SVG does not use WMF language identifiers. https://als.wikipedia.org may be a Swiss German wiki, but als is not an IETF langtag for Swiss German.
  • MW should report systemLanguage="als" as Tosk Albanian or unknown. Within an SVG file, it should never be reported as Swiss German. It is not a langtag for Alsatian.

Note: When als.wikipedia.org transcludes a multilingual SVG file, it uses the correct default langtag of gsw.

Event Timeline

This problem is related to T6793 and T125073, but is separate because SVG parsers should not be using WMF language identifiers to interpret SVG IETF langtags.

Glrx renamed this task from "Language tag als reported as Swiss German; gsw is Swiss German." to "SVG language tag als reported as Swiss German; gsw is Swiss German." (Apr 11 2021, 11:25 PM)

There is indeed all kinds of confusion around this...

The metadata has IETF tags, and this is what feeds getAvailableLanguages() of the File class... of which the language format is undefined, it seems....

These are then used to match against the render languages, which should be wikimedia language codes (as used by uselang). But Wikimedia doesn't support all the language codes that are possible, just the ones for which it has interface translations... and some codes indeed will just be different and no reconciliation is attempted...

For me, lang and uselang have different types. In particular, they should not be treated as the same, and they should not be tested for equality without first converting one type to the other. Such conversions are not guaranteed; there may not be a one-to-one mapping.
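A minimal sketch of that type separation, written in Python rather than PHP so that it is not mistaken for MediaWiki code. The class names, the two tables, and both conversion functions are hypothetical; only the als/gsw pair comes from this task. The point is that a conversion can fail, and that equal spelling does not mean equal meaning.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SvgLangTag:
    """A (lowercased) langtag as found in an SVG systemLanguage attribute."""
    value: str


@dataclass(frozen=True)
class UseLang:
    """A MediaWiki interface language code, as passed in uselang."""
    value: str


# Hypothetical table of wiki codes whose IETF spelling differs.
# Only the als/gsw pair is taken from this task.
WIKI_TO_IETF = {"als": "gsw"}

# Hypothetical set of interface languages the wiki supports.
SUPPORTED_USELANG = {"als", "de", "en", "fr"}


def uselang_to_svg(code: UseLang) -> SvgLangTag:
    """Convert a wiki code to the langtag to look for in an SVG file."""
    return SvgLangTag(WIKI_TO_IETF.get(code.value, code.value))


def svg_to_uselang(tag: SvgLangTag) -> Optional[UseLang]:
    """Convert an SVG langtag to a wiki code, or None if there is none."""
    # A langtag spelled like a differently-meant wiki code (als is Tosk
    # Albanian in IETF terms) must not be converted to that wiki code.
    if tag.value in WIKI_TO_IETF:
        return None
    ietf_to_wiki = {ietf: wiki for wiki, ietf in WIKI_TO_IETF.items()}
    candidate = ietf_to_wiki.get(tag.value, tag.value)
    return UseLang(candidate) if candidate in SUPPORTED_USELANG else None
```

With these assumptions, uselang als converts to the SVG langtag gsw, the SVG langtag gsw converts back to uselang als, and the SVG langtag als has no uselang equivalent at all.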

For SVG files, lang is a (lowercased) IETF langtag. When an SVG file page offers the list of renderable languages, that list should be the IETF langtags present in the SVG file. That is why als should be reported as Tosk Albanian (or just als).

SVG langtags are supposed to be IETF langtags, but the SVG specification is loose. It allows non-conforming tags, and there are many files on Commons that have such tags. In fact, SVG Translate produces non-conforming tags such as ku_ARAB, sr_CYRL, and zh_HANS. (Underscores are not permitted in IETF-conforming langtags, and the SVG spec wants legitimate langtags.) I'm not a fan of nonstandard tags, but MW could display them. (SVG Translate should be producing IETF-compliant tags, but that is a separate tale of woe about a librsvg workaround.)

It is also possible to have legitimate IETF langtags such as the overspecified en-Latn-US or the custom en-BZ-x-naught. Such tags should show up in the dropdown list, and (under current MediaWiki semantics) wikitext should be able to display them.
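To make the underscore point concrete, here is a deliberately loose shape check in Python. It is an illustration only: the function name is hypothetical, and the pattern tests just the overall shape (hyphen-separated subtags of 1 to 8 ASCII alphanumerics, the first one alphabetic). The full BCP 47 grammar is stricter than this.

```python
import re

# Loose shape check only: hyphen-separated subtags of 1-8 ASCII alphanumerics,
# with an alphabetic first subtag. This is NOT the complete BCP 47 grammar.
_LOOSE_LANGTAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def looks_like_langtag(tag: str) -> bool:
    """Return True if tag has the general shape of an IETF langtag."""
    return _LOOSE_LANGTAG.fullmatch(tag) is not None
```

Under this check ku_ARAB, sr_CYRL, and zh_HANS are rejected because of the underscore, while en-Latn-US, en-BZ-x-naught, gsw, and als all pass. Passing says nothing about what a tag means: als is well formed, and it is Tosk Albanian.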

uselang is a MediaWiki type that need not and does not comply with IETF (it is similar, but many MW language codes were assigned before the IETF assignments). There are als, sr-el, and sr-ec wikis (the last two may have been switched over). IIRC, those wikis are aware of the standard IETF langtags they should use for SVG. When an SVG file is used on the als.Wiki, the wiki looks for a gsw langtag (not an als langtag) in the SVG file and uses gsw if present.

getAvailableLanguages() should return an array of reasonable type for the Handler. For the SVGHandler, that would be lowercased (pseudo) IETF langtags gleaned from the file. Another Handler might deal with files that use some other convention to identify languages (e.g., WAV files use Microsoft 16-bit dialect-and-language codes, so the strings might be 4-digit hexadecimal numbers). Whatever the type, it falls to the Handler to interpret those strings. The rest of MediaWiki should treat them as opaque strings.

I expect the render language to be separate from uselang. I believe if I'm on the de.Wiki and look at an SVG file with German translations and click the button to go to Commons, then the link will just set uselang (and not set lang):

So I'm taken to Commons and see the English translation of the file with a German user interface. If I want to see the German translation, then I use the dropdown box to select the German translation. I believe that was a deliberate choice by the designers of the user interface.

ImagePage.php has some IETF or SVG-specific code in it.

The code that produces the dropdown:

protected function doRenderLangOpt( array $langChoices, $renderLang ) {

The $langChoices would be a set of IETF langtags that were extracted from the SVG file. Consequently, we would expect those langtags to be IETF langtags. They come from an SVG file; they do not come from a Wiki.

At line 116, the option for the dropdown is created for a $lang in $langChoices. The $lang is an IETF (a.k.a. BCP 47) langtag:

private function createXmlOptionStringForLanguage( $lang, $selected ) {
    $code = LanguageCode::bcp47( $lang );

The intention of bcp47() is to return the langtag in its preferred form (e.g., zh-Hant-TW) to find a description of that language.

That intention goes awry in LanguageCode.php. The $lang is treated not as an IETF langtag but rather as a Wiki langtag for a specific case:

public static function bcp47( $code ) {
    $code = self::replaceDeprecatedCodes( strtolower( $code ) );

With replaceDeprecatedCodes() consulting a table that will replace als with gsw:

private const DEPRECATED_LANGUAGE_CODE_MAPPING = [
    // Note that als is actually a valid ISO 639 code (Tosk Albanian), but it
    // was previously used in MediaWiki for Alsatian, which comes under gsw
    'als' => 'gsw', // T25215
    'bat-smg' => 'sgs', // T27522
    'be-x-old' => 'be-tarask', // T11823
    'fiu-vro' => 'vro', // T31186
    'roa-rup' => 'rup', // T17988
    'zh-classical' => 'lzh', // T30443
    'zh-min-nan' => 'nan', // T30442
    'zh-yue' => 'yue', // T30441
];

This table confuses Wiki and IETF intentions. IETF never used als for Alsatian.

The other substitutions can make sense. Genuine IETF langtags can map into preferred langtags, and those preferred langtags can be used to determine a reasonable text description of the language.

Yes, these are well-known confusing parts of MW language support. It's difficult to clean up, but I can make some improvements...

I think I can get als supported, but for now it will still be listed as 'Allemannisch (als)'

Change 804703 had a related patch set uploaded (by TheDJ; author: TheDJ):

[mediawiki/core@master] Allow any SVG language tag that is well formed IETF

https://gerrit.wikimedia.org/r/804703

@TheDJ
ImagePage.php 1118-1131 seems a little off.

You do not want to change $lang (line 1120); that is the canonized code in the SVG file. Treat $lang as a constant. We do not want zh-hant turning into zh-Hant.

It is OK to have a local $code that is derived from $lang for the sole purpose of finding the language name. Therefore, it is OK for zh-cmn to transmogrify into cmn and then use cmn to find the descriptive string "Mandarin". We need the option attribute value to be the original $lang and not the $code. If a descriptive string is not found, then use (the unchanged) $lang, as on line 1130.
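A rough Python sketch of that rule, not the actual ImagePage.php code. Both lookup tables, the helper names, and the "Name (tag)" label format are assumptions made for illustration. The derived code stays local to the name lookup, and the option value is always the untouched langtag from the file.

```python
from html import escape
from typing import Optional

# Hypothetical name table keyed by preferred-form code; it stands in for
# whatever language-name lookup MediaWiki really uses.
LANGUAGE_NAMES = {"cmn": "Mandarin", "de": "German"}

# Hypothetical mapping from a langtag to its preferred form.
PREFERRED_FORM = {"zh-cmn": "cmn"}


def lookup_name(lang: str) -> Optional[str]:
    """Find a descriptive name; the derived code never leaves this function."""
    code = PREFERRED_FORM.get(lang, lang)
    return LANGUAGE_NAMES.get(code)


def option_for_language(lang: str, selected: bool) -> str:
    """Build one dropdown option whose value is the original langtag."""
    name = lookup_name(lang)
    # Fall back to the unchanged langtag when no name is known.
    label = f"{name} ({lang})" if name else lang
    sel = " selected" if selected else ""
    value = escape(lang, quote=True)
    return f'<option value="{value}"{sel}>{escape(label)}</option>'
```

Here zh-cmn is labelled "Mandarin (zh-cmn)" but keeps zh-cmn as its value, and zh-hant, which has no entry in the toy name table, is shown as plain zh-hant rather than being rewritten to zh-Hant.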

Language could have a bcp47Strict() that takes only IETF langtags. Then bcp47() can do the Wiki sub "als" => "gsw" and call bcp47Strict() to do any other work. Ideally, calls to bcp47() would be replaced with either bcp47Strict() or bcp47Wiki(), but that would take a lot of work.
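A sketch of that split, in Python and under stated assumptions: the function names mirror the proposal but are not existing MediaWiki methods, the wiki table holds only the als entry from this task, and the strict function applies just the conventional BCP 47 casing (two-letter subtags uppercase and four-letter subtags titlecase, unless they start the tag or follow a singleton). It does not attempt IETF deprecated-to-preferred mappings such as zh-cmn to cmn.

```python
# Wiki-specific substitution, kept out of the strict path.
# Only the als entry is taken from this task.
WIKI_ONLY_CODES = {"als": "gsw"}


def bcp47_strict(tag: str) -> str:
    """Apply conventional BCP 47 casing; no wiki substitutions."""
    parts = tag.lower().split("-")
    out = [parts[0]]
    # Subtags after a singleton (e.g. x) stay lowercase.
    after_singleton = len(parts[0]) == 1
    for sub in parts[1:]:
        if len(sub) == 1:
            after_singleton = True
        elif not after_singleton and len(sub) == 2:
            sub = sub.upper()      # region, e.g. tw -> TW
        elif not after_singleton and len(sub) == 4:
            sub = sub.title()      # script, e.g. hant -> Hant
        out.append(sub)
    return "-".join(out)


def bcp47_wiki(code: str) -> str:
    """Map a wiki language code to a langtag, then normalise it."""
    lowered = code.lower()
    return bcp47_strict(WIKI_ONLY_CODES.get(lowered, lowered))
```

With this split, bcp47_strict("als") stays als, so an SVG langtag is never reinterpreted, while bcp47_wiki("als") still yields gsw for callers that really hold a wiki code.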

Change 804703 merged by jenkins-bot:

[mediawiki/core@master] Allow any SVG language tag that is well formed IETF

https://gerrit.wikimedia.org/r/804703

@TheDJ

Thanks for adding more unit tests.

However, that gets into another issue. (Do you want me to open it as a separate issue?)

SvgHandler::getMatchedLanguage() is semantically incorrect and its existing and new unit tests are faulty.

	/**
	 * SVG's systemLanguage matching rules state:
	 * 'The `systemLanguage` attribute ... [e]valuates to "true" if one of the languages indicated
	 * by user preferences exactly equals one of the languages given in the value of this parameter,
	 * or if one of the languages indicated by user preferences exactly equals a prefix of one of
	 * the languages given in the value of this parameter such that the first tag character
	 * following the prefix is "-".'
	 *
	 * Return the first element of $svgLanguages that matches $userPreferredLanguage
	 *
	 * @see https://www.w3.org/TR/SVG/struct.html#SystemLanguageAttribute
	 * @param string $userPreferredLanguage
	 * @param string[] $svgLanguages
	 * @return string|null
	 */
	public function getMatchedLanguage( $userPreferredLanguage, array $svgLanguages ) {
		// Explicitly requested undetermined language (text without svg systemLanguage attribute)
		if ( $userPreferredLanguage === 'und' ) {
			return 'und';
		}
		foreach ( $svgLanguages as $svgLang ) {
			if ( strcasecmp( $svgLang, $userPreferredLanguage ) === 0 ) {
				return $svgLang;
			}
			$trimmedSvgLang = $svgLang;
			while ( strpos( $trimmedSvgLang, '-' ) !== false ) {
				$trimmedSvgLang = substr( $trimmedSvgLang, 0, strrpos( $trimmedSvgLang, '-' ) );
				if ( strcasecmp( $trimmedSvgLang, $userPreferredLanguage ) === 0 ) {
					return $svgLang;
				}
			}
		}
		return null;
	}

Issue 1
In particular, the statement

	 * Return the first element of $svgLanguages that matches $userPreferredLanguage

is wrong. The method should return $userPreferredLanguage rather than an element of $svgLanguages if there is a match. A match signifies that $userPreferredLanguage is an acceptable langtag for the SVG user agent.

The error is subtle, so I will try to make a clear but contrived example. Say an SVG file has many languages (e.g., en, de, and fr), but say the SVG file also distinguishes British and American spellings in a few (but not all) places. It might have

  • <text systemLanguage="de">Center</text>
  • <text systemLanguage="en-GB">centre</text>
  • <text systemLanguage="en-US">center</text>

in one switch but not need to distinguish elsewhere

  • <text systemLanguage="de">Strasse</text>
  • <text systemLanguage="en">street</text>

So $svgLanguages will be ["en-GB", "en-US", "en", "de", "fr"].

Clearly,

  • getMatchedLanguage("en-GB", $svgLanguages) should produce en-GB and
  • getMatchedLanguage("en-US", $svgLanguages) should produce en-US

but what should

  • getMatchedLanguage("en", $svgLanguages) produce?

The given code might produce en, en-US, or en-GB depending on the order $svgLanguages is traversed. First match wins. That inconsistent values are possible should raise a red flag.

Notice that en will display all three strings (which is what is intended), but the other matches will display only one string.

Returning $userPreferredLanguage is the proper action. The user wanted that langtag, and that langtag is supported in the SVG file. The same value is returned no matter the order of $svgLanguages.
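The order dependence can be reproduced with a line-by-line Python port of the getMatchedLanguage() logic quoted above. The port and its function name are mine, written only to show the behaviour; Python's lower() stands in for PHP's strcasecmp(), which is equivalent for ASCII langtags.

```python
from typing import List, Optional


def first_match(user_lang: str, svg_langs: List[str]) -> Optional[str]:
    """Port of the quoted method: returns an element of svg_langs."""
    if user_lang == "und":
        return "und"
    for svg_lang in svg_langs:
        if svg_lang.lower() == user_lang.lower():
            return svg_lang
        # Strip trailing subtags one at a time and compare again.
        trimmed = svg_lang
        while "-" in trimmed:
            trimmed = trimmed[: trimmed.rindex("-")]
            if trimmed.lower() == user_lang.lower():
                return svg_lang
    return None
```

For the user language en, the list ["en-GB", "en-US", "en", "de", "fr"] gives en-GB, the list ["en-US", "en-GB", "en", "de", "fr"] gives en-US, and the list ["en", "en-GB", "en-US", "de", "fr"] gives en: three different answers for the same file contents.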

Issue 2
The SVG/BCP47 matching test in getMatchedLanguage() is also overly complex. It does not need to break at multiple hyphens. The test is much simpler. Either the langtags are an exact match:

  • strcasecmp($userPreferredLanguage, $svgLang) === 0

or $userPreferredLanguage is a proper prefix of $svgLang and the following character is a hyphen. I do not know PHP, but something like

  • strlen($userPreferredLanguage) < strlen($svgLang) AND
  • strcasecmp($userPreferredLanguage, substr($svgLang, 0, strlen($userPreferredLanguage))) === 0 AND
  • $svgLang[strlen($userPreferredLanguage)] === '-'
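The same test as a small runnable Python sketch, combined with the Issue 1 suggestion to return the user's langtag. Both function names are hypothetical, and the special case for und in the existing method is left out for brevity.

```python
from typing import Iterable, Optional


def svg_lang_matches(user_lang: str, svg_lang: str) -> bool:
    """True on an exact match, or when user_lang is a prefix followed by '-'."""
    user, svg = user_lang.lower(), svg_lang.lower()
    return svg == user or svg.startswith(user + "-")


def get_matched_language(user_lang: str,
                         svg_langs: Iterable[str]) -> Optional[str]:
    """Return user_lang itself if any langtag in the file matches it."""
    if any(svg_lang_matches(user_lang, s) for s in svg_langs):
        return user_lang
    return None
```

Because the return value is the user's own langtag, get_matched_language("en", ...) gives en for any ordering of ["en-GB", "en-US", "en", "de", "fr"]. The prefix test also rejects near misses: en matches en-GB but not eng, and en-GB does not match a file that only has en.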