MediaWiki mishandles hyphenated language tags in SVG files.
Open, Needs Triage | Public | BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Go to Commons File:SVG_MW_Test_Switch.svg. That file has generic Spanish (es) and several langtags for Spanish-speaking countries.
  • Select "Render this image in" Spanish (es) and click "GO".
  • That should take you to https://commons.wikimedia.org/w/index.php?lang=es&title=File%3ASVG_MW_Test_Switch.svg (or just go directly to that URL)
  • Look at the "Render this image in" dropdown. It says "es-ag" rather than "Spanish (es)".
  • View the file page HTML source and find the img element that displays the main image. Its src attribute is src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f6/SVG_MW_Test_Switch.svg/langes-ag-512px-SVG_MW_Test_Switch.svg.png?20220703211320"

What happens?:
The "Render this image in" dropdown displays "es-ag" rather than "Spanish (es)".
The src attribute has .../langes-ag-512px... rather than .../langes-512px...

What should have happened instead?:
The dropdown should display "Spanish (es)".
The src attribute should be .../langes-512px...

The File page also displays a gallery of the SVG with |lang=es and |lang=es-MX. The src attributes on those img elements are correct, but the rendering is incorrect for Mexican Spanish due to a librsvg langtag matching bug. This report is not about that bug.

Event Timeline

Text copied from T279874:

However, that gets into another issue.

SvgHandler::getMatchedLanguage() is semantically incorrect and its existing and new unit tests are faulty.

	/**
	 * SVG's systemLanguage matching rules state:
	 * 'The `systemLanguage` attribute ... [e]valuates to "true" if one of the languages indicated
	 * by user preferences exactly equals one of the languages given in the value of this parameter,
	 * or if one of the languages indicated by user preferences exactly equals a prefix of one of
	 * the languages given in the value of this parameter such that the first tag character
	 * following the prefix is "-".'
	 *
	 * Return the first element of $svgLanguages that matches $userPreferredLanguage
	 *
	 * @see https://www.w3.org/TR/SVG/struct.html#SystemLanguageAttribute
	 * @param string $userPreferredLanguage
	 * @param string[] $svgLanguages
	 * @return string|null
	 */
	public function getMatchedLanguage( $userPreferredLanguage, array $svgLanguages ) {
		// Explicitly requested undetermined language (text without svg systemLanguage attribute)
		if ( $userPreferredLanguage === 'und' ) {
			return 'und';
		}
		foreach ( $svgLanguages as $svgLang ) {
			if ( strcasecmp( $svgLang, $userPreferredLanguage ) === 0 ) {
				return $svgLang;
			}
			$trimmedSvgLang = $svgLang;
			while ( strpos( $trimmedSvgLang, '-' ) !== false ) {
				$trimmedSvgLang = substr( $trimmedSvgLang, 0, strrpos( $trimmedSvgLang, '-' ) );
				if ( strcasecmp( $trimmedSvgLang, $userPreferredLanguage ) === 0 ) {
					return $svgLang;
				}
			}
		}
		return null;
	}

Issue 1
In particular, the statement

	 * Return the first element of $svgLanguages that matches $userPreferredLanguage

is wrong. The method should return $userPreferredLanguage rather than an element of $svgLanguages if there is a match. A match signifies that $userPreferredLanguage is an acceptable langtag for the SVG user agent.

The error is subtle, so I will try to give a clear but contrived example. Say an SVG file has many languages (e.g., en, de, and fr), and it also distinguishes British and American spellings in a few (but not all) places. It might have

  • <text systemLanguage="de">Center</text>
  • <text systemLanguage="en-GB">centre</text>
  • <text systemLanguage="en-US">center</text>

in one switch but not need to distinguish elsewhere

  • <text systemLanguage="de">Strasse</text>
  • <text systemLanguage="en">street</text>

So $svgLanguages will be ["en-GB", "en-US", "en", "de", "fr"].

Clearly,

  • getMatchedLanguage("en-GB", $svgLanguages) should produce en-GB and
  • getMatchedLanguage("en-US", $svgLanguages) should produce en-US

but what should

  • getMatchedLanguage("en", $svgLanguages) produce?

The given code might produce en, en-US, or en-GB depending on the order in which $svgLanguages is traversed, because the first match wins. The fact that the result depends on array order should raise a red flag.
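The order dependence can be checked with a standalone copy of the quoted loop. This is only a sketch: the free function name currentGetMatchedLanguage is hypothetical (the real code is a method on SvgHandler), and the three arrays are the contrived example above in different orders.

```php
<?php
// Standalone copy of the quoted matching logic, for illustration only.
// The function name is hypothetical; the body mirrors the method above.
function currentGetMatchedLanguage( string $userPreferredLanguage, array $svgLanguages ): ?string {
	if ( $userPreferredLanguage === 'und' ) {
		return 'und';
	}
	foreach ( $svgLanguages as $svgLang ) {
		if ( strcasecmp( $svgLang, $userPreferredLanguage ) === 0 ) {
			return $svgLang;
		}
		// Strip trailing subtags one at a time and compare each prefix.
		$trimmedSvgLang = $svgLang;
		while ( strpos( $trimmedSvgLang, '-' ) !== false ) {
			$trimmedSvgLang = substr( $trimmedSvgLang, 0, strrpos( $trimmedSvgLang, '-' ) );
			if ( strcasecmp( $trimmedSvgLang, $userPreferredLanguage ) === 0 ) {
				return $svgLang;
			}
		}
	}
	return null;
}

// Same set of langtags, three traversal orders, three different answers.
$orderA = [ 'en-GB', 'en-US', 'en', 'de', 'fr' ];
$orderB = [ 'en-US', 'en-GB', 'en', 'de', 'fr' ];
$orderC = [ 'en', 'en-GB', 'en-US', 'de', 'fr' ];

var_dump( currentGetMatchedLanguage( 'en', $orderA ) ); // "en-GB"
var_dump( currentGetMatchedLanguage( 'en', $orderB ) ); // "en-US"
var_dump( currentGetMatchedLanguage( 'en', $orderC ) ); // "en"
```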

Notice that en matches all three English strings (which is what is intended), but en-GB and en-US each match only one of them.

Returning $userPreferredLanguage is the proper action. The user wanted that langtag, and that langtag is supported in the SVG file. The same value is returned no matter the order of $svgLanguages.
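A minimal sketch of that change, assuming nothing else in the method needs to change. The function name proposedGetMatchedLanguage is hypothetical and this is not a tested MediaWiki patch; it walks the same prefixes as the quoted loop but returns the requested langtag on a match.

```php
<?php
// Hypothetical free-function variant of the quoted method: the prefixes
// tested are the same, but a match returns the requested langtag.
function proposedGetMatchedLanguage( string $userPreferredLanguage, array $svgLanguages ): ?string {
	if ( $userPreferredLanguage === 'und' ) {
		return 'und';
	}
	foreach ( $svgLanguages as $svgLang ) {
		$candidate = $svgLang;
		while ( true ) {
			if ( strcasecmp( $candidate, $userPreferredLanguage ) === 0 ) {
				// Return what the user asked for, not the SVG's langtag.
				return $userPreferredLanguage;
			}
			$hyphen = strrpos( $candidate, '-' );
			if ( $hyphen === false ) {
				break;
			}
			// Drop the last subtag and try the shorter prefix.
			$candidate = substr( $candidate, 0, $hyphen );
		}
	}
	return null;
}

// The result no longer depends on the order of $svgLanguages.
var_dump( proposedGetMatchedLanguage( 'en', [ 'en-GB', 'en-US', 'en', 'de', 'fr' ] ) ); // "en"
var_dump( proposedGetMatchedLanguage( 'en', [ 'en', 'en-US', 'en-GB', 'de', 'fr' ] ) ); // "en"
var_dump( proposedGetMatchedLanguage( 'en-GB', [ 'en-GB', 'en-US', 'en' ] ) );          // "en-GB"
```

With an illustrative list such as [ 'es-AG', 'es' ], a request for 'es' yields 'es' rather than 'es-AG', which is the behavior the bug report above asks for.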

Issue 2
The SVG/BCP47 matching test in getMatchedLanguage() is also overly complex. It does not need to break at multiple hyphens. The test is much simpler. Either the langtags are an exact match:

  • strcasecmp($userPreferredLanguage, $svgLang) === 0

or $userPreferredLanguage is a proper prefix of $svgLang and the character that follows the prefix is a hyphen. I do not know PHP, but something like

  • strlen($userPreferredLanguage) < strlen($svgLang) AND
  • strcasecmp($userPreferredLanguage, substr($svgLang, 0, strlen($userPreferredLanguage))) === 0 AND
  • $svgLang[strlen($userPreferredLanguage)] === '-'
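The two-case test described above might be written in PHP roughly as follows. This is a sketch, not MediaWiki code: the helper name svgLangtagMatches is hypothetical, and it uses strncasecmp() for the case-insensitive prefix comparison instead of substr() plus strcasecmp().

```php
<?php
// Hypothetical helper: true if $userPreferredLanguage equals $svgLang, or is
// a prefix of $svgLang that is immediately followed by a hyphen.
function svgLangtagMatches( string $userPreferredLanguage, string $svgLang ): bool {
	// Case 1: exact match, ignoring case.
	if ( strcasecmp( $userPreferredLanguage, $svgLang ) === 0 ) {
		return true;
	}
	// Case 2: proper prefix followed by '-'.
	$len = strlen( $userPreferredLanguage );
	return strlen( $svgLang ) > $len
		&& strncasecmp( $userPreferredLanguage, $svgLang, $len ) === 0
		&& $svgLang[$len] === '-';
}

var_dump( svgLangtagMatches( 'en', 'en-GB' ) );           // bool(true)
var_dump( svgLangtagMatches( 'zh-Hant', 'zh-Hant-TW' ) ); // bool(true)
var_dump( svgLangtagMatches( 'en', 'eng' ) );             // bool(false)
var_dump( svgLangtagMatches( 'en-GB', 'en' ) );           // bool(false)
```

Because the check only looks at the single character after the prefix, it covers langtags with any number of subtags without a trimming loop.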