Page MenuHomePhabricator

Bad language code: zh_Hans should be zh-Hans
Open, Needs TriagePublic5 Estimated Story Points

Description

A file has contracted a case of bad language code upon use of SVG Translate Tool. The revision in question is https://upload.wikimedia.org/wikipedia/commons/archive/d/dc/20201231085805%21Research_design_and_evidence_-_Capho.svg which has systemLanguage="zh_HANS" (with an underscore); it works after being changed to https://upload.wikimedia.org/wikipedia/commons/d/dc/Research_design_and_evidence_-_Capho.svg which has systemLanguage="zh-Hans" (with a hyphen).

To reproduce, just choose 中文(简体)as the target language. Typing "zh" in the search box helps.

The language code should be normalized to all lowercase and with a hyphen separator, regardless of what format is already in the file.

Acceptance criteria:

  • No file being output from the tool should be able to have a systemLanguage attribute in the wrong format.

Related Objects

Mentioned In
T319259: Check that document element is <svg> and in the right namespace
rGSVTd5fc9388e033: Allow tspan to be in svg namespace (#622)
T248252: SVG Translate: Skip unsupported text pattern and continue with the supported ones
rGSVT91d30977e118: Create (and catch) exceptions for all existing error states
rGSVT9561b2678e15: Create (and catch) exceptions for all existing error states
rGSVTb8c23d973c36: Create (and catch) exceptions for all existing error states
T316741: Allow svg namespace prefixes other than 'svg'
rGSVT599570f1c6a7: Allow tspan to be in svg namespace
rGSVTdacd8d2f86fc: Allow tspan to be in svg namespace
rGSVT91689a31bb6b: Fix regex to find $1, $2, etc.
rGSVT05d2724e6765: Normalize lang codes when reading SVGs and writing translations
T261192: Rendering multilingual (systemLanguage) SVG files fails locally after upgrading librsvg from 2.40.21 to 2.44.10
T40010: RFC: Re-evaluate librsvg as SVG renderer on Wikimedia wikis
T275263: Translation dropdown not available on File: page after translating a specific SVG file on Commons via svgtranslate tool
T271595: SVG translate tool replaces all fields with "$1" (style element needs at least one trailing character)
Mentioned Here
T271595: SVG translate tool replaces all fields with "$1" (style element needs at least one trailing character)
T221382: [BUG] Some CSS selectors break translation input
T319259: Check that document element is <svg> and in the right namespace
T316310: Add Phan to SVG Translate CI
T316741: Allow svg namespace prefixes other than 'svg'
T221453: Add "newer" open fonts
T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive
T261192: Rendering multilingual (systemLanguage) SVG files fails locally after upgrading librsvg from 2.40.21 to 2.44.10
T40010: RFC: Re-evaluate librsvg as SVG renderer on Wikimedia wikis
T154237: SVG image wikisyntax can't use "lang=zh-hant"
T265549: Update librsvg to > 2.44.10

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

This tag is far less visible than a column. Tickets wont get done, and new developers won't onboard. This project direly need new, active developers.

@Yug: It is not, as tags are shown on workboards. Having 2000 incompatible, different per-projcet ways that make it impossible to find "easy to pick" tasks across projects does not help, when good first task are mentioned on https://www.mediawiki.org/wiki/New_Developers and other pages.

Received. Thanks for the alternatives.

I noticed that one user made several Kurdish translations in Arabic script that failed to show up. Looking at the SVG, the files had systemLanguage="ku_ARAB" instead of systemLanguage="ku-Arab" (capitalization is not important).

I went to https://svgtranslate.toolforge.org/File:100_Years_War_France_1435.svg and selected default language to ku-Arab (I typed in "ku" and then selected the language in Arabic script), made a test edit, and then downloaded the file. The file used systemLanguage="ku_ARAB" for my test edit.; it should have used ku-Arab.

That suggests SVG Translate is using Unix locale strings instead of IETF langtags. The correct SVG syntax uses a hyphen rather than an underscore.

Please do not be confused by WMF feeding librsvg a langtag through its $LANG environment (locale) variable. That is a hack, and it will bite us in the upgrade to the newer version of librsvg. See T154237, T265549, and T40010.

The code should never be looking for underscores in a systemLanguage attribute.

svgtranslate/src/Model/Svg/SvgFile.php /

/**
 * Reorder text elements within the document so that all sublocales (i.e. systemLanguage values
 * containing a hyphen or underscore, e.g. de_CH) are moved to the beginning of the switch
 * element, and all fallback elements are moved to the end.
 */
protected function reorderTexts(): void

The underscore problem seems to be even more involved.

SVG has been invoked several times to add zh-Hant and zh-TW translation to https://commons.wikimedia.org/wiki/File:2022_Russian_invasion_of_Ukraine.svg. The resulting SVG has several switch clauses with identical systemLanguage attributes. There should be only one clause per language.

<switch id="switch2174" transform="translate(1853.7,532.54)" class="place" font-size="5.34px">
  <text id="text3001-zh-tw" systemLanguage="zh_TW"><tspan id="trsvg142-zh-tw">基夫沙里夫卡</tspan></text>
  <text id="text3001-zh-hant" systemLanguage="zh_HANT"><tspan id="trsvg142-zh-hant">基夫沙里夫卡</tspan></text>
  <text id="text3001-zh-tw" systemLanguage="zh_TW"><tspan id="trsvg142-zh-tw">基夫沙里夫卡</tspan></text>
  <text id="text3001-zh-hant" systemLanguage="zh_HANT"><tspan id="trsvg142-zh-hant">基夫沙里夫卡</tspan></text>
  <text id="text6217" systemLanguage="zh_TW"><tspan id="tspan6215">基夫沙里夫卡</tspan></text>
  <text id="text6221" systemLanguage="zh_HANT"><tspan id="tspan6219">基夫沙里夫卡</tspan></text>
  <text id="text6225" systemLanguage="zh_TW"><tspan id="tspan6223">基夫沙里夫卡</tspan></text>
  <text id="text6229" systemLanguage="zh_HANT"><tspan id="tspan6227">基夫沙里夫卡</tspan></text>
  <text id="text6233" systemLanguage="zh_TW"><tspan id="tspan6231">基夫沙里夫卡</tspan></text>
  <text id="text6237" systemLanguage="zh_HANT"><tspan id="tspan6235">基夫沙里夫卡</tspan></text>
  <text id="text6241" systemLanguage="zh_TW"><tspan id="tspan6239">基夫沙里夫卡</tspan></text>
  <text id="text6245" systemLanguage="zh_HANT"><tspan id="tspan6243">基夫沙里夫卡</tspan></text>
  <text id="text6249" systemLanguage="zh_HANT"><tspan id="tspan6247">基夫沙里夫卡</tspan></text>
     
  <text systemLanguage="en" id="trsvg1842"><tspan id="trsvg1218">Kivsharivka</tspan></text>
  <text id="text3001-it" systemLanguage="it"><tspan id="trsvg142-it">Kovšarovka</tspan></text>
  <text id="text3001-fr" systemLanguage="fr"><tspan id="trsvg142-fr">Kivcharivka</tspan></text>
  <text id="text3001-el" systemLanguage="el"><tspan id="trsvg142-el">Κιβσαρίφσκα</tspan></text>
  <text id="text3001-ru" systemLanguage="ru"><tspan id="trsvg142-ru">Ковшаровка</tspan></text>
  <text id="text3001-uk" systemLanguage="uk"><tspan id="trsvg142-uk">Ківшарівка</tspan></text>
  <text id="text3001-ka" systemLanguage="ka"><tspan id="trsvg142-ka">კივშარივკა</tspan></text>
  <text id="text3001-lt" systemLanguage="lt"><tspan id="trsvg142-lt">Kivšarivka</tspan></text>
  <text id="text3001-ca" systemLanguage="ca"><tspan id="trsvg142-ca">Kivxàrivka</tspan></text>
  <text id="text3001"><tspan id="trsvg142">Kivsharivka</tspan></text>
</switch>

Notice that the clauses also have identical id attributes. That's a no-no. After several SVG Translate edits, the the file was pulled into Inkscape. Perhaps Inkscape noticed the error and generated new id attributes (e.g. text6217), and then subsequent invocations of SVG Translate introduced more duplicate clauses and duplicate identifiers.

The file

Has the switch

<switch transform="translate(20, 90)">
      <text systemLanguage="en"><tspan>en</tspan></text>
      <text systemLanguage="zh_Hans"><tspan>zh_Hans</tspan></text>
      <text systemLanguage="zh-Hant"><tspan>zh-Hant</tspan></text>
      <text>Default language</text>
</switch>

Invoking SVG Translate to add a gan translation changed it to

<switch transform="translate(20, 90)">
      <text id="trsvg19" systemLanguage="zh_HANT"><tspan id="trsvg8">zh-Hant</tspan></text>
      <text id="trsvg18" systemLanguage="zh_HANS"><tspan id="trsvg7">zh_Hans</tspan></text>
      <text systemLanguage="zh-Hant" id="trsvg19"><tspan id="trsvg8">zh-Hant</tspan></text>
      <text systemLanguage="zh_Hans" id="trsvg18"><tspan id="trsvg7">zh_Hans</tspan></text>
      <text systemLanguage="en" id="trsvg17"><tspan id="trsvg6">en</tspan></text>
      <text id="trsvg20-gan" systemLanguage="gan"><tspan id="trsvg9-gan">3 which version of Chinese?</tspan></text>
      <text id="trsvg20"><tspan id="trsvg9">Default language</tspan></text>
</switch>

Notice that two Chinese text elements changed into four text elements.
Notice there are duplicated ids trsvg8, trsvg9, trsvg18, and trsvg19.
Notice there are equivalent systemLanguages: zh_Hant and zh_HANT; IETF langtags do not distinguish case.

The langtags are being mishandled.

Consider reading a switch that has text elements with langtags de, FR, and zh_HANT.

When the file is parsed, the langtags are canonized in SVGFile.php (https://github.com/wikimedia/svgtranslate/blob/913c447750ef16a023daed3681b3f57f39af29f1/src/Model/Svg/SvgFile.php) at line 431:

$langCode = str_replace('_', '-', strtolower($lang));

So the langtags become de, fr, and zh-hant.

Later, the SVG DOM is updated in switchToTranslationSet at line 555. $language is the lowercased langtag. The unattached text elements are generated and their systemLanguage attributes are set to self::langCodeToOs($language) on line 580. So these unattached SVG elements will have the langtags de, fr, and zh_HANT.

There is an XPath query built at 605 to 609. It looks in the DOM's switch to see if there are existing text elements with text[@systemLanguage='$language']. Notice that it is looking for $language (which is all lowercase). Consequently, it will find the original de element in the DOM and update it, but fr is not FR and zh-hant is not zh_HANT, so the query will not find those elements in the DOM. Consequently, it will incorrectly insert the two unattached text clauses into the switch element DOM.

The switch will now have the langtags de, FR, zh_Hant, fr and zh_HANT.

Running SVG Translate again will compound the problem. The old FR langtag will (mis)match to the new fr element and add nothing, but the two zh_HANT langtags will add more zh_HANT clauses (the hyphenated $language zh-hant will never match an underscored zh_HANT).

We have an exponential bug, and it affects File:2022 Russian invasion of Ukraine.svg, a file that is getting several uses of SVG Translate.

Notice also that if the XPath expression returns more than 1 result, then nothing is done. It should at least log an error.

SVG Translate should not upload files with underscore langtags; they are not legal IETF langtags. langCodeToOs should not be used. Yes, I know it does a librsvg workaround, but that workaround does not work on Commons. Update librsvg and do not pass langtags in $LANG. T261192

A quick fix for SVG Translate includes lowercasing all systemLanguage attributes and replacing underscores when it first processes the DOM.

Instead of

text[@systemLanguage='$language']

one can apply the same operation to the attribute comparison (IETF langtags are essentially ASCII, so lowercasing can be done with translate()):

text[translate(@systemLanguage, "ABCDEFGHIJKLMNOPQRSTUVWXYZ_", "abcdefghijklmnopqrstuvwxyz-")='$language']

Just helped out another user with a failed Chinese translation:

https://commons.wikimedia.org/wiki/Commons_talk:SVG_Translate_tool#Supporting_Chinese_characters?

He finally uploaded the translations, but they do not display in the language dropdown until the langtags have the underscores removed
https://commons.wikimedia.org/wiki/File:Venn_Diagram_of_Numbers.svg

JMcLeod_WMF set the point value for this task to 5.Jun 22 2022, 11:10 AM

SVG Translate should not output files with underscores.

SVG Translate has already output many SVG files with inappropriate underscores. Those files should be repaired. A robot might replace the underscores with hyphens. A hack might have SVG Translate replace underscores with hyphens on any file it processes. An underscore is not a legal constituent of an IETF langtag.

Not sure if it's worth looking into more, but the --stylesheet option might let us add a font-face (something like FreeSans )?

@font-face {
    font-family: FreeSans;
    src: url('FreeSans.ttf') format('truetype');
    font-weight: normal;
    font-style: normal;
}

Edit: the installed version of rsvg-convert doesn't have a --stylesheet option..

Not sure if it's worth looking into more, but the --stylesheet option might let us add a font-face (something like FreeSans )?

A better method would just add the fonts to the thumbnailer's operating system.

Moreover, SVG files should be written to display on many systems -- not just Commons. That means the fonts should not rely on specific fonts but rather have generic fallbacks.

A stylesheet is normally used to add consistent formatting such as selecting a default font or setting the color of an a element. I do not think there is a reason for Commons to adopt its own default style. Commons should just use the agent's style. In addition, when the SVG file is displayed on another agent (such as a browser), then the stylesheet will not be added.

There are issues about adding fonts. For example, T221453.

The font list:
https://noc.wikimedia.org/conf/fc-list

There are also issues about maintaining that list. For example, T280718 (includes links to several font issues).

I've made a patch that normalizes the language codes when read and when written: https://github.com/wikimedia/svgtranslate/pull/604

@Grlx thanks for all your hard work in looking into the details of this! It's really great.

The above patch opts for the simpler approach of strtolower(str_replace('_', '-', $lang)), but do you think there will be any problems with that? It does mean that it's not possible to round-trip an unchanged SVG file, but I think that's probably okay.

(The issue of what to do when more than one matching tag is found I'll sort out in a separate patch.)

@Samwilson Thanks for working on this problem. It affects many users.

Removing the underscores is a reasonable solution. Forcing all systemLanguage langtags to lowercase does not affect basic SVG semantics. It would affect extremely rare case-sensitive CSS attribute selectors (e.g., text[@systemLanguage="DE"]), but librsvg does not handle attribute selectors, and the selector should have been case insensitive (e.g., text[@systemLanguage="DE" i]). I do not see that as a problem.

For matching tags, I'm OK with using any one tag and throwing the others away. It may lose some information, but it removes ambiguity.

I've also used a quick and dirty test. If text elements have matching systemLanguage attributes, then test if text1.textContent == text2.textContent. If equal, then throw one away. If not equal, then give up. The test ignores attributes, but makes sure the text is the same. It works often, but it can be confused when xml:space is not preserve. A slightly better test would .trim() and replace multiple space characters with a single space before testing for equality, but I do not think it is worth it. The .isEqualNode() method would be more rigorous, but it will crash and burn on id attributes.

PR merged, and release cut (1.1.15)

nb. I'm not sure how this is deployed to toolforge (just a git pull?) — so this is currently not deployed :)
nbnb. either someone deployed it, or it does it automagically — either way, this change is now live :)

I uploaded a gan translation for this file using svgtranslate (I hope that is OK).

I notice that when I try to "Render this image in" zh-hans (i.e. https://commons.wikimedia.org/w/index.php?lang=zh-hans&title=File%3ASVG_Switch_Test.svg) it shows me the translation for zh-hant. @Samwilson @Glrx?

I have downloaded the SVG from commons and it looks ok to me:

<switch transform="translate(20, 90)">
  <text systemLanguage="zh-hant" id="trsvg19"><tspan id="trsvg8">zh-Hant</tspan></text>
  <text systemLanguage="zh-hans" id="trsvg18"><tspan id="trsvg7">zh_Hans</tspan></text>
  <text systemLanguage="en" id="trsvg17"><tspan id="trsvg6">en</tspan></text>
  <text id="trsvg20-gan" systemLanguage="gan"><tspan id="trsvg9-gan">gan</tspan></text>
  <text id="trsvg20"><tspan id="trsvg9">Default language</tspan></text>
</switch>

MediaWiki uses an old version of librsvg that only matches the first subtag (the zh in zh-hans). That is why zh-hans can "match" zh-hant and produce unexpected results. It is a well-known bug. T154237. Newer versions of librsvg have a different matching algorithm with different problems.

My impression is that SVG Translate started using non-compliant langtags with underscores (e.g., zh_Hant) to work around the librsvg problem. The trick worked on Commons until MW started rejecting non-compliant tags. The trick never worked in other SVG agents.

For this issue, the goal is producing compliant SVG. That SVG will display correctly in most browsers. The problems with MediaWiki are separable.

@Samwilson a couple of things I am seeing, not sure if they are related to this work.

https://svgtranslate.toolforge.org/File:Research_design_and_evidence_-_Capho.svg does not pick up any translations (including English) in the right "to" dropdown. Notice it does pick them up on the left "from" dropdown:

zh-hans_not_picked_up.png (574×1 px, 29 KB)

https://svgtranslate.toolforge.org/File:SVG_Switch_Test.svg has display issues when you select zh-hans in the left "from" dropdown:

zh-hans_from.png (568×982 px, 19 KB)

https://commons.wikimedia.beta.wmflabs.org/wiki/File:Research_design_and_evidence_-_Capho.svg was previously valid SVG (according to w3c validator). I uploaded a translation and now it is invalid (mostly for duplicate IDs).

@dom_walden Thanks for checking.

Initial file has systemLanguage="zh-Hans" elements with actual Chinese. The default switch clauses have id attributes that pattern match trsvg[0-9]+.

Translated file has systemLanguage="zh-Hans"elements with actual Chinese content and systemLanguage="zh-hans" elements with prefixed English content (zh-hansXXXX).

In the translated file, it looks like the actual Chinese did not get its systemLanguage attribute canonized (line 293?), SVG Translate's id generator just appended -zh-hans to the default clause id (generating duplicate id values for each new translation), and the merging of translations (line 610?) did an insertion rather than a replacement because zh-Hans is not equal to zh-hans.

Failed canonization would explain the problem, but I'm not seeing it.

The getElementsByTagName() at line 280 is a live list. The operations at 298 could affect the order of the list. That should not be a problem for this SVG file because all text elements have a switch parent, so the operations at 298 should not occur.

I would not expect changing systemLanguage would affect the order of the live list. Furthermore, none of the systemLanguage="zh-Hans" attributes were canonized. If changing the attribute affected the order, at least one change would get through.

The SVG file has both unprefixed (e.g., switch) and prefixed (e.g., svg:switch) elements. That does not affect the results: both have duplicate id attributes.

I do not know PHP, but is the RegEx pattern starting at line 285 correct?

// Text strings like $1, $2 will cause problems later because
// self::replaceIndicesRecursive() will try to replace them
// with (non-existent) child nodes.
if (preg_match('/$[0-9]/', $text->textContent)) {
    $this->logFileProblem('File {file} has text with $-numbers');
    return false;
}

I'm thinking the $ should be escaped (that is '/\$[0-9]/').

Initial file has systemLanguage="zh-Hant" elements with actual Chinese. The default switch clauses have id attributes that pattern match trsvg[0-9]+.

Translated file has systemLanguage="zh-Hans"elements with actual Chinese content and systemLanguage="zh-hans" elements with prefixed English content (zh-hansXXXX).

In the translated file, it looks like the actual Chinese did not get its systemLanguage attribute canonized (line 293?), SVG Translate's id generator just appended -zh-hans to the default clause id (generating duplicate id values for each new translation), and the merging of translations (line 610?) did an insertion rather than a replacement because zh-Hans is not equal to zh-hans.

Failed canonization would explain the problem, but I'm not seeing it.

I think it is not normalising the language codes when it is analysing the SVG file (SvgFile:analyse() line 398), causing it to upload a duplicate zh-hans translation.

This might also explain why it is not picking up the translations (see my comment T271000#8130650).

I do not know PHP, but is the RegEx pattern starting at line 205 correct?
...
I'm thinking the $ should be escaped (that is '/\$[0-9]/').

I think you are right:

$ php -a
php > echo preg_match('/\$[0-9]/', "$1");
1
php > echo preg_match('/$[0-9]/', "$1");
0

The header comment for analyse() at line 394 suggests makeTranslationReady() has already been performed. It looks to me like that is called right after loading the file, and it will lowercase the langtags before analyse() is invoked.

Can getInFileTranslations() or getSavedLanguages() or getFilteredTextNodes() (which call analyze()) be invoked before the constructor is finished?

I would expect a subsequent rune of SVG Translate to trip the test at line 360 (finding zh-Hans and zh-hans are duplicate language).

Notes.
in src/Model/Svg/SvgFile.php
__construct()

  • creates a document
  • loads the document
  • creates XPath
  • calls _makeTranslationReady().

makeTranslationReady()

  • checks that at least one text element is present
  • looks for problematic CSS
  • looks for tref element
  • grabs tspan and text elements
  • tests for sensible tspan elements
  • tests for sensible text elements; wraps #text node with a tspan
    • line 236 puts the text node in translatableNodes[].
  • check id attributes
  • rebuild translatableNodes at line 260...
  • assign id attributes
  • scan text elements at line 280
    • throw out $1-style text content
    • normalize systemLanguage (this should have changed zh-Hans to zh-hans; how does zh-Hans survive?)
    • make sure there is a parent switch
    • dangerous style hoist at line 309
    • test for unexpected children at line 315
  • scan switch elements at line 309
    • make systemLanguage single valued attribute. (The attribute has already been lowercased, so $realLang is lowercase.)

Thank you @Glrx for your detailed analysis here! I'm working through your notes, and sorting out the issues.

I do not know PHP, but is the RegEx pattern starting at line 285 correct?

This is now fixed, in this patch: https://github.com/wikimedia/svgtranslate/pull/620

The SVG file has both unprefixed (e.g., switch) and prefixed (e.g., svg:switch) elements. That does not affect the results: both have duplicate id attributes.

It does look like this might be the root problem here. It only matters for tspans, as far as I can see, but when tspans have an svg namespace their parent text element doesn't get the systemLanguage normalized. Here's a test: https://github.com/wikimedia/svgtranslate/pull/622/files#diff-4b20740e1030e8aa5f7744f8b04db26f4204d9a3224a2b8f411be83cf1c85502R572-R581

@Glrx If you have any test cases that you'd like me to follow up on, that'd be great. It's sometimes hard to narrow down what in a large SVG file is causing the issues! But we can add lots of small tests to the automated test suite and so guard against regressions, and fix these bugs.

The code at 217 concerns wrapping #text nodes in tspan elements or giving up. In the problem file, the text is already wrapped. An svg:tspan element is not a #text node, so it should return false and give up on translating.

The systemLanguage is canonized at line 293. Except for the live list issue, I do not see how any text element escapes canonization. (I am assuming continue only continues the innermost for/forEach. Also, the text elements in the problem did not need to be wrapped with a switch, so I do not see the live list changing.)

Looking at https://commons.wikimedia.beta.wmflabs.org/wiki/File:Research_design_and_evidence_-_Capho.svg

Assuming the input SVG file for SVG Translate is the previous version

The initial file has actual Chinese with systemLanguage="zh-Hans" and default text. The first element to be translated is

Input file
<switch>
  <text transform="matrix(0.9962 0 0 1 444.9482 237.4736)" class="st8 st9" id="trsvg18-zh-hans" systemLanguage="zh-Hans">
    <tspan id="trsvg1-zh-hans">系统</tspan>
  </text>
  <text transform="matrix(0.9962 0 0 1 444.9482 237.4736)" class="st8 st9" id="trsvg18">
    <tspan id="trsvg1">Systematic</tspan>
  </text>
</switch>

The translated file is

SVG Translate comment states "File uploaded using svgtranslate tool (http://svgtranslate-test.toolforge.org/). Added translation for zh-hans."
The file has apparently added a translation for zh-hans rather than canonizing and overwriting the existing zh-Hans translation.

Input file translated
<switch>
  <text transform="matrix(0.9962 0 0 1 444.9482 237.4736)" class="st8 st9" id="trsvg18-zh-hans" systemLanguage="zh-hans">
    <tspan id="trsvg1-zh-hans">zh-hansSystematic</tspan>
  </text>
  <text transform="matrix(0.9962 0 0 1 444.9482 237.4736)" class="st8 st9" id="trsvg18-zh-hans" systemLanguage="zh-Hans">
    <tspan id="trsvg1-zh-hans">系统</tspan>
  </text>
  <text transform="matrix(0.9962 0 0 1 444.9482 237.4736)" class="st8 st9" id="trsvg18">
    <tspan id="trsvg1">Systematic</tspan>
  </text>
</switch>

There are duplicated identifiers in the result; that confuses things.
The text element with actual Chinese has systemLanguage="zh-Hans".
The text element with phony Chinese has systemLanguage="zh-hans".

When user selects target language, is that langtag canonized?

Seeing if this test (or a simpler one) can be repeated would be good.
Thanks for adding the unit tests.

Uploaded file to Commons:
https://commons.wikimedia.org/wiki/File:SVG_Translate_test_namespace.svg

Expect it to fail qualified name tests because qualified names should be "foo:text".

Content is

SVG_Translate_test_namespace.svg
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:foo="http://www.w3.org/2000/svg"
     viewBox="0 0 800 100"
     version="1.1">
  <title>SVG Translate test with non-svg prefix</title>
  <desc>
This file tests for an unusual svg prefix.
  </desc>
  
  <rect width="100%" height="100%" fill="pink"/>

  <foo:switch font-size="40" transform="translate(20,70)">
    <foo:text id="trsvg18-zh-hans" systemLanguage="zh-Hans"><foo:tspan id="trsvg1-zh-hans">foo:zh-Hans: simplified Chinese</foo:tspan></foo:text>
    <foo:text id="trsvg18"><foo:tspan id="trsvg1">foo:default: English</foo:tspan></foo:text>
  </foo:switch>

</svg>

On file page, clicked the SVG Translate tool link provided by Template:Translate

Did not click any language choices (there is something screwy happening with defaults). I had tried a single line generic SVG file earlier, but it worked without any trouble. I did select languages in those tests.

SVG Translate provided the single source phrase "foo:default: English".
IIRC, SVG Translate was previously setup for default to zh-Hans for translating a different file.
IIRC, the target translation box was blank; it did not have the zh-Hans text.

Typed "what happened?" as the translation.

Clicked download and got this result:

downloaded file
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:foo="http://www.w3.org/2000/svg" viewBox="0 0 800 100" version="1.1">
  <title>SVG Translate test with non-svg prefix</title>
  <desc>
This file tests for an unusual svg prefix.
  </desc>
  
  <rect width="100%" height="100%" fill="pink"/>

  <foo:switch font-size="40" transform="translate(20,70)"><foo:text id="trsvg18-zh-hans" systemLanguage="zh-hans"><foo:tspan id="trsvg1-zh-hans">what happened?</foo:tspan></foo:text><foo:text id="trsvg18-zh-hans" systemLanguage="zh-Hans"><foo:tspan id="trsvg1-zh-hans">foo:zh-Hans: simplified Chinese</foo:tspan></foo:text>
    
    
  <foo:text id="trsvg18"><foo:tspan id="trsvg1">foo:default: English</foo:tspan></foo:text></foo:switch>

</svg>

There is a <code>zh-Hans</code> text element with original text and a <code>zh-hans</code> text element with translated text. They have duplicate identifiers.

I've been confused.

I tried processing https://commons.wikimedia.org/wiki/File::SVG_Switch_Test.svg

I expected it to refuse to translate the file because the file has something like

SVG Switch Test.svg
<switch>
  <title>Tooltip</title>
  ...
</switch>

I expected makeTranslationReady() in https://github.com/wikimedia/svgtranslate/blob/master/src/Model/Svg/SvgFile.php to examine every switch at line 329, discover an unexpected title element at line 349, and return false:

SvgFile.php
if ('text' !== $sibling->nodeName && 'svg:text' !== $sibling->nodeName) {
    $this->logFileProblem('Encountered unexpected element {elementName} in file {file}',
        ['elementName' => $sibling->nodeName]
    );
    return false;
}

I expected that returning false would abort the translation. Instead the UI shows and makes me believe I can do translations.

The comment for the method reads

SvgFile.php
/**
 * Makes $this->document ready for translation by inserting <switch> tags where they need to be, etc.
 * Also works as a check on the compatibility of the file since it will return false if it fails.
 *
 * @todo Find a way of making isTranslationReady a proper check
 * @todo add interlanguage consistency check
 * @return bool False on failure, DOMDocument on success
 */
protected function makeTranslationReady(): bool
{

The code does not follow its apparent contract; it does not return a DOMDocument. The constructor has already read the document and squirreled it away in $this->$document.

I suspect the method returns false but the rest of the code just forges ahead with the translation instead of checking whether the file is good.

Maybe the method or the constructor should null $this->$document if there is a problem. In any event, the overall logic needs to be checked.

Should I open a new issue for this problem?

BTW, can a mere mortal like me can see the log file?

Just guessing here... I do not know PHP or phpdoc.

Looks like makeTranslationReady() should throw exceptions (like it does with throw new NestedTspanException()) rather than return false. The method should advertise its @throw objects at line 134. Alternatively, __construct() can see the false and throw an exception. It might be something generic such as ComplexStructureException(). The constructor should advertise additional @throw objects at line 78 (it does not advertise the tspan exception).

The exception(s) can be caught at https://github.com/wikimedia/svgtranslate/blob/master/src/Controller/TranslateController.php line 67:

TranslateController.php
// Fetch the SVG from Commons.
try {
    $this->assertFileType($normalizedFilename);
    $path = $cache->getPath($filename);
    $svgFile = $this->svgFactory->create($path);
} catch (ImageNotFoundException $exception) {
    return $this->showError('not-found', $normalizedFilename);
} catch (InvalidFormatException $exception) {
    return $this->showError('invalid-format', $normalizedFilename);
} catch (SvgLoadException $exception) {
    return $this->showError('invalid-svg', $normalizedFilename);
} catch (RequestException $exception) {
    return $this->showError('network-error', $normalizedFilename);
} catch (NestedTspanException $exception) {
    $id = $exception->getTspanId();
    $reason = $id
        ? $intuition->msg('nested-tspans-with-id', [ 'variables' => [ "<code>#$id</code>" ] ] )
        : $intuition->msg( 'nested-tspans-without-id' );
    return $this->showError('unsupported-svg', $normalizedFilename, [
        'msg_params' => [ $reason, '<em>' . $exception->getTextContent() . '</em>' ],
    ]);
}

https://github.com/wikimedia/svgtranslate/blob/master/src/Service/SvgFileFactory.php line 28 already @throws more than just a load error.

SvgFileFactory.php
/**
 * Creates an instance of SvgFile for the given path
 *
 * @param string $path
 * @return SvgFile
 * @throws \App\Exception\SvgLoadException
 */
public function create(string $path): SvgFile
{
    return new SvgFile($path, $this->logger);
}
  • I've created T316310 to start cleaning up some of the return type errors (and others).
  • It seems like non-standard prefixes are likely to be rare, so I've split that out to T316741 so it can be handled separately. I imagine most SVG files have either no namespace or use svg; is that correct do you think?

The file has apparently added a translation for zh-hans rather than canonizing and overwriting the existing zh-Hans translation.

I've been trying to reproduce this. I'm testing that this:

<svg><switch>
<text id="trsvg18-zh-hans" systemLanguage="zh-Hans"><tspan id="trsvg1-zh-hans">系统</tspan></text>
<text id="trsvg18"><tspan id="trsvg1">Systematic</tspan></text>
</switch></svg>

…after having this translation added: zh_Hans['trsvg1' => 'newtext']

…results in this:

<svg xmlns="http://www.w3.org/2000/svg"><switch>
<text id="trsvg18-zh-hans" systemLanguage="zh-hans"><tspan id="trsvg1-zh-hans">newtext</tspan></text>
<text id="trsvg18"><tspan id="trsvg1">Systematic</tspan></text>
</switch></svg>

And it does. No extra <text> element is added with an un-normalized systemLanguage.


I'll keep working through your notes above. Thanks for being so detailed! It's really useful.

Should I open a new issue for this problem?

Yes, please do. Let's keep this one for just the language code normalization.

BTW, can a mere mortal like me can see the log file?

Not for the production installation at the moment, no. It might help for svgtranslate to make the log available for a given translation attempt. Although, I guess if we actually handle each type of error better then perhaps that's not required (i.e. if we're logging something then we're already thinking the user needs to know). But yes, if you think this would be a good feature, please do create a task for it.

My suspicion is the uncanonized langtags appear when makeTranslationReady() discovers the SVG is too complex to handle, returns false before canonizing zh-Hans, the return value is ignored, and SVG Translate forges ahead believing all the langtags have been fixed.

I thought SVG Translate was handling some special cases correctly because it did not complain and produced good results. In reality, it was complaining (but I could not see the complaints in the log file), forging ahead, and other code just happened to do the right thing.

Everytime makeTranslationReady() returns false is when SVG Translate should have stopped processing. All of those situations are invisible to users.

That's a very good point you make about makeTranslationReady(). I've changed it to give errors in all those cases, and hopefully that'll help prevent a bunch of issues. Here's the PR: https://github.com/wikimedia/svgtranslate/pull/640

There will of course be other problems, but we can start adding checks for those (and maybe fixing other things, such as the possibly limited testing of CSS).

All PRs merged, and I've just released version 1.2.0 with all updates.

I think this addresses the primary issues of the above discussions. There are others of course, but let's handle them in separate tasks.

@Samwilson I see this ticket goes a while back. My apologies if I don't understand this ticket. Being new to this team and Foundation and I'm trying to get a better understanding of what I'm supposed to test on some of these tickets. Can you please give us a little more information on what to test? Like what steps did you take to get the error and what is the expected, post-patch? Thanks in advance!

@GMikesell-WMF sorry, the discussions here have ranged widely from the original description! I think we've captured the other issues in separate tasks though, so this task can be specifically about the language code formatting. I've updated the description here a bit. Mainly we want to check that a) files with existing (correct or incorrect) systemLanguage attributes are modified by SVG Translate to only have correct values for those attributes; and b) files without any language codes also end up with the correct ones after translating.

Testing notes

  1. Remember to test on https://svgtranslate-test.toolforge.org as that will upload files to beta commons rather than production
  2. I have uploaded some SVG files into beta commons which in theory should work on svgtranslate: https://commons.wikimedia.beta.wmflabs.org/wiki/Category:Translation_possible_-_SVG_(switch)
    1. In reality, some may return errors when you try to load them in svgtranslate
  3. SVGs are XML files. You can view the structure of the XML either in the browser (using the devtools > inspector) or in a text editor.
  4. Use svgtranslate to upload translations to the SVGs
    1. In particular, languages with language variants (i.e. those with hyphens like zh-hant, pt-br)
    2. Upload modifications to existing translations as well as creating new ones
      1. When you view a file on commons you may see the dropdown labelled Render this image in which will show all the languages in the SVG
    3. Try to find SVGs with invalid language codes for their systemLanguage attributes (e.g. zh_HANS)
      1. I have done so already for one image: https://commons.wikimedia.beta.wmflabs.org/wiki/File:20201231085805!Research_design_and_evidence_-_Capho.svg
  5. At a minimum, check that no invalid language codes have been saved to the SVG and all the language codes are valid (e.g. zh-hans not zh_HANS)
    1. You could also verify that the SVG is valid with this site https://validator.w3.org/
    2. You could compare before and after you upload a translation, paying particular attention to how it has modified any <switch> statements (you can see examples of switch statements above, e.g. T271000#8201384)
      1. When looking at a file on commons, it will have a file history (such as this) where you can download all the previous versions

I'm looking at tests/Model/Svg/SvgFileTest.php, and I see test cases such as the two starting at line 645

SvgFileTest.php
'CSS too complex' => [
    'svg' => '<svg><style>#foo { stroke:1px; } .bar { color:pink; }</style><text>Foo</text></svg>',
    'message' => 'structure-error-css-too-complex',
    'params' => [0 => ''],
],
'tref' => [
    'svg' => '<svg xmlns="http://www.w3.org/2000/svg" version="1.0" xmlns:xlink="http://www.w3.org/1999/xlink">
        <defs><text id="tref-id">Lorem</text></defs>
        <text id="text"><tref xlink:href="#tref-id" /></text></svg>',
    'message' => 'structure-error-contains-tref',
    'params' => [0 => 'text'],
],

The first case does not have an xmlns: namespace declaration, but the second does.

All test cases should have an xmlns: namespace declaration.

The SvgFile constructor should throw an exception if the resulting document element is not svg in the SVG namespace.

PS duplicating the first test case above but replacing the "#" with "." should tickle the SVG Translate bug that wants at least one character after the last CSS rule.

Testing notes

Thanks @dom_walden!


The first case does not have an xmlns: namespace declaration, but the second does.

The tool already adds the missing namespace if there isn't one.

All test cases should have an xmlns: namespace declaration.

I think as the no-xmlns fix has existed for ages, we should just leave it in, and in that case it's okay for test cases to not provide a namespace.

The SvgFile constructor should throw an exception if the resulting document element is not svg in the SVG namespace.

Good point. I've raised this: T319259: Check that document element is <svg> and in the right namespace

PS duplicating the first test case above but replacing the "#" with "." should tickle the SVG Translate bug that wants at least one character after the last CSS rule.

Is that bug part of T221382?

Testing notes

Thanks @dom_walden!


The first case does not have an xmlns: namespace declaration, but the second does.

The tool already adds the missing namespace if there isn't one.

I doubt that adding the namespace declaration changes the SVG file's elements. The document has already been parsed, so all the elements have had their namespace property set. SVG Translate is using a poor namespace model, so it will treat elements in the default namespace the same way as it would treat elements in the SVG namespace. SVG Translate uses getElementsByTagName() rather than getElementsByTagNameNS(). That poor model is also why the code separately checks for "svg:text". It should not be doing that. T316741: Allow svg namespace prefixes other than 'svg'

I would expect the XML serializer to either discard the xmlns attribute or generate a prefix for the default namespace.

input.svg
<svg>
  <rect width="100" height="100" />
</svg>

Might produce

serialized.svg
<foo:svg xmlns="http://www.w3.org/2000/svg" xmlns:foo="">
  <rect xmlns="" width="100" height="100" />
</foo:svg>

All test cases should have an xmlns: namespace declaration.

I think as the no-xmlns fix has existed for ages, we should just leave it in, and in that case it's okay for test cases to not provide a namespace.

As explained above, I do not think the no-xmlns fix does the right thing. I do not believe there is an XML DOM method that changes an element's namespace. One must create a new element in the proper namespace and then copy or move the attributes and children.

The SvgFile constructor should throw an exception if the resulting document element is not svg in the SVG namespace.

Good point. I've raised this: T319259: Check that document element is <svg> and in the right namespace

I suspect this check will properly trigger on all the null namespace tests.

PS duplicating the first test case above but replacing the "#" with "." should tickle the SVG Translate bug that wants at least one character after the last CSS rule.

Is that bug part of T221382?

No. T221382: [BUG] Some CSS selectors break translation input is about troublesome selectors. T271595: SVG translate tool replaces all fields with "$1" (style element needs at least one trailing character) is about requiring a trailing character.

@Samwilson Dom and I validated that when modifying the text, attributes are modified by SVG Translate, code compare, and verify that the SVG is valid, all have successfully passed. Was there anything else you wanted us to test? Thanks!

Language_Code_PASS.png (852×3 px, 270 KB)

Modified_Different_Language_PASS.png (1×1 px, 222 KB)

Test link: https://commons.wikimedia.beta.wmflabs.org/w/index.php?lang=es&title=File%3ALanguage_modified_zh-hant.svg
Environment: Beta
OS: Chrome

Testing notes

  1. Remember to test on https://svgtranslate-test.toolforge.org as that will upload files to beta commons rather than production
  2. I have uploaded some SVG files into beta commons which in theory should work on svgtranslate: https://commons.wikimedia.beta.wmflabs.org/wiki/Category:Translation_possible_-_SVG_(switch)
    1. In reality, some may return errors when you try to load them in svgtranslate
  3. SVGs are XML files. You can view the structure of the XML either in the browser (using the devtools > inspector) or in a text editor.
  4. Use svgtranslate to upload translations to the SVGs
    1. In particular, languages with language variants (i.e. those with hyphens like zh-hant, pt-br)
    2. Upload modifications to existing translations as well as creating new ones
      1. When you view a file on commons you may see the dropdown labelled Render this image in which will show all the languages in the SVG
    3. Try to find SVGs with invalid language codes for their systemLanguage attributes (e.g. zh_HANS)
      1. I have done so already for one image: https://commons.wikimedia.beta.wmflabs.org/wiki/File:20201231085805!Research_design_and_evidence_-_Capho.svg
  5. At a minimum, check that no invalid language codes have been saved to the SVG and all the language codes are valid (e.g. zh-hans not zh_HANS)
    1. You could also verify that the SVG is valid with this site https://validator.w3.org/
    2. You could compare before and after you upload a translation, paying particular attention to how it has modified any <switch> statements (you can see examples of switch statements above, e.g. T271000#8201384)
      1. When looking at a file on commons, it will have a file history (such as this) where you can download all the previous versions

Dom and I validated that when modifying the text, attributes are modified by SVG Translate, code compare, and verify that the SVG is valid, all have successfully passed. Was there anything else you wanted us to test? Thanks!

Terrific, thanks! No, I think that's probably all. Other stuff is tracked in other tasks now, and I think this can move onwards.

@Glrx, for those other matters, let's move discussion to the relevant tasks.