
[Spike 4 hours] Investigate the work involved in defaulting SVGs to show wiki language if available
Closed, ResolvedPublic

Description

Problem: Switch-translated SVGs can hold multiple translations in the same file. However when such files are used on other wikis, they need to have a lang=<lang-code> parameter for the image to render in the correct language.
Here's an example image: https://commons.wikimedia.org/wiki/File:Neural_crest.svg which contains translations for ru, hr and de, as is apparent from the "Render this image in" dropdown.


This dropdown was added in this patch in 2014.
Here's the image usage in ruwiki, hrwiki, dewiki. If you look at the source, they contain lang=ru/lang=hr/lang=de after the file name (like [[File:abc.svg|lang=de]]).
This is problematic because:

  • The file will not be updated on articles even if a translation is added for it on Commons. This is the case for a lot of files out there.
  • It is an unfair work burden on non-enwiki contributors.

Proposed solution: MediaWiki should be smart enough to automatically show the file in the wiki language, if that translation exists. If not, it should fall back to the default.

General direction for this investigation:

  • Find out what loads and creates the SVG links
  • Find out what it means to conditionally change the SVG links based on content language
  • Come up with an implementation plan, or, if there are multiple options, describe all the relevant ones

See also:

Technical notes

  • MediaWiki stores the available translations for each SVG in the img_metadata field of the image table and this information is available via the getAvailableLanguages() function in MediaWiki's SvgHandler class.
  • The thumbnails displayed on each wiki are actually PNG thumbnails generated by librsvg. librsvg handles rendering the SVG into a PNG with the desired systemLanguage.
  • The solution of just generating thumbnails for every language was previously rejected as unnecessary cache fragmenting, as most languages won't actually have translations and thus would be identical to the English (default) thumbnail.
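To make the first note concrete, here is a minimal sketch of how the set of available translations could be read out of a switch-translated SVG. It is not MediaWiki's SvgHandler or SVGMetadataExtractor code; the function name available_languages and the sample file are made up for illustration, and the only thing assumed about the format is that translations are marked with the standard systemLanguage attribute.

```python
import xml.etree.ElementTree as ET


def available_languages(svg_text: str) -> set[str]:
    """Collect every language tag named in a systemLanguage attribute."""
    langs: set[str] = set()
    for element in ET.fromstring(svg_text).iter():
        value = element.get("systemLanguage")
        if value:
            # systemLanguage holds a comma-separated list of language tags.
            langs.update(tag.strip().lower() for tag in value.split(",") if tag.strip())
    return langs


# Hypothetical stand-in for a switch-translated file such as Neural_crest.svg.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <switch>
    <text systemLanguage="ru">label (ru)</text>
    <text systemLanguage="hr">label (hr)</text>
    <text systemLanguage="de">label (de)</text>
    <text>label (default)</text>
  </switch>
</svg>"""

print(sorted(available_languages(SAMPLE_SVG)))  # ['de', 'hr', 'ru']
```

The last text element has no systemLanguage attribute, so it is the switch default and does not count as a translation.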

Event Timeline

Niharika triaged this task as Medium priority.Aug 18 2018, 12:30 AM
Niharika created this task.

@Mooeypoo @MusikAnimal @Samwilson @aezell I want us to estimate this ticket in the next Estimation meeting. It's a technical investigation ticket. I've put down some ideas for what the investigation should look for but please feel free to add things I've missed.

Mooeypoo added a comment.EditedAug 18 2018, 5:07 AM

I played around a bit (language'd stuff always piques my interest) and have just a couple of quick comments on what I've noticed:

The actual filename seems to have a prefix of 'langxx-', and if the language isn't available, it defaults to the default SVG language (which seems to be English most times).

Example:

So it seems ResourceLoader is already somewhat ready for this, and the main issue is how to make the page request this in advance for each and every SVG (which is the part that *isn't* simple).

One issue I noticed, is that it doesn't seem to use our language fallback list (which makes sense, but we need to take that into account.)
For example, "frr" (North Frisian) falls back to "de" (German), and yet the image here is in English, even though German is available: https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Neural_crest.svg/langfrr-675px-Neural_crest.svg.png

As an aside, for reference, we can refer to the fallback json file used for maps: https://github.com/kartotherian/babel/blob/master/lib/fallbacks.json

We can't really use this file directly (some languages were edited because of community request specifically for map data), but it can give us a sense of what fallbacks exist, and we can always recreate the original fallback file from the original script.

That said, if this was a dynamic file displaying in some extension, it would likely be fairly straightforward. In the case of SVGs though, we may have to look at ResourceLoader and how it loads the SVG images (and transforms them to PNGs) and have a method to attach the language there.

So from my very quick observations, I'd say the investigation should try and establish also

  • Whether we need to touch ResourceLoader or if there's any other way to do this
  • What it would mean to use the fallback system, but I would heavily timebox this; it can range from easy to create-our-own, in which case we might want to consider whether we want the feature at all.
  • Make sure non languaged SVG files are not affected by whatever language-specific methods we use.

Anyways, these are only surface observations, to try and inform the investigation.

We should consider time-boxing this but it will probably not require a whole lot of time for us to see what direction we should go with, and then try to get a sense of how much work that would be.

kaldari updated the task description. (Show Details)Aug 18 2018, 5:18 AM

My understanding of the thumbnailing process, which is limited, is that it doesn't involve ResourceLoader. The parser and probably some other pieces translate the [[File:abc.svg|lang=de]] in the local WikiText into an HTML image tag for a specific PNG thumbnail, for example, https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/abc.svg/langde-250px-abc.svg.png. If that thumbnail doesn't yet exist, the librsvg script on the scaling servers generates it with the requested language (if any was requested) at the desired resolution.

I could be wrong, but it doesn't seem like fixing this should be terribly hard. We basically just need to figure out exactly where the thumbnail URL is generated and have it check which translations are available using SvgHandler::getAvailableLanguages(). If the local wiki's language is available, construct the URL with the langxx- part. Otherwise, leave that part off and just use the default thumbnail (to avoid cache fragmentation). And as Moriel mentioned, we could also have it use a language fallback chain in the process.
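A minimal sketch of that idea, under stated assumptions: the function thumb_url and its parameters are hypothetical, the hashed directory layout (first one and first two hex digits of the MD5 of the file name) is my understanding of MediaWiki's default upload layout rather than something stated in this task, and real URLs also percent-encode the file name, which is skipped here.

```python
import hashlib

UPLOAD_BASE = "https://upload.wikimedia.org/wikipedia/commons/thumb"


def thumb_url(file_name: str, width: int, wiki_lang: str, available_langs: set[str]) -> str:
    """Build a PNG thumbnail URL, adding langxx- only when that translation exists."""
    # Assumption: hashed upload directories derive from the MD5 of the file name.
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    # Leave the prefix off for untranslated languages to avoid cache fragmentation.
    lang_prefix = f"lang{wiki_lang}-" if wiki_lang in available_langs else ""
    return (
        f"{UPLOAD_BASE}/{digest[0]}/{digest[:2]}/{file_name}/"
        f"{lang_prefix}{width}px-{file_name}.png"
    )


translations = {"ru", "hr", "de"}
print(thumb_url("Neural_crest.svg", 250, "de", translations))  # gets the langde- prefix
print(thumb_url("Neural_crest.svg", 250, "he", translations))  # plain default thumbnail
```

The second call shares its URL, and therefore its cached PNG, with every other wiki whose language the file does not cover.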

My idea above is probably too simple. If it were actually that easy, I imagine it would have been implemented that way from the beginning. Maybe @Bawolff knows what dragons lurk here!

Mooeypoo updated the task description. (Show Details)Aug 21 2018, 10:23 PM
Niharika renamed this task from Investigate the work involved in defaulting SVGs to show wiki language if available to [Spike 4 hours] Investigate the work involved in defaulting SVGs to show wiki language if available.Aug 21 2018, 11:28 PM
Niharika moved this task from To be estimated/discussed to Estimated on the Community-Tech board.
Niharika removed Niharika as the assignee of this task.Aug 22 2018, 12:00 AM
MaxSem claimed this task.Sep 4 2018, 7:12 PM
MaxSem moved this task from Ready to In Development on the Community-Tech-Sprint board.
  • Create a way for Parser::makeImage to let the media handler know the page's content language, controlled by a feature flag.
    • Caveat: varying all SVGs for all languages would create a huge load on caches and scalers. SvgHandler should be aware of the languages actually present in the image and not create image links varied by language needlessly.
      • Caveat: this requires loading image metadata, need to make sure performance will not suffer.
      • Question: do we want to force the language if it has been requested by user explicitly, e.g. [[File:Foo.svg|lang=bar]]?

@Mooeypoo: I don't understand what ResourceLoader has to do with this, please clarify.

Glrx added a comment.Sep 5 2018, 9:20 PM

There are several issues.

I don't know MW software, but over time I've built the following impression.

The parser sees a [[File:...|lang=de|...]], so it builds the URLs Mooeypoo describes. The URL will be embedded in an img element to produce something like

<img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/1/1d/First_Ionization_Energy.svg/langde-220px-First_Ionization_Energy.svg.png" width="220" height="92" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/1d/First_Ionization_Energy.svg/langde-330px-First_Ionization_Energy.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/1d/First_Ionization_Energy.svg/langde-440px-First_Ionization_Energy.svg.png 2x" data-file-width="1200" data-file-height="500" />

The src URL has encoded the language information with "langde-".

//upload.wikimedia.org/wikipedia/commons/thumb/1/1d/First_Ionization_Energy.svg/langde-220px-First_Ionization_Energy.svg.png

When a user loads the page, her browser tries to load the image URL.

(That request goes to upload.wikimedia.org, but I believe that server will ignore Accept-Language. I doubt WMF wants to do it, but a language dispatch could happen here. The North Frisian user will probably accept languages "frr, de". If a language-neutral URL (".../300px-...") is received from that user's browser, then the server could check if the SVG supports frr or de and serve ".../langfrr-300px-..." or ".../langde-300px-...".)

So when the HTML is loaded into a browser, that language-specific image is requested. If not cached already, the parameters are extracted from the upload URL by Thumbor, and librsvg is called to build the PNG for the specified language.

Mooeypoo's "Hebrew (which doesn't exist, so the image falls back automatically to English)" requires a little elaboration. I think MW builds a "translated" to "he" PNG of the SVG; it just didn't do any translating. Most SVGs will display English when they do not match the requested langtag; "he" is never matched, so English comes out. (There are some switch-translated SVGs whose switch default clauses are French.) Building these pointless translations is the crux of the resource problem. Setting lang=xx will build a new PNG; it does not redirect to a default en version. Having the xx.WP always assume lang=xx will generate lots of identical copies and consume lots of resources (computation and space) doing it.
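A small worked count of that waste, as a sketch only: the helper render_counts and the list of requesting wikis are invented for illustration, and the translation set is the one from the Neural_crest.svg example.

```python
def render_counts(requested_langs: list[str], available_langs: set[str]) -> tuple[int, int]:
    """Return (PNGs built if every wiki forces lang=xx, distinct images among them)."""
    forced = len(set(requested_langs))
    # Every language without a translation renders the same default image.
    distinct = {lang if lang in available_langs else None for lang in requested_langs}
    return forced, len(distinct)


wikis = ["he", "ru", "hr", "de", "fr", "ja", "frr"]  # hypothetical requesting wikis
print(render_counts(wikis, {"ru", "hr", "de"}))  # (7, 4)
```

Here seven PNGs would be rendered and stored for one width, but only four differ: ru, hr, de, and the default shared by he, fr, ja and frr.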

I do not know, but I think the wiki parser uses makeParamString() to build the "220px" portion of the URL.

I think SvgHandler.php overrides ImageHandler.php's makeParamString() to include the language parameter ("langde-220px"):

https://doc.wikimedia.org/mediawiki-core/master/php/SvgHandler_8php_source.html#l00532

With no lang= parameter, no langXX- is constructed and that portion of the URL starts with the pixel width. With a $params['lang'] parameter, that parameter gets prepended to the width.

(The constructed URL also includes hyphenated languages such as zh-hans and zh-hant. There will be URLs with "langzh-hans-220px".)

The routine currently filters out |lang=en. The "en" filtering is dubious. The semantics of the |lang parameter is a request that the SVG be localized to that language. It makes sense to ask for a file localized to "en" or any other language. The filtering is done with the belief that the default language version of an SVG file will always be "en", but that is not true. Many SVG files have no English. MW does not have a clear idea of what the "default" language looks like. It gets complicated further by librsvg being told the language is "en" or librsvg defaulting the language to "en" from its environment variables.

makeParamString() canonicalizes the lang parameter to lowercase; that allows users to say |lang=DE without generating a duplicate cached PNG of lang=de.
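The behaviour described in the last few paragraphs can be paraphrased as follows. This is a Python sketch of my reading of it, not a port of the PHP source; the function name make_param_string and the dictionary keys are chosen for the example.

```python
def make_param_string(params: dict) -> str:
    """Build the "220px" or "langde-220px" portion of a thumbnail name."""
    width_part = f"{params['width']}px"
    # Lowercase the tag so |lang=DE and |lang=de share one cached PNG.
    lang = str(params.get("lang", "")).lower()
    # The "en" filter discussed above: English is treated as the default rendering.
    if not lang or lang == "en":
        return width_part
    return f"lang{lang}-{width_part}"


print(make_param_string({"width": 220}))                     # 220px
print(make_param_string({"width": 220, "lang": "DE"}))       # langde-220px
print(make_param_string({"width": 220, "lang": "zh-Hans"}))  # langzh-hans-220px
print(make_param_string({"width": 220, "lang": "en"}))       # 220px
```

The final call shows why the "en" filter is questionable: a file whose default text is not English can never be asked for its English translation this way.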

I believe the wiki parser already reads the SVG file, so there is little additional computation cost in having the wiki parser choose an appropriate language.

I believe the parser knows the SVG metadata because the Wikipedia editor only specifies a width (220px) on the file transclusion, but the wiki parser builds the img element AND knows enough about the file to compute its corresponding height (92px). That suggests the parser has access to at least the SVG's width and height / viewBox information and has possibly read the SVG file with the SVGMetadataExtractor. If the parser has already read the metadata, then the parser has access to kaldari's SvgHandler::getAvailableLanguages() and can make language selections cheaply.

(getAvailableLanguages is buggy. It was not properly fixed at Phab:xxxxxx.)

To use Mooeypoo's North Frisian example, the frr.WP should want to localize its SVG images to frr. If the editor did not specify |lang=, then the frr.WP should consult the available languages to see if "frr" is available. If it is, then build the params as "langfrr-220px". If "frr" is not available, then look for fallback "de" and use it if available. Ultimately, frr.WP might look for fallback "en", but it may not find that language (some SVG diagrams are numbers only and have no explicit language). If it finds nothing, then the frr.WP could build the params as "220px" (thus mimicking the current default SVG PNG request).

(The current WP semantics are if lang=de is specified, then the user wants that localization even if she is on the ru.WP.)

(The semantics of defaults need to be worked out. If an SVG root element has xml:lang="de" or lang="de", then what should happen on the de.WP? Possibly "langde-" can be dropped because the default language would be the same. This optimization is the filter "en" above done without being English-centric. Think about the case where a German graphic artist made a German language diagram but included systemLanguage="en" clauses. In that case, the en.WP does not want the "default" rendering of the SVG but rather the en rendering. There are some semantic choices to be made here. Currently both MW and librsvg use "en" when a language is not specified. There are language neutral SVG files (e.g., using numbers as labels). How is a Wikipedia editor supposed to specify he wants the language neutral version (which may be the SVG default display)?)

Bawolff added a comment.EditedSep 5 2018, 9:27 PM

Mostly: @kaldari is correct in T202181#4511685 . Note that the RL svg handling is a separate subsystem (which is a bit confusing) that is not used for user svgs.

Historical note: I believe when this was originally implemented, there was concern that it would confuse users to have images display differently on Commons than on other projects (consider vandalism potential), so it was opted not to auto-change the language, as the easiest path forward. Prior to being able to detect what languages the SVG had available, we treated every language as valid (and possibly still do, under the assumption that the code to detect languages sucks), so there was also concern about an explosion of thumbnails for every possible language (even unused ones), taking up too much disk space.

But largely speaking this was a political decision, not a technical one. Having the SVG default to the content language should be no big deal (provided that there is some code to make sure it only happens if the content language is one of the languages the SVG is translated into).

@Mooeypoo: I don't understand what ResourceLoader has to do with this, please clarify.

Yeah I was referring to the creation of a png we serve when we ask for an SVG file. This process (as @Bawolff mentioned, and @kaldari I think also said) should be in ResourceLoader. I was wondering if we can get into that process and request content language then.

This would mean that we get consistent behavior no matter where we ask for an SVG on wikis, if the SVG has language switch.

However -- I'm not sure what is involved in that, so the investigation should probably look into how hard that is and whether that's even an option. The comments we all left seemed to mostly be theoretical assumptions about what direction things could take. The next step (which we hope to do with this spike) is to see what our actual implementation options are, and whether there are key things that should guide us to choose one method over another, if there is more than one option.

Does that answer the question, @MaxSem ?

@Mooeypoo @MaxSem Could you create follow-up tickets after this investigation is done? I'd like to get a move on this as soon as we can.

MaxSem closed this task as Resolved.Oct 3 2018, 9:28 PM
MaxSem moved this task from Needs Review/Feedback to Q2 2018-19 on the Community-Tech-Sprint board.