
parser tags such as <ref>, <poem>, <timeline> etc. cannot be localized
Open, Medium, Public

Description

Before the Berlin 2011 hackathon, I published a call in several Hebrew Wikipedia-related forums for the most annoying RTL issues. This was the most frequent complaint:

Parser tags such as <ref>, <poem>, <timeline> etc. cannot be localized. This is not a terrible issue for left-to-right languages, but it is a serious one for RTL languages. For example, <ref> is very often used with URLs, and these get jumbled up. It's hard to write them in the first place, and it's even harder to correct them after they are written. Replacing <ref> with something like <הערה> would make adding references to RTL wikis a lot easier. The same is true for the other tags of this kind.

I talked to Victor Vasiliev and Tim Starling about it at the Berlin Hackathon 2011, and they said that it's generally doable.


Version: unspecified
Severity: normal

I'd like to refocus this issue only on <ref> and little else.

In actuality, it's not impossible to have these tags localized. It's already done in LabeledSectionTransclusion. However, I don't actually think that localizing all tags to all languages is important.

In the coming age of VisualEditor localization of tags is supposed to become entirely irrelevant, because ideally they should be used only internally and not typed by editors.

Until that age comes, however, people will do a lot of manual adding, removing and editing of tags in wiki syntax mode. For tags like <poem> and <timeline> it's not actually disastrous and nobody really complains about them (the content of <timeline> is a pain in RTL, but that's an entirely different issue).

For <ref>, however, it's a nightmare in RTL languages. What I am imagining at this point is a way to get <ref> localized to RTL languages (and probably not even all of them) using a mechanism that works with wiki syntax, VisualEditor, and Parsoid, and to be ready to get rid of it in the far future when direct wiki syntax editing becomes unimportant.

I committed an experimental patch for this:
https://gerrit.wikimedia.org/r/#/c/163467/

@Amire80 I see you had abandoned this back in April. Any chance "some other time" could be December 2016?! ;)

Yes, I want to revive it very soon, at least for <ref>.

This may also need some work to work with Parsoid.

Yes, it won't be a lot of work in Parsoid however. Some minimal code updates. The bigger thing to establish is how extensions specify their localized tags and how this information will be exposed in the API. Given that translations are likely going to be a one-time change (given code breakage risk), I don't think these should be part of system messages. These should be part of the extension config. While that requires wikis to go through devs for establishing translations, I don't think it is an unreasonable burden given the one-time nature of these changes.

Re: API, I propose we just extend the existing *.i18n.alias.php files, which are used for hard-coding special page aliases. I assume that a new variable added there would not automatically get translated by TranslateWiki (pinging @Nikerabbit) so all changes would go through code review, which we want.

You don't even need a lot of translations. This is a real problem only for RTL alphabets, where mixing Latin XML-like tags with RTL text is awful. A small number of translations to RTL languages is all that is needed. Translating <ref> to Hindi, Chinese, and Russian would be possible, but I would discourage actually doing it unless there is a good reason to do it.

i18n.alias and i18n.magic files are currently not up for translation in translatewiki.net. I would, however, encourage creating a new file, to keep it simple and obvious which extensions have tags to translate.

Tangential: In the long term, my wish is to also migrate these files to JSON format and make them available for translation in translatewiki.net. But because the values are not plain strings, this is not currently possible on the translatewiki.net side.

Tangential: In the long term, my wish is to also migrate these files to JSON format and make them available for translation in translatewiki.net.

Changing a tag translation or special page name would be a breaking change with respect to existing wiki content, so we shouldn't allow TW users to do it.

On multilingual wikis (commons, meta, wikimania, etc) you should probably be aware that multiple tag aliases may be in use. Could be exciting! Probably select which one to use based on the page language... which works for everything except the <translate> extension, which can have chunks of text in different languages on the same page...

I think actually using templates to work around the monolingual-ness of <ref> (as hewiki does) is not a terrible idea. The HTML-ish tags (like <b>, <br>) are always going to be English, so why not just establish a convention that you can use {{<b}} and {{>b}} to localize these, with whatever name you like for "b"? Then you can just write multiple templates on multilingual wikis if necessary. (Although that pushes the responsibility onto VE to know the proper language for the article it is editing, and use the proper templates.)

Note that LanguageConverter will also raise issues, since like <translate> it mixes content in multiple languages together on the same page. zhwiki can be expected to fight over whether <ref> gets localized in simplified or traditional characters, etc. A single localization isn't really going to work.

why not just establish a convention that you can use {{<b}} and {{>b}} to localize these, with whatever name you like for "b"? Then you can just write multiple templates on multilingual wikis if necessary. (Although that pushes the responsibility onto VE to know the proper language for the article it is editing, and use the proper templates.)

To clarify, I'm using one of many bikeshedable syntaxes for heredoc templates here (T114432: [RFC] Heredoc arguments for templates (aka "hygienic" or "long" arguments)), where Template:b (or :es:Template:negrita or whatever ) is just <b>{{{1}}}</b>.

Note that LanguageConverter will also raise issues, since like <translate> it mixes content in multiple languages together on the same page.

The only mixing of two languages at a time is filling missing translations with content in the source language (and those pages are not editable anyway). It's against best practice to use non-canonical names for tags, special pages, etc. in text that is going to be translated, or in its translations, precisely because it is not necessarily known whether any non-canonical names will be available where those translations are displayed.

Changing a tag translation or special page name would be a breaking change with respect to existing wiki content, so we shouldn't allow TW users to do it.

How easy it is to break BC is a different question from how easy it is to provide translations. Naturally, we would take appropriate measures to avoid breaking BC, as we have done so far.

This long-standing request has become a more serious issue, as VE doesn't support ref templates (which use {{#tag:ref|ref content}}). Handling them in VE is quite challenging (especially if the template contains more than {{#tag}} and possibly more features), so a native localized <ref> looks like the easiest solution for the RTL/LTR mix caused by <ref>.

During Wikimania 2017 I talked with @Esanders and @Mooeypoo, and they suggested first making sure that there is community interest in using a localized ref before engineering invests time in supporting it.
I opened a discussion at the hewiki WP:VP, and the community seems to have great interest in a localized version of the ref tag:
https://he.wikipedia.org/w/index.php?title=%D7%95%D7%99%D7%A7%D7%99%D7%A4%D7%93%D7%99%D7%94:%D7%9E%D7%96%D7%A0%D7%95%D7%9F&oldid=21366986#.D7.AA.D7.92.D7.99.D7.AA_.3C.D7.94.D7.A2.D7.A8.D7.94.3E

I can assure you, this and T15673 are among the most wanted features in Persian wikis as well.

It is a really bad idea to translate tag functions. Actually I believe it is a really bad idea to translate all such markup and programming constructs without a working translator for those constructs.

I have programmed in localized programming languages for several years, outside MediaWiki, and trying to reuse code really sucks big time.

If some community (like the Hebrew community) wants to shoot itself in the foot with an artillery cannon by reimplementing tag functions as templates, then let them do so, but do not tempt other communities to do the same. It is better to make clean, reusable code (and wikitext) that can be moved between projects.

It is a really bad idea to translate tag functions.
If some community (like the Hebrew community) wants to shoot itself in the foot with an artillery cannon by reimplementing tag functions as templates, then let them do so, but do not tempt other communities to do the same. It is better to make clean, reusable code (and wikitext) that can be moved between projects.

If you think that tag translating is a bad idea, and reimplementing as templates is a bad idea, and it is impossible to use the regular tags in rtl texts inline, how do you suggest to manage references usage at all?

@IKhitron

If you think that tag translating is a bad idea, and reimplementing as templates is a bad idea, and it is impossible to use the regular tags in rtl texts inline, how do you suggest to manage references usage at all?

So you think that

<שפת סימני עריכה לתמליל - על>
<רֹאשׁ>
<כותרת>בָּר</כותרת>
</רֹאשׁ>
<גוּף>
<עמ '>מזון בָּר</עמ '>
</גוּף>
</שפת סימני עריכה לתמליל - על>

is rather better than

<html>
<head>
<title>בָּר</title>
</head>
<body>
<p>בר מזון</p>
</body>
</html>

don't you?!

This comment was removed by IKhitron.

@IKhitron
So you think that

<שפת סימני עריכה לתמליל - על>
<רֹאשׁ>

</רֹאשׁ>
</שפת סימני עריכה לתמליל - על>

is rather better than

<html>
a
</html>

Not at all. I think that

<הערה שם=אבג>טקסט1 טקסט2 טקסט3</הערה>

is much better than

<ref name=אבג>טקסט1 טקסט2 טקסט3</ref>

[Thinking out loud] Ideally, core wiki syntax should be bidi-neutral: for example, links ([[]]) and templates ({{}}) are bidi-neutral syntax.
Maybe references are so widely used that we can think about how to make them bidi-neutral as well, and maybe also include them in core (they are already leaking into Parsoid, which has some dedicated code for them). But this is probably out of scope for this task.

It is a really bad idea to translate tag functions. Actually I believe it is a really bad idea to translate all such markup and programming constructs without a working translator for those constructs.

I'm not certain I agree; I think if we have reasonable standards and guidelines, translation could be fine.

IMO this is related to the Shadow namespaces/global modules stuff. The way I see it is (to use <ref> as an example):

  1. There is a global template called meta:Template:Ref which expands to <ref>{{{1}}}</ref> (or some similar straightforward thing).
  2. Local wikis can make their own templates which inherit/invoke the global template. For instance hewiki:Template:הערה expands to {{Ref|{{{1}}}}} (or some such)

Then we have a principled way of "translating" tag-based constructs, and analysis tools (editors, bots, etc) can automatically determine that {{הערה}} is equivalent to <ref> by just following the template expansion.

Further, this generalizes straightforwardly to allow translating template arguments. So long as the "translation" template has a 1-to-1 mapping of all its arguments to some template in the global namespace, we can automatically recognize it as a translation. This also completely decouples the mapping from the core code or parser. You can change the argument or tag translations, or even possibly add aliases, without writing any code; the semantics follow from the inheritance from the global functions.
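
To make the "following the template expansion" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how MediaWiki, Parsoid, or VE work: the `TEMPLATES` table, the `resolve_tag` helper, and the rule that a name missing locally is looked up on meta are all hypothetical. Only the two template bodies come from the example above.

```python
import re

# Hypothetical template store keyed by (wiki, title). The two bodies are the
# ones from the example above; everything else here is an assumption.
TEMPLATES = {
    ("meta", "Ref"): "<ref>{{{1}}}</ref>",
    ("hewiki", "הערה"): "{{Ref|{{{1}}}}}",
}

# A template whose whole body is a bare tag wrapped around its first argument.
TAG = re.compile(r"^<(\w+)>\{\{\{1\}\}\}</\1>$")
# A template whose whole body passes its first argument to another template.
WRAPPER = re.compile(r"^\{\{([^|{}]+)\|\{\{\{1\}\}\}\}\}$")


def resolve_tag(wiki, title, max_depth=5):
    """Follow pass-through templates until one expands to a bare tag.

    Returns the tag name, or None if the chain is not a pure 1-to-1 wrapper.
    """
    for _ in range(max_depth):
        body = TEMPLATES.get((wiki, title))
        if body is None:
            return None
        tag = TAG.match(body)
        if tag:
            return tag.group(1)
        wrapper = WRAPPER.match(body)
        if wrapper is None:
            return None
        title = wrapper.group(1)
        # Assumption: a template that is not defined locally lives on meta.
        if (wiki, title) not in TEMPLATES:
            wiki = "meta"
    return None


print(resolve_tag("hewiki", "הערה"))
```

A tool that gets a tag name back can treat the local template as a reference; anything that is not a pure wrapper returns None and stays an ordinary template.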

Thank you, @cscott. One important extra point to your post: the tags are recognized as references in Visual Editor, the templates aren't.

This, in 2017, is indeed the most central point.

The problem is:

  • ref tags are very common
  • editing ref tags in source code in RTL languages is very hard
  • replacing ref tags with templates solves the above point for people who edit in source code, but complicates things in VE: templates are identified as templates and not as refs.

Whatever solves this problem for both wiki syntax and VE editing is good.

@IKhitron Right. But VE has a bunch of stuff awkwardly hard-coded at the moment; there's already a discussion about how to generalize this properly. One of the options is to teach VE about the global templates and inheritance stuff described above, so that VE wouldn't have to have the tags hard-coded. There are some other options, including adding a special semantic marker of some kind to the template that VE could look for. No one is satisfied with the way VE hard-codes things right now.

(A year ago I'd be excited about global templates. These days I still think that having them would be a step forward from the current madness. However, my thinking on this evolved: I'd be happier to convert a lot of templates, especially citations and infoboxes, to real features that can be installed and localized like extensions. In this light, converting tags or magic words to templates looks like a regression. I realize that this is quite far fetched thinking, of course.)

@Amire80 we have a bunch of different syntactic constructs, including tags, magic words, parser functions, and templates. The primary argument for solving this via templates is (in my opinion) consistency: we can solve the problem once and be done with it. I've got some other syntax improvements for templates which should make this even nicer.

I published an earlier discussion about semantic tagging in general at T176242: [EPIC] Representing / extracting wiki-specific application-level semantics. That underlies the "but VE only knows X" issue @IKhitron brought up.

It is a really bad idea to translate tag functions. Actually I believe it is a really bad idea to translate all such markup and programming constructs without a working translator for those constructs.

I have programmed in localized programming languages for several years, outside MediaWiki, and trying to reuse code really sucks big time.

If some community (like the Hebrew community) wants to shoot itself in the foot with an artillery cannon by reimplementing tag functions as templates, then let them do so, but do not tempt other communities to do the same. It is better to make clean, reusable code (and wikitext) that can be moved between projects.

In essence, it was already done. In the Hebrew Wikipedia, editors who edit in wiki syntax use a template instead of a <ref> tag, and when VE inserts an explicit <ref> tag, a bot automatically replaces it with the template. We know that it's suboptimal (for reasons explained in other comments here), but in practice it's less torture for wiki syntax editors, who will remain the majority for the foreseeable future, than using explicit <ref> tags mixed with right-to-left text. That's precisely why we're trying to think of a better solution.

In T274521#6921841 we see that apparently LST is localized

Indeed, but when I tried to suggest a patch to localize <ref>, it was rejected in review.

Change 674705 had a related patch set uploaded (by Arlolra; author: Arlolra):
[mediawiki/services/parsoid@master] Be more permissive for extension tag names

https://gerrit.wikimedia.org/r/674705

Indeed, but when I tried to suggest a patch to localize <ref>, it was rejected in review.

Do you have the gerrit number for that?

Change 674705 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Be more permissive for extension tag names

https://gerrit.wikimedia.org/r/674705

Indeed, but when I tried to suggest a patch to localize <ref>, it was rejected in review.

Do you have the gerrit number for that?

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/163467

I remember that it definitely worked, as far as the installation of core MediaWiki and Cite on my laptop was concerned, but I don't remember testing it with full-circle Parsoid and VE, and it's possible that it wouldn't work with them. Parsoid has changed a lot since then, in any case. I abandoned it after lack of activity, a -1 from C. Scott, and some oral conversations at Wikimania with VE and Parsoid people, in which they were all reluctant about it for various reasons.

The problem in the task description is still very much present and in need of resolution.

The way this is implemented is that the local names are hardcoded in the source,
https://github.com/wikimedia/mediawiki-extensions-LabeledSectionTransclusion/blob/master/includes/LabeledSectionTransclusion.php#L12-L30

		'he' => [
			'section' => 'קטע',
			'begin' => 'התחלה',
			'end' => 'סוף',
		],

and then a second, local tag is registered if a name is resolved based on $wgLanguageCode,

		// Register the localized version of <section> as a noop as well
		$localName = self::getLocalName( 'section' );
		if ( $localName !== null ) {
			$parser->setHook( $localName, [ __CLASS__, 'noop' ] );
		}
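
For readers who don't read PHP, the pattern in the two excerpts above can be restated as a small Python sketch. This is only an illustration of the approach, not LST's actual code: `LOCAL_TAG_NAMES`, `get_local_name`, `register_section_hooks`, and the plain `hooks` dict are hypothetical stand-ins for the hardcoded table, the name lookup, and the parser's hook registry. The Hebrew names are the ones quoted above.

```python
# Hypothetical stand-in for the hardcoded table in the extension source.
LOCAL_TAG_NAMES = {
    "he": {"section": "קטע", "begin": "התחלה", "end": "סוף"},
}


def get_local_name(key, language_code):
    """Return the localized name for `key` in the wiki's content language,
    or None when that language has no localized name."""
    return LOCAL_TAG_NAMES.get(language_code, {}).get(key)


def noop(content, attrs):
    """Tag handler that renders nothing, like the noop in the excerpt."""
    return ""


def register_section_hooks(hooks, language_code):
    """Register the canonical tag, plus a localized alias when one exists."""
    hooks["section"] = noop
    local_name = get_local_name("section", language_code)
    if local_name is not None:
        # Both names point at the same handler, so either spelling works.
        hooks[local_name] = noop
    return hooks


print(sorted(register_section_hooks({}, "he")))
```

The property that matters for this task is that the localized name is an additional alias chosen from the site language, so the canonical English name keeps working.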

I abandoned it after lack of activity, a -1 from C. Scott, and some oral conversations at Wikimania with VE and Parsoid people, in which they were all reluctant about it for various reasons.

It would have been nice if the reasons were documented here.

The problem in the task description is still very much present and in need of resolution.

LST has been doing this for a long time (I mean, I guess as long as this task has been open),
https://github.com/wikimedia/mediawiki-extensions-LabeledSectionTransclusion/commit/752e9b4700c93f75461f3c62ae7471f1caca2b2a

I'm not saying that's the right implementation, though. Maybe the problem here is more social than technical. Perhaps you'll need to follow some sort of RFC process to make it happen?

FYI: Since I just worked on MassMessage, I know that MassMessage only supports unlocalised LST tags. If there was a standard way to do localised tag names, it would likely make it possible to support those in MassMessage.

Summarizing a bunch of comments here, since @Amire80 asked for a summary on Slack:

Three main issues:

  1. Changing the translation breaks all wiki pages, so the usual translatewiki approach for l10n isn't great. Also, the localized names need to be exported in a uniform manner in extension.json, via the siteinfo API, etc., so that Parsoid and other third-party clients can determine which tags are valid in wikitext. (This was the reason for the C-1 of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/163467 at a time when Parsoid was external and needed siteinfo for tag names.)
  2. Harms reuse of templates/wikitext. We have some ad-hoc solutions in ContentTranslation and HTML cut-and-paste in VE, but these need to be tested and possibly generalized. (I.e., does VE selective serialization work properly if the localized name changes? This is an issue we've had trouble with previously, e.g. for namespace localization.) There are a lot of similarities here with Global Templates, so there's a strong temptation to hold off until we have a uniform treatment.
  3. As a social issue: what's the best practice for multilingual wikis and multilingual content? This includes sites like Commons: can we ensure that image captions in language X will always recognize tag names localized in language X? Or do we say that only English tag names are valid on multilingual projects? What about wikis which use the <translate> extension: will localized tag names work when you change the page language, i.e. on the article titled [[Foo/he]]? This requires a bit of adjustment in how tag extensions are registered, both in core and in Parsoid. What about language converter content? Do we/can we support tags named with both simplified *and* traditional characters? Both Cyrillic and Latin tag names on Serbian wiki? Etc. (IMO a good idea, but one that complicates a one-to-one mapping of language to tag name.)

These aren't insoluble issues by any means, but they would probably benefit from going through a formal RFC process to build consensus on the proposed solutions for 1/2/3 before diving into code.

And I'll note there is a *huge* amount of overlap here with Global Templates; as I've elaborated in T204283 and elsewhere, there are some good reasons for extension tag syntax to be viewed as just a convenient variant for writing {{#tag:ExtTagName|attr1=value|<<< .... >>>}}, and so if we can/do come up with a standardized way of remapping arguments in {{#tag:ref|...}}, it will also solve the <ref> issue. Further, that uniform approach will also (hopefully) allow our tools to uniformly apply those mappings when doing Content Translation, cut-and-paste of HTML into VE, etc. If we are going to use Wikidata, TemplateData, Abstract Wikipedia, <something else> for the Global Templates issue, then it would make a lot of sense to use it also to solve issues #1 and #2 above.

Which isn't to say we should boil the oceans before we can make tea. But if we have a basic idea of the approach we want to use for Global Templates, then we can use <ref> say as a small-scale initial test of the Big Idea, implementing localization/etc in a way consistent with the Big Idea. That seems preferable to doing something ad-hoc for <ref> and then having to unwind and undo that later.

Change 675310 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 had a related patch set uploaded (by C. Scott Ananian; author: Subramanya Sastry):
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

Change 675310 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 merged by jenkins-bot:
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

I'm intrigued by this, @cscott and @ssastry. How is it related to this task?

It has the commit from T30980#6943898

That is, it removes a syntax restriction on the content of tag names which Parsoid had (but core did not). This is necessary before localized tag names can be used, but it doesn't directly address the core issues raised in T30980#6949196.

Summarizing a bunch of comments here, since @Amire80 asked for a summary on Slack:

Three main issues:

  1. Changing the translation breaks all wikipages so the usual translatewiki approach for l10n isn't great.

That's fine. It shouldn't be as easy to change it as it is to fix a translation of a usual message. It should be more like special page aliases, or magic words like {{#invoke}} (source code. Fun fact: this file has quite a lot of translations defined, but most of them don't seem to be used, although the one in Arabic —an RTL language— is used quite a lot. Also, I somehow never bothered to define one for Hebrew, and I don't remember anyone ever complaining about it, but maybe that's because it's usually not used in articles, but mostly in the Template namespace.)

Also, the localized names need to be exported in a uniform manner in extension.json, via the siteinfo API, etc., so that Parsoid and other third-party clients can determine which tags are valid in wikitext. (This was the reason for the C-1 of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/163467 at a time when Parsoid was external and needed siteinfo for tag names.)

The Scribunto example above uses a PHP file. I'd be fine with a JSON file, too. I don't care about the format very much.

  2. Harms reuse of templates/wikitext. We have some ad-hoc solutions in ContentTranslation and HTML cut-and-paste in VE, but these need to be tested and possibly generalized. (I.e., does VE selective serialization work properly if the localized name changes? This is an issue we've had trouble with previously, e.g. for namespace localization.) There are a lot of similarities here with Global Templates, so there's a strong temptation to hold off until we have a uniform treatment.

We have the same problem with parser functions like {{#ifexist}}. Is it really a problem, however?

The important practical difference between <ref> and {{#ifexist}} is that <ref> is often used in the articles themselves, so practically all the editors have to touch it, while {{#ifexist}} is rarely used directly in articles, and is mostly used in the code of templates, which are edited only by a few advanced editors. So if a template is designed for reuse in different languages, it's OK to write it with the generic English name.

When Global Templates come along, the code of the templates in the repository will probably use the generic English names. I've just added a sentence about it to https://www.mediawiki.org/wiki/Global_templates/Proposed_specification. I can also imagine the possibility of defining a different language for the template page using Special:PageLanguage, and then allowing the use of localized magic words, but it's probably unnecessarily complicated.

  3. As a social issue: what's the best practice for multilingual wikis and multilingual content? This includes sites like Commons: can we ensure that image captions in language X will always recognize tag names localized in language X? Or do we say that only English tag names are valid on multilingual projects? What about wikis which use the <translate> extension: will localized tag names work when you change the page language, i.e. on the article titled [[Foo/he]]?

Hmm, I haven't thought about that. I'd actually expect this to support localized parser functions already, but apparently it doesn't work at the moment. I changed the page language of https://www.mediawiki.org/wiki/User:Amire80/test_localized_ifexist to Spanish and tried to use a localized Spanish parser function there, and it didn't work.

This requires a bit of adjustment in how tag extensions are registered, both in core and in Parsoid.

Yeah, this justifies some thought.

What about language converter content? Do we/can we support tags named with both simplified *and* traditional characters? Both Cyrillic and Latin tag names on Serbian wiki? Etc. (IMO a good idea, but one that complicates a one-to-one mapping of language to tag name.)

Maybe a lazy solution is to just add tag names in all variants? :)
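
A rough Python sketch of what "tag names in all variants" could look like as data: a many-to-one table from every accepted name to the canonical tag. Everything here is an assumption made for illustration. The `TAG_ALIASES` table and both helper names are hypothetical, the `zh-hans` and `zh-hant` entries are placeholders rather than proposed translations, and case-insensitive comparison is assumed rather than taken from this thread.

```python
# Hypothetical alias table. "ref" and the Hebrew name come from the task
# description; the zh-hans / zh-hant entries are placeholders, not proposals.
TAG_ALIASES = {
    "ref": {
        "he": ["הערה"],
        "zh-hans": ["placeholder-hans"],
        "zh-hant": ["placeholder-hant"],
    },
}


def build_alias_index(aliases):
    """Map every accepted name (canonical or localized) to its canonical tag.

    Raises ValueError when two different tags claim the same name, because a
    many-to-one mapping stays unambiguous only while names do not collide.
    """
    index = {}
    for canonical, by_variant in aliases.items():
        names = [canonical]
        for variant_names in by_variant.values():
            names.extend(variant_names)
        for name in names:
            # Assumption: tag names are compared case-insensitively.
            owner = index.setdefault(name.casefold(), canonical)
            if owner != canonical:
                raise ValueError(
                    f"{name!r} is claimed by both {owner!r} and {canonical!r}"
                )
    return index


def canonical_tag(name, index):
    """Return the canonical tag for any accepted name, or None if unknown."""
    return index.get(name.casefold())


index = build_alias_index(TAG_ALIASES)
print(canonical_tag("הערה", index))
```

Registering every variant sidesteps choosing a single name per language, but it does not by itself settle which name a tool should write back when serializing; that remains the one-to-one mapping question raised above.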