Change translation variable (tvar) syntax
Open, High, Public

Description

Per T261181#6831706 the current syntax is problematic for Parsoid.

Current syntax example:

Latest '''[[<tvar|technews>m:Special:MyLanguage/Tech/News</>|tech news]]''' from the Wikimedia technical community. Please tell other users about these changes. Not all changes will affect you. [[<tvar|more-transl>m:Special:MyLanguage/Tech/News/2021/07</>|Translations]] are available.

Subbu's proposed syntax example:

Latest '''[[<tvar id='technews'>m:Special:MyLanguage/Tech/News</tvar>|tech news]]''' from the Wikimedia technical community. Please tell other users about these changes. Not all changes will affect you. [[<tvar id='more-transl'>m:Special:MyLanguage/Tech/News/2021/07</tvar>|Translations]] are available.

One possible issue with this is that id attributes are global to the page, while tvars are local to the translation unit. Often numbered ids starting from 1 are used, causing the same id to be used multiple times on a translatable page. This syntax does not appear in the rendered output, but could be confusing to the editors.

An alternative would be to use another name, such as <tvar name='1'>...</tvar> or <tvar x='1'>...</tvar> to make it shorter, or to explore whether a <tvar 1>...</tvar> syntax would be possible. The quotes around the value would be optional, only needed if there are spaces (which I think should be disallowed in variable names if not already prevented).

Since Translate is (for now) string based, the regular expression would be amended to allow whitespace and quotes, but not other attributes (similar to <translate nowrap>).

For migration, old syntax can automatically be converted to the new syntax whenever a page is (re-)marked for translation.
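
As a rough illustration of what such a conversion could look like, here is a minimal Python sketch. It is not Translate's code: convert_old_tvars is a hypothetical helper, the pattern for the old syntax is an assumption (names without ">" or double quotes), and the output uses the name attribute that was eventually deployed.

```python
import re

# Assumed shape of the old syntax: <tvar|NAME>VALUE</>
OLD_TVAR_RE = re.compile(r"<tvar\|([^>]+)>(.*?)</>", re.DOTALL)


def convert_old_tvars(wikitext):
    """Rewrite old-style variables as <tvar name="NAME">VALUE</tvar>.

    Hypothetical helper for illustration only; new-style variables are
    left untouched because they do not start with "<tvar|".
    """
    return OLD_TVAR_RE.sub(
        lambda m: '<tvar name="{}">{}</tvar>'.format(m.group(1), m.group(2)),
        wikitext,
    )


old = "[[<tvar|technews>m:Special:MyLanguage/Tech/News</>|tech news]]"
print(convert_old_tvars(old))
```

A bot or a hook run on (re-)marking could apply the same substitution; the caveats about unmarked changes discussed in the comments still apply.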

Documentation pages that need updating

Current status

  • Support for new syntax <tvar name=1>value</tvar> is deployed and documented
  • Showing warnings for misuse of variables on Special:PageTranslation is waiting to be deployed
  • There is no automatic migration to the new syntax, nor an easy way to find pages using the old syntax (Special:Search might work though)

Event Timeline

Does an arbitrary tag’s id attributes really need to be unique per page? If Translate handles this wikitext and doesn’t omit id attributes in the resulting HTML, what’s wrong with it? x is hard to understand, and name is long (although I think it’s acceptable if id can’t be used).

By the way, I like this change: it not only helps Translate become compatible with Parsoid, but also fixes several bugs: MediaWiki-extensions-CodeMirror gives up proper highlighting on translation variables, the page is badly damaged if the translation syntax is incorrect and Translate makes no replacement, and possibly quite a few others.

Thanks for the feedback. I'll try to clarify my concern: because id attributes are expected to be globally unique in an HTML document, there is a risk that some editors would assume the same for tvar ids, avoid re-using ids across translation units (not a problem in itself), and feel discouraged by the extra effort of doing so (a problem).

If the ids get exposed in id attributes in the parser output (e.g. for VisualEditor being able to change their name), that could be problematic. Likely some kind of mapping between id and metadata would be needed, and it might be clearer to just use a different name to avoid confusion. Hopefully someone more familiar with Parsoid can clarify whether this is a problem or not.

There is also a distant possibility that there would be a linter in the future that warns about non-unique ids in the wikitext without understanding tvar elements. I'm not really worried about this.

To clarify, I am not attached to id. That was just a placeholder for something more appropriate. Pick something that makes sense in the translate extension context. I don't need to be involved in that bikeshedding. :)

Thanks for the announcement on Meta and for working on this issue.

Tvars are added by translation admins, who have to learn a fairly complex syntax anyway: they can just as well learn that the attribute is called name, or that the id attribute does not have to be unique. Whatever the attribute is called, they will have to learn the syntax.

The real way to help them is not to choose between id, which is shorter, and name, which is less confusing, but to create a tool that adds a tvar with a keyboard shortcut, even with an automatically chosen name.

The quotes around the value would be optional, only needed if there are spaces (which I think should be disallowed in variable names if not already prevented).

The allowed characters are currently unclear to me, and they lead to at least one bug: in TUX, the tvar helper buttons never match the real tvar names.

About migration: I am not sure I understand all the technical aspects, but having to re-mark all pages for translation seems useless and/or insufficient to me.

  • useless because the source page is the only one that contains <tvar>; there is no need to update the translation template or the translation pages.
  • insufficient if Parsoid becomes the only available parser, because all pages that were not re-marked for translation will be broken; in that case, all pages should be fixed by a bot or, ideally, silently fixed directly in the database, even in old revisions.

adds a tvar with a keyboard shortcut, even with an automatically chosen name.

That should be possible with the editor(s). We can consider that thing separately. I think it's a good idea.

The allowed characters are currently unclear to me, and they lead to at least one bug: in TUX, the tvar helper buttons never match the real tvar names.

We call those insertables. Currently translatable pages only support numbers in the name, but this is easily changeable. We can do this separately as well.

I'm open to suggestions on what the limitations on variable names should be. Something like a-zA-Z0-9_- would be simple, but would prevent names in other writing systems.

having to re-mark all pages for translation seems useless and/or insufficient to me.

It's trivial to write a bot to convert the syntax. Having the syntax automatically converted would just be a handy way to speed things up without causing a huge backlog of pages waiting to be marked for translation. As I alluded to above, it is not safe to mark a page for translation if it contains other unmarked changes. This will need manual involvement of translation admins.

Thanks for the feedback. I'll try to clarify my concern: because id attributes are expected to be globally unique in an HTML document, there is a risk that some editors would assume the same for tvar ids, avoid re-using ids across translation units (not a problem in itself), and feel discouraged by the extra effort of doing so (a problem).

If the ids get exposed in id attributes in the parser output (e.g. for VisualEditor being able to change their name), that could be problematic. Likely some kind of mapping between id and metadata would be needed, and it might be clearer to just use a different name to avoid confusion. Hopefully someone more familiar with Parsoid can clarify whether this is a problem or not.

There is also a distant possibility that there would be a linter in the future that warns about non-unique ids in the wikitext without understanding tvar elements. I'm not really worried about this.

Thanks for the explanation! As I mentioned, even though IMO the shorter, the better, I can live with the longer version if the shorter is likely to cause trouble.

Change 674323 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Support new HTML-y translation variable syntax

https://gerrit.wikimedia.org/r/674323

The patch implements the following syntax: <tvar name="key">value</tvar>. Quotes are optional. Whitespace is allowed in limited ways. key can contain any Unicode letter or number, -, _ and $ (per T275106: Expand the allowed characters for variable insertable for translatable pages.).

Examples of valid syntax:

<tvar name="key">value</tvar>
<tvar name=key>value</tvar>
<tvar name='key'>value</tvar>
<tvar name    =   'key'   >value</tvar>
<tvar name=1 >value</tvar>
<tvar name=key$>value</tvar>text
<tvar name=стоянка>value</tvar>

Examples of unsupported syntax:

<tvar name=key>value</>
<tvar name = key >value</tvar>
<tvar name="#2" >value</tvar>
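
To make the rules above concrete, here is a minimal Python sketch of a pattern that accepts the valid examples and rejects the unsupported ones. It is reconstructed from the examples in this comment only and is not the regular expression used by Translate; TVAR_RE and parse_tvar are hypothetical names, and Python's \w is used as an approximation of "any Unicode letter or number" plus the underscore.

```python
import re

# Hypothetical pattern inferred from the examples; not taken from Translate.
TVAR_RE = re.compile(
    r"""
    <tvar\s+name
    (?:
        \s*=\s*(?P<quote>["'])(?P<quoted>[\w$-]+)(?P=quote)  # quoted key, spaces around = allowed
      |
        =(?P<bare>[\w$-]+)                                   # bare key directly after =
    )
    \s*>(?P<value>.*?)</tvar>
    """,
    re.VERBOSE | re.DOTALL,
)


def parse_tvar(text):
    """Return (name, value) for the first variable in text, or None."""
    match = TVAR_RE.search(text)
    if match is None:
        return None
    return (match.group("quoted") or match.group("bare"), match.group("value"))


print(parse_tvar("<tvar name    =   'key'   >value</tvar>"))
print(parse_tvar("<tvar name = key >value</tvar>"))
```

The assumption that whitespace around the equals sign is only tolerated for quoted keys is a guess made to fit the listed examples.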

For now, and probably forever, the old syntax stays supported. There is currently no automatic conversion to the new syntax, nor any tooling to help find and replace the old syntax.

Change 674323 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Support new HTML-y translation variable syntax

https://gerrit.wikimedia.org/r/674323

Should now do the documentation update.

Current status

Patch that implements support for <tvar name=1>value</tvar> syntax is in review. The new syntax is more strict about allowed chars in keys.

The patch has already been merged more than a week ago. Also, as far as I see the latest version, it’s not stricter about allowed characters since patchset 3. (I’ve already updated m:Wikipedia with a tvar name $wp.org, which doesn’t meet the restrictions set in T274881#6937992, but works with the current code. This means that if the syntax becomes stricter in the future, that’d be a breaking change.)

Yes, I changed approach, so that the validation is not performed as part of the parsing. For insertables, I plan to use the new allowed list of chars, which is actually more relaxed. It doesn't include a period, so $wp.org would (still) be misparsed as an insertable, as just $wp. So, even though it technically works for now, I don't recommend it because it won't work with insertables.
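
The misparse can be sketched in a few lines of Python. The character class below is an assumption based on the allowed list described earlier in this task (Unicode letters and numbers, -, _ and $); INSERTABLE_RE and find_insertables are hypothetical names, not Translate's implementation.

```python
import re

# Assumed insertable shape: "$" followed by allowed name characters (no period).
INSERTABLE_RE = re.compile(r"\$[\w$-]+")


def find_insertables(text):
    """Return every $name token found in a translation unit's text."""
    return INSERTABLE_RE.findall(text)


# The period ends the match, so only "$wp" would be offered to translators.
print(find_insertables("Visit $wp.org or $more-transl"))
```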

I wouldn’t use it, either, if I named the translation variable, but the name was already there, so I decided to keep it instead of breaking and manually fixing dozens of translations.

Change 685772 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] Add a warning about non-insertable translation variable names

https://gerrit.wikimedia.org/r/685772

Change 685773 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] Highlight duplicate translation variable names

https://gerrit.wikimedia.org/r/685773

Small note about the last patch: it is sometimes useful to reuse the same tvar name when the content is identical (I think it is more understandable for translators).

For example

<translate>The parameter’s value is {{#if:<tvar name="1">{{{param|}}}</tvar>|<tvar name="1">{{{param|}}}</tvar>|not specified}}.</translate>

This way translators are certain that this is the same text twice. The important thing is that they should contain byte-for-byte the same value, e.g. dropping the pipe in the second occurrence would potentially break it (I don’t remember whether the first or the last occurrence is used by Translate, but it doesn’t matter either—if you have different content, it’s wrong.)

Also, in the above example wrapping the whole unit is okay, but wrapping specifically the translation variable would break the whole {{#if:, as {{#if:<span class="mw-translate-fuzzy"></span>|<span class="mw-translate-fuzzy"></span>|undefined}} returns the empty <span>, not "not specified". This message is for the translation admins anyways, not for translators/readers, so if anywhere, it should appear on the Special:PageTranslation interface, not on the translated pages. As we said, the current implementation results in quite some false positives, but it would make sense if you would test whether eventual duplicates are byte-for-byte equal, and warn about them on Special:PageTranslation.
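
The byte-for-byte check suggested above could look roughly like the following Python sketch. It only handles the new syntax, uses a simplified pattern that is an assumption rather than Translate's parser, and conflicting_tvars is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Simplified, assumed pattern for the new syntax only.
TVAR_RE = re.compile(
    r"""<tvar\s+name\s*=\s*["']?([^"'\s>]+)["']?\s*>(.*?)</tvar>""",
    re.DOTALL,
)


def conflicting_tvars(unit):
    """Map each reused variable name to its set of differing values.

    Names that are reused with identical values are not reported.
    """
    values = defaultdict(set)
    for name, value in TVAR_RE.findall(unit):
        values[name].add(value)
    return {name: vals for name, vals in values.items() if len(vals) > 1}


same = ('{{#if:<tvar name="1">{{{param|}}}</tvar>|'
        '<tvar name="1">{{{param|}}}</tvar>|not specified}}')
different = same.replace("{{{param|}}}</tvar>|not", "{{{param}}}</tvar>|not")
print(conflicting_tvars(same))       # identical reuse: nothing to warn about
print(conflicting_tvars(different))  # the pipe was dropped in the second occurrence
```

A warning on Special:PageTranslation would then only be needed when the returned mapping is non-empty.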

This message is for the translation admins anyways, not for translators/readers, so if anywhere, it should appear on the Special:PageTranslation interface, not on the translated pages.

That's a good point. I will explore that approach.

Do you have examples?

No in vivo examples right now, but I am thinking of usernames (“$user1 and $user2 have drafted this text. $user1 wrote the other text too.”) or variables/parameter values in technical documentation (“You may use $val1 or $val2 as values. $val1 will display the raw value whereas $val2 will render the formatted value.”)
Not a major issue though: I surely prefer to have this nice warning feature!

I don’t consider it a critical feature (it’s a nice-to-have, though, of course), and given the problems I listed in T274881#7069958, I decided to give the patch a −1 on Gerrit.

By the way, thinking it over, I think the two patches should provide the same UX—they’re both issues with the translation variables that should be fixed by editing the source page. There can be warnings on page preview, there can be warnings when marking the page for translation, there can be even both, but the same for the two warning categories.

Putting these warnings on Special:PageTranslation resulted in much cleaner code. This is how it looks:

One minor possible source of confusion is that variable names are prefixed with $ in the messages but not in the source. Seems acceptable to me.

Change 685772 abandoned by Nikerabbit:

[mediawiki/extensions/Translate@master] Add a warning about non-insertable translation variable names

Reason:

going with a different approach

https://gerrit.wikimedia.org/r/685772

Change 685773 abandoned by Nikerabbit:

[mediawiki/extensions/Translate@master] Highlight duplicate translation variable names

Reason:

going with a different approach

https://gerrit.wikimedia.org/r/685773

Change 689011 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] Add validation for translation variables

https://gerrit.wikimedia.org/r/689011

Using $ should be totally OK—this is how translators see it, which should be familiar for translation administrators as well. Actually, I’d write it this way manually, when it doesn’t matter which one is easier to produce. I found some other issues though that I noted on Gerrit. However, overall it’s nice work, thanks!

Change 689011 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Add validation for translation variables

https://gerrit.wikimedia.org/r/689011