Page MenuHomePhabricator

Translation unit markers interferes with comment markup in Parsoid rendering
Open, MediumPublicBUG REPORT

Description

The wikitext

<!-- <translate>hello</translate> -->

triggers the creation of a TU marker:

<!-- <translate><!--T:1--> plop</translate> -->

This is not handled well by Parsoid, which doesn't recognize the translation unit, just closes the comment, complains it can't find the beginning of the range (this seems to be a common cause of T302886), and displays the rest of the comment in the page.

Things work out correctly in the Legacy parser because all the markup (including TU markers) is stripped from the page before it's processed, so this doesn't trigger in the current read-views, but it does have a -possibly- significant impact on Parsoid rendering - I think it might stop all "range extension processing" moving forward, which is less than ideal - and it displays comments on the page.

I'm not sure it should be considered as valid Wikitext, but this is the behaviour of the Translate extension (and not only "people having commented out things after having creating the TU markers"), and I *can* see the use case for defining TUs for comments in the page.

I see several options:

  • change the Wikitext markup for Translation Units (there's not that many wikis using Translate, it might be feasible to automate a migration?)
  • handle these "comments in comments" in Parsoid and parse them as translation unit markers (this adds Translate-specific code to Parsoid)
  • deprecate that usage, possibly lint for it, and leave it as that.

Event Timeline

ihurbain renamed this task from Translation unit markers interact with comments markup in Parsoid rendering to Translation unit markers interferes with comments markup in Parsoid rendering.Mar 11 2022, 5:00 PM
ihurbain renamed this task from Translation unit markers interferes with comments markup in Parsoid rendering to Translation unit markers interferes with comment markup in Parsoid rendering.

To be clear, I assume that Translate is stripping the <!--T:1--> TUs *before* the legacy parser is "stripping comments", so that the end result is that the legacy parser sees only the valid <!--<translate>plop</translate>--> comment?

Because <!-- <!-- --> --> isn't a valid comment in either HTML or wikitext; all the "rules" say the comment should end at the first -->.

I'd suggest a combination of your options 1 and 3 would be best:

  1. Start by linting for it, that's easy enough.
  2. It may be worth adding alternative syntax for TUs, like <meta tu="1"> or {{#tu:1}} or something.

It does seem like folks might want to be able to comment out chunks of a page without looking for all the TUs and changing --> to - -> or something like that. Unfortunately there's no nested comment syntax in wikitext (or HTML) so this is sort of par for the course: if there were non-TU comments in the section you'd have to do the same thing to prevent your commented out region from ending early. Eventually I'd expect that heredoc syntax would help, because heredoc delimiters *do* nest (sort of!):

{{0x|uniqueID<<<

can contain anything, {{ --> <!-- [[ >>> whatever

>>>uniqueID}}

will safely ignore anything inside, so long as it doesn't contain the string >>>uniqueID (which presumably you've chosen your uniqueID to avoid). So that would give an alternate syntax for "commenting out chunks" that doesn't break if there are HTML comments (or TUs) embedded inside.

While we wait for that (T114432), having an alternative syntax around --- even if you don't make it the default or try to migrate existing content -- allows anyone who wants to comment out a chunk to do a search-and-replace of any <!--T:x--> with {{#tu:x}} (or whatever), and they don't have to undo that change if they un-comment-out the region.

But is that any easier/better than just doing the search-and-replace of --> with - ->? Seems like that would be easier than all this, and if it's an uncommon case then just linting things away is probably best.

To be clear, I assume that Translate is stripping the <!--T:1--> TUs *before* the legacy parser is "stripping comments", so that the end result is that the legacy parser sees only the valid <!--<translate>plop</translate>--> comment?

It sees <!-- plop -->: neither <translate> is a usual parser tag, it’s removed in the same go as the unit ID comment, in rETRA tag/PageTranslationHooks.php:48 (at 9ae8c76961b1) (ParserBeforeInternalParse hook handler).

  1. It may be worth adding alternative syntax for TUs, like <meta tu="1"> or {{#tu:1}} or something.

If we want to add another syntax for translation units, we could just add an attribute to the <translate> tag, like <!-- <translate tid="1">plop</translate> --> (although all parsing takes place using regexes, not via XML tag processing, so adding more and more attributes makes the parsing more and more painful, but adding a new tag name would in turn make the usage more painful and the wikicode harder to understand).

This is just nasty that arises from a processing model that specifies what should be done and in what order vs describing semantics and specs.

I think the ideal solution here is to deprecate this usage and encourage usage that wraps the entire comment in translate markers rather than just the content of the comment.

So, <translate><!--hello--></translate> would be the preferred markup, so a comment isn't constructed by gluing together 3 substrings. It can work in some cases (ex: attribute strings), but it is hard to guarantee correct token and string gluing behavior that works in all the edge cases. Parsoid strong frowns upon that usage and has had to support some usages of string/token gluing simply because of the widespread usage. But, in an ideal scenario, we would deprecate and migrate to other syntax / forms that provide similar functionality.

Alternatively, if one argues that translation should only care about text nodes (which isn't really the case currently since translate markup wraps markup characters as well), then yes, we would need to pick a different representation of translation unit markers. We cannot simply rely on an attribute of the translate tag since right now, multiple translation units can be generated by a translate wrapper.

All that said, my preferred strategy is to deprecate translate markers around comment text and migrate such usages to wrap the entire comment. That seems like the least disruptive / low-impact solution that doesn't foreclose possibilities of picking alternative representations for translation units in the future.

For the purposes of linting, we may need to bake in some hacky regexps in the PEG tokenizer if we want to go the linting route. But if this usage is uncommon, we should use rendering / editing breakage as a potential hook for getting those pges fixed if that is not too disruptive of a strategy.

One use case I can see that would not work with <translate><!--hello--></translate> is commenting out a large part of a translatable page (where markers have already been added). I have no idea how common that would be, but I could see it happening - and at the same time, not wanting to lose the existing translated content just because it got commented out. It's also worth mentioning that the <translate> themselves are not the issue, but the TU markers (<!-- T3 -->). If there's a large area comprising several translation units, and that only some of that area (including a marker) get commented out, we're having a mess again - because in that case Parsoid will display whatever is after the TU marker, whereas the legacy parser would not.

I guess I am saying, we won't support embedding comments within comments - that is undefined behavior that might happen to work in the legacy parser but isn't guaranteed to work in Parsoid and users shouldn't rely on it. We can consider hacking in some support for it if there is a large set of pages that depend on it, but ideally no.

change the Wikitext markup for Translation Units (there's not that many wikis using Translate, it might be feasible to automate a migration?)

Translate is also part of MLEB which is available for self hosted wiki's to use for making their content available in multiple languages.

deprecate that usage, possibly lint for it, and leave it as that.

The option to lint for this makes sense. Additionally, can we implement a check for this and not allow users to use pages with such markup in VE but continue to behave the same way with the legacy parser? This can be in place until we decide the syntax version and migration path.

I'm not sure about the appropriate syntax to use for translation units, but on the Translate side we could look at using the SYNTAX_VERSION to identify which markup has been used.

My gut feeling is that commenting out content with HTML comments is not as common in wikitext as it is in programming languages. I imagine there are workarounds for that, starting from templates that produce no output when rendered (as long as the content itself is balanced ;). Since nested HTML comments are not possible, I don't see a product reason why this use case should be supported.

It makes sense to me, like @abi_ proposed, to try to detect and lint these things together with the other corner cases you have found and treat such pages as VE-incompatible until fixed.

I can also imagine a case where someone wants to produce a translated comment in the wikitext (e.g. to create pages used with preload-parameters), but given it seems possible by wrapping the whole comment inside <translate> tags, I think we are good on this.

Arlolra triaged this task as Medium priority.Apr 26 2022, 10:43 PM
Arlolra moved this task from Needs Triage to Known Differences on the Parsoid board.

Ran "by chance" into another instance here: https://wikimania.wikimedia.org/wiki/2024:Program - top of the page has the following markup:

<!--
<translate>
<!--T:5-->
The proposals will be made public on eventyay for feedback and coordination purposes. The form fields that will be made public are labeled "This will be shown publicly". Other fields will not be public.
</translate>-->

I just ran into this on MediaWiki.org (that's the permalink, but it's the current version of the page). There are actually several commented-out ranges on the page, which are all displaying incorrectly.

Screenshot 2024-04-24 at 15.32.14.png (980×1 px, 406 KB)