Parsoid should support <translate> and <tvar> natively as they are not normal parser tags
Closed, ResolvedPublic

bzimport set Reference to bz48891.
bzimport added a subscriber: Unknown Object (MLST).
TheDJ created this task.May 28 2013, 12:34 PM

Yes, the Translate extension needs to be extended to support VisualEditor editing, but the core issue is that Parsoid doesn't recognise <translate> as an extension block; should it take the local list of registered extensions, and for ones it hasn't been told how to deal with just wrap it as a typeof="mw:Object/Unknown" or something similar?

Parsoid does fetch a list of registered extensions from the config and uses that to wrap the block with mw:Object/Extension/<ext-name-here> type. All other unknown tags are pass through as plain text.

The problem is that mediawiki does not report <translate> as an installed extension. Check http://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&format=jsonfm&siprop=extensiontags

Any idea why mw.org is not reporting <translate> as a registered extension?

(In reply to comment #2)

Parsoid does fetch a list of registered extensions from the config and uses
that to wrap the block with mw:Object/Extension/<ext-name-here> type. All
other unknown tags are pass through as plain text.

Ah, my apologies; I thought you were doing this, but it didn't even occur to me that the Translate extension might fail to do this. Re-filing.

<translate> and <tvar> are not a normal extension tags. Please explain why I should register them.

Parsoid needs to know if something is a valid extension tag so that it can process them rather than escape them to literal text. This will also let VE support them in the future and protect them from being modified right now.

Is there any other way to discover that <translate>, <tvar>, <language> are valid tags on the wiki, i.e. what is the mediawiki api endpoint? So far we've been using siprop and fetching installed extension tags and use that to detect these.

  • Bug 52408 has been marked as a duplicate of this bug. ***

@Niklas are there issues with registering them?

I don't know. Either the registration would just be symbolic without any side effects, or it could also interfere with the current functionality.

Does someone want to try registering them and see what happens? I probably don't have time before Wikimania.

Change 83427 had a related patch set uploaded by Nikerabbit:
Register translate and tvar to the parser

https://gerrit.wikimedia.org/r/83427

See the above commit. It is not thoroughly tested. I still don't understand what would the benefit.

Niklas: If VE has to support editing for <translate>, <tvar>, etc. tags, Parsoid has to know about those tags. Parsoid finds out about installed tags by querying mediawiki API. So, either MW API should expose this information for hooks automatically or we need a different API endpoint, or you have to register them.

In the long term yes, but wont this render VE useless on translatable pages in the short them?

(In reply to comment #14)

In the long term yes, but wont this render VE useless on translatable pages
in the short them?

Yes. On MW.org you can use the alien node editor to edit any extension node's contents (but they won't be rich), but on other wikis it will make them un-editable in VE.

However, making a Translation extension editor in VE shouldn't be hard at all - I've created that as bug 53974 along the lines of other bugs.

Qgil added a comment.Sep 27 2013, 11:21 PM

fwiw this bug has been mentioned at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#VisualEditor_weekly_update_-_2013-09-26_.28MW_1.22wmf19.29

I have also seen this problem here and there while editing at mediawiki.org. Then again these tags are not used in most Wikimedia projects or MediaWikis there so fair enough.

Assigning to James, hoping this may get a parsoid person to review the patch that's been available for nearly a month now.

(In reply to comment #17)

Assigning to James, hoping this may get a parsoid person to review the patch
that's been available for nearly a month now.

Oh. We've been patiently waiting for you to merge it. :-(

Will ping the team.

I think I already did it yesterday.

(In reply to comment #19)

I think I already did it yesterday.

There is a comment, but now a followup question requires addresssing

Change 83427 merged by jenkins-bot:
Register translate and tvar to the parser

https://gerrit.wikimedia.org/r/83427

I don't deserve the credit for Niklas's fix. :-)

GWicke added a comment.Oct 7 2013, 7:14 PM

That patch won't work well with Parsoid, see the comment in the patch set.

This issue appears to suffer from scope creep. The requirements of the current summary have been met. See https://translatewiki.net/wiki/Special:Version?uselang=en: Section "Parser extension tags" now shows "<translate> and <tvar>"

Misunderstanding or miscommunication about the actual requirements is not scope creep. The patch which was made does not add value. I changed the title to reflect the issue with my understanding of it.

The patch will work with Parsoid right now, because Parsoid always goes through the PHP parser when dealing with extensions and tag hooks. But, there are plans for Parsoid to bypass the PHP parser and call the extensions/hooks directly.

Right now, The translate code preprocesses the page by stripping the <translate> and <tvar> tags before the PHP parser actually parses the page source (by registering for ParserBeforeStrip hook). So, even though the tag hooks are registered (with a callback that bombs if called), the PHP parser never gets to calling them.

If Parsoid is able to mimic the behavior (by calling all hooks, not just extension callback hooks), the existing code will continue to work. However, if Parsoid only supports tag extensions directly, then, the ParserBeforeStrip code won't be invoked by Parsoid at that time and this will be a problem at that point.

So, this is not an issue *right now*, but could be at a later point depending on what functionality Parsoid will implement natively and what it will continue to defer to the PHP parser.

I think Gabriel was responding to that future concern and also hoping that we can use the opportunity of Parsoid's ongoing development to cleanup some of these interfaces and mechanisms. We could streamline extensions to go through narrow interfaces rather than continuing to use all sort of hooks into various points of the parsing timeline. Anything this is something we could discuss more.

Hope this summarizes where we stand now.

Actually, now that I wrote that, I might have misspoken about whether it will work now as well. Niklas is right. For <translate> and <tvar> tags found in top-level pages, since we never go through PHP parser at all, this may still not work.

Parsoid will recognize the hooks because of the registration. In turn, it will call the PHP parser to translate the extension content, which in turn may callback into the translate extension rather than the ParserBeforeStrip callback (which is what <translate> and <tvar> relies on), and that will bomb.

One of us (Gabriel or me) will experiment with the updated code locally and update this bug.

Okay, we tested and all is good for now. You can ignore #c27 :-).

As I indicated in #c26, this is a problem for when we start updating Parsoid code to call extensions directly rather than go through the full parse pipeline (which was meant to be a temporary hack while we figured out how to call extensions directly).

GWicke added a comment.Oct 9 2013, 4:01 PM

To summarize:

  • We currently work around the missing tag hook API by calling the full action=parse pipeline for each extension tag. This is quite expensive as you can imagine.
  • This lets the translate hack work for now. At least it won't crash.
  • The translate extension will crash once we actually call the registered tag hook.

So there is some time to fix this up, but we should not pretend that it is fixed right now.

Can you provide more information on what stops you from using the actual tag hook?

I just found out that if edit is made via API, the ParserBeforeStrip hook I am using does not get called and the exception is shown. Does anybody have an idea why?

Niklas: To me, it looks like an existing bug got exposed. Can you provide more details? Maybe full url of the API call? And/or, if you hop onto #mediawiki-parsoid, we could investigate.

Niklas: So, here is what I found:

https://meta.wikimedia.org/w/api.php?action=parse&text=<translate>foo</translate> works and does not crash.

https://meta.wikimedia.org/w/api.php?action=parse&text=<translate><!--T:1-->foo</translate> throws the exception.

I am a bit baffled why the comment in the extension content should change what hook is called. I'll let you investigate from here. It might be worth using this opportunity to use a regular extension hook (rather than the ParserBeforeStrip hook), if that might work for you.

gmaruzz wrote:

bump

Trying the parser hook solution now. I have code like:

		$parser->setHook( 'translate', function ( $input, $params, $parser, $frame ) {
			$re = '~<tvar\|([^>]+)>(.*?)</>~u';
			$output = preg_replace( $re, '\2', $input );

			$output = $parser->recursiveTagParse( $output, $frame );
			$output = trim( $output );
			return $output;
		} );

This seems to work okay on simple text, with the exception of section edit links.

I'm still trying with more complex pages with templates as well extensions and our tutorial.

Change 91583 had a related patch set uploaded by Nikerabbit:
Add $wgTranslatePageTranslationUseParserHook

https://gerrit.wikimedia.org/r/91583

Issues I have found:

  • [MAJOR] Disabling edit section links does not work for headings inside <translate> tags
  • [NORMAL] Edit section links don't work. Not a regression, but would be nice fix. Above issue make this a bigger issue.
  • [BLOCKER] Table of contents does not include links
  • [MINOR] White space trimming behavior has changed. The original logic cannot be implemented with information provided for the parser hook. You can see the buildup at bottom of http://dev.translatewiki.net/wiki/Extension:Translate compare with http://www.mediawiki.org/wiki/Help:Extension:Translate . This is annoying and would be nice to fix, but the workaround is to edit page source. Documentation about whitespace before has to be update if this way is chosen.

Please help me solve the remaining issues.

Also, please test the patch if you can to see if you find any other changes in behavior.

Editing translatable pages via API seems to work now.

(In reply to comment #36)

Issues I have found:

  • [MAJOR] Disabling edit section links does not work for headings inside <translate> tags

You could try marking your output as HTML according to https://www.mediawiki.org/wiki/Manual:Tag_extensions#How_can_I_avoid_modification_of_my_extension.27s_HTML_output.3F

I'm not sure whether that suppresses section edit links too, but worth a try IMO.

  • [NORMAL] Edit section links don't work. Not a regression, but would be nice fix. Above issue make this a bigger issue.
  • [BLOCKER] Table of contents does not include links

Can you check whether your headings are matched by the formatHeadings regexp?

Ideally we'd move both TOC and section edit links to JS / CSS. That is our plan for Parsoid output, but not going to happen over night.

(In reply to comment #39)

(In reply to comment #36)
> Issues I have found:
>
> * [MAJOR] Disabling edit section links does not work for headings inside
> <translate> tags

You could try marking your output as HTML according to
https://www.mediawiki.org/wiki/Manual:
Tag_extensions#How_can_I_avoid_modification_of_my_extension.27s_HTML_output.
3F

This causes (lack of) whitespace issues and breaks all lists.

I'm not sure whether that suppresses section edit links too, but worth a try
IMO.

It doesn't.

> * [NORMAL] Edit section links don't work. Not a regression, but would be nice
> fix. Above issue make this a bigger issue.
> * [BLOCKER] Table of contents does not include links

Can you check whether your headings are matched by the formatHeadings regexp?

I'm assuming it processes the html I'm returning, which is:
<h2><span class="mw-headline" id="Culture">Culture</span><mw:editsection page="Pupu" section="1">Culture</mw:editsection></h2>

Not sure whether that matches or not.

Ideally we'd move both TOC and section edit links to JS / CSS. That is our
plan
for Parsoid output, but not going to happen over night.

Any changes for interim solutions?

Change 91583 abandoned by Siebrand:
Add $wgTranslatePageTranslationUseParserHook

Reason:
Abandoned as there is no progress on this patch set. Can be re-activated if there is outlook on getting this in a mergable state.

https://gerrit.wikimedia.org/r/91583

No patch any more. :-(

(In reply to James Forrester from comment #42)

No patch any more. :-(

Hmm I'm not sure that's what "Can be re-activated if there is outlook on getting this in a mergable state" means, PATCH_TO_REVIEW seems a sensible way to tag such stale patches in need of TLC.

(In reply to Nemo from comment #43)

(In reply to James Forrester from comment #42)
> No patch any more. :-(

Hmm I'm not sure that's what "Can be re-activated if there is outlook on
getting this in a mergable state" means, PATCH_TO_REVIEW seems a sensible
way to tag such stale patches in need of TLC.

No. There is no patch to review for merge. There is patch to re-*do*.

To be clear: there is nothing wrong with that patch itself, it just doesn't work due to limitations in the parser itself. I cannot continue with this patch/bug until I get help from other people to modify parser/parsoid.

Kabhi2104 asked on MediaWiki-Internationalization:

Nemo_bis: i had one doubt, what does normal parser tags mean ?

Explained briefly from a non-technical perspective, this means that <translate> is a "tag" for the Translate extension, but the parser "doesn't know" it. A symptom is the absence from the list on Special:Version.

Concretely, the parser deals with tags in ways that interfere with the Translate syntax, with some consequences identified in T50891#524894. But even in the current situation there are some limitations, especially for the newline, which is meaningful for the parser and perhaps shouldn't be required.

More broadly, one could speculate that most wikitext markup is meant to add or modify the HTML of the resulting page, or properties of the page in the database: while translate tags don't do either and are properties of the wikitext itself.

Nikerabbit lowered the priority of this task from "Normal" to "Low".
Nikerabbit raised the priority of this task from "Low" to "Normal".Jan 28 2015, 5:46 AM
Nemo_bis changed the task status from "Open" to "Stalled".Jan 28 2015, 6:57 AM
Nemo_bis set Security to None.

I had a chat with Subbu about this. In the short tem (this year) we could add configurable support for <translate> tags in parsoid. That support will preserve the tags and comments when pages are edited with VE. In the long term we will implement support for translatable pages without any kind of markup. Marking those pages for translation would only work with VE and it needs special page translation module in VE.

Since <translate> is a special extension that hooks into the parser pipeline in a way that is not visible to Parsoid, and also since it doesn't render anything new, but marks DOM fragments for translation, the ideal way to deal with this in the short term (till translation goes with a VE based approach) is to implement a native version of translate in Parsoid (just like Cite and planned Gallery extension).

This native translate will be relatively straightforward and will not do anything special beyond recognizing the <translate> and associated tags and marking up DOM fragments with a special typeof that clients like VE or gadgets or anyone could use to enable translation. In addition, it would also have to register a html2wt handler for serializing it back to wikitext.

This should not be a lot of work -- most of it will be in figuring out if there are any tricky things that the translate extension does. Talking to Niklas, we figured this is not high priority, but something worth tackling over the next 6 months.

marcoil changed the title from "<translate> and <tvar> do not work with Parsoid because they are not normal parser tags" to "Parsoid should support <translate> and <tvar> natively as they are not normal parser tags".Feb 10 2015, 2:49 PM
marcoil added a project: Blocked-on-Parsoid.
awight added a subscriber: awight.May 8 2015, 3:22 AM
kaldari added a subscriber: kaldari.Dec 1 2015, 3:22 AM

Any update on this?

Any update on this?

A generic extension registration mechanism in Parsoid is one of our (parsing team) goals for this quarter and once that is done, we should be able to work on specific extensions. As noted in T50891#1002436, I think we can implement a parsoid-native version of this which adds annotations to the DOM that marks trees that need to be translated. VE and other tools can then use this to provide translation support by querying the API to manage translations.

I think we can get the extension part done within the next month or so, but the tools to build on top of that is something for VE and language team to work on.

Change 261293 had a related patch set uploaded (by Arlolra):
T50891: Register <translate> and <tvar> natively

https://gerrit.wikimedia.org/r/261293

Change 261293 merged by jenkins-bot:
T50891: Register <translate> and <tvar> natively

https://gerrit.wikimedia.org/r/261293

Arlolra closed this task as "Resolved".Dec 29 2015, 11:39 PM
Arlolra claimed this task.
Arlolra added a subscriber: Arlolra.

Parsoid will now wrap these appropriately with the proper typeof mw:Extension/translate and whatnot.

Support for the Translate extension in VE is still needed, but a separate task.

Add Comment