Evaluate: <style> tag holding template definition
Open, Medium, Public

Description

We observed a special case in the DOM spec where style tags are used for semantic content. The HTML for the page https://en.wikipedia.org/wiki/User:Jpita23/test, when obtained from RESTBase (https://en.wikipedia.org/api/rest_v1/page/html/User%3AJpita23%2Ftest), has the following content:

<style data-mw-deduplicate="TemplateStyles:r886058088" typeof="mw:Extension/templatestyles mw:Transclusion" 
about="#mwt2" data-mw='{"parts":[{"template":{"target":{"wt":"ISBN","href":"./Template:ISBN"},"params":{"1":{"wt":"978-953-51-1197-9"}},"i":0}}]}' id="mwBA">

.mw-parser-output cite.citation{font-style:inherit}.mw-parser-output .citation q{quotes:"\"""\"""'""'"}
....
</style>
<a rel="mw:WikiLink" href="./International_Standard_Book_Number" title="International Standard Book Number" about="#mwt2">ISBN</a>
<span typeof="mw:Entity" about="#mwt2"> </span>
<a rel="mw:WikiLink" href="./Special:BookSources/978-953-51-1197-9" title="Special:BookSources/978-953-51-1197-9" about="#mwt2" id="mwBQ">978-953-51-1197-9</a>

The corresponding wikitext is {{ISBN|978-953-51-1197-9}}

As you can see, the core definition of the template is in the data-mw attribute of the style tag, which has the RDFa attribute typeof="mw:Extension/templatestyles mw:Transclusion".
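To make the point concrete, here is a minimal sketch (not cxserver code) that scans Parsoid HTML for elements whose typeof includes mw:Transclusion and reads the template name and parameters from data-mw. The class and function names are made up for illustration, and SAMPLE is a shortened copy of the snippet above. It uses only Python's standard library.

```python
import json
from html.parser import HTMLParser


class TransclusionFinder(HTMLParser):
    """Collect (tag, about, template name, params) for mw:Transclusion nodes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # typeof is a space-separated list of RDFa types.
        types = (attr.get("typeof") or "").split()
        if "mw:Transclusion" not in types or not attr.get("data-mw"):
            return
        data_mw = json.loads(attr["data-mw"])
        for part in data_mw.get("parts", []):
            # Parts can also be plain strings; only template entries matter here.
            if isinstance(part, dict) and "template" in part:
                tpl = part["template"]
                params = {name: value.get("wt")
                          for name, value in tpl.get("params", {}).items()}
                self.found.append(
                    (tag, attr.get("about"), tpl.get("target", {}).get("wt"), params)
                )


def find_transclusions(html_text):
    finder = TransclusionFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found


# Shortened version of the ISBN example (hypothetical test input).
SAMPLE = """<style data-mw-deduplicate="TemplateStyles:r886058088" typeof="mw:Extension/templatestyles mw:Transclusion" about="#mwt2" data-mw='{"parts":[{"template":{"target":{"wt":"ISBN","href":"./Template:ISBN"},"params":{"1":{"wt":"978-953-51-1197-9"}},"i":0}}]}' id="mwBA">.mw-parser-output cite.citation{font-style:inherit}</style><a rel="mw:WikiLink" href="./International_Standard_Book_Number" about="#mwt2">ISBN</a>"""

print(find_transclusions(SAMPLE))
```

For this input the only node reported is the style element, which is exactly why dropping style tags loses the {{ISBN}} definition.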

We had been removing style tags because they are irrelevant for translation, and as a side effect ended up removing ISBN templates (T217585: CX2: ISBN doubled, one correctly formatted with {{ISBN}}, another incorrectly formatted with [[Special:BookSources]]). We fixed that as a special case, but we keep hitting issues in our parsing logic because we can no longer ignore style tags and have to parse them as semantic content.

This ticket is to evaluate the situation where a style tag holds semantic content and to explore whether we can have a cleaner definition.

Event Timeline

ssastry triaged this task as Medium priority. Oct 31 2019, 4:34 PM
ssastry added a project: Parsoid.

Another example, where this causes an issue in translating quotes:

Wikitext:

{{Quote|text=To be or not to be|author=William Shakespeare|source=Hamlet}}

HTML:

<section data-mw-section-id="0" id="mwAQ">
<style data-mw-deduplicate="TemplateStyles:r960796168"
    typeof="mw:Extension/templatestyles mw:Transclusion" about="#mwt1"
    data-mw='{"parts":[{"template":{"target":{"wt":"Quote","href":"./Template:Quote"},"params":{"text":{"wt":"To be or not to be"},"author":{"wt":"William Shakespeare"},"source":{"wt":"Hamlet"}},"i":0}}]}'
    id="mwAg">
.mw-parser-output .templatequote{
    overflow:hidden;
    margin:1em 0;
    padding:0 40px
} 
</style>
<blockquote class="templatequote" about="#mwt1" id="mwAw">
    <p>To be or not to be</p><div class="templatequotecite"><span typeof="mw:Entity"></span><cite>William Shakespeare, Hamlet</cite></div>
</blockquote>
</section>

Expected result from translating from en to eo:

<style about="#mwt1" data-cx="[{&#34;adapted&#34;:true,&#34;partial&#34;:false,&#34;targetExists&#34;:true}]" data-mw="{&#34;parts&#34;:[{&#34;template&#34;:{&#34;target&#34;:{&#34;wt&#34;:&#34;Citaĵo&#34;,&#34;href&#34;:&#34;./Ŝablono:Citaĵo&#34;},&#34;params&#34;:{&#34;teksto&#34;:{&#34;wt&#34;:&#34;To be or not to be&#34;},&#34;aŭtoro&#34;:{&#34;wt&#34;:&#34;William Shakespeare&#34;},&#34;verko&#34;:{&#34;wt&#34;:&#34;Hamlet&#34;}},&#34;i&#34;:0}}]}" data-mw-deduplicate="TemplateStyles:r960796168" id="mwAg" typeof="mw:Extension/templatestyles mw:Transclusion">
.mw-parser-output .templatequote{
    overflow:hidden;
    margin:1em 0;
    padding:0 40px
}
.mw-parser-output .templatequote .templatequotecite{
    line-height:1.5em;
    text-align:left;
    padding-left:1.6em;
    margin-top:0
}
</style>
<blockquote about="#mwt1" class="templatequote" id="mwAw">
    <p>To be or not to be</p><div class="templatequotecite"><span typeof="mw:Entity"></span><cite>William Shakespeare, Hamlet</cite></div>
</blockquote>

Another example, where this causes an issue in translating quotes

What is the issue here?

The Parsoid contract is that you cannot strip tags that carry template information, so stripping them will cause problems.

But, apart from that, one option for Parsoid is to add a synthetic tag as the first node that holds the transclusion information. We are evaluating the specific choice of node. @Arlolra proposed a meta tag as a potentially safe candidate. Separately, we are evaluating whether we should do that always (vs. only in special circumstances such as this). So, that is one possible solution.
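The synthetic-node idea can be sketched roughly as follows. This is only an illustration of the proposal as described in this thread, not Parsoid's implementation: it moves mw:Transclusion and data-mw from the carrier's attributes onto a new meta tag and leaves any other typeof values on the original element. The function names are hypothetical, and id assignment for the new node is left out.

```python
import html


def split_transclusion(attrs):
    """Split carrier attributes into (synthetic <meta> attrs, remaining attrs).

    Returns (None, attrs copy) when the element is not a transclusion carrier.
    """
    types = attrs.get("typeof", "").split()
    if "mw:Transclusion" not in types:
        return None, dict(attrs)

    # The synthetic node takes over the transclusion role and its data.
    meta = {
        "typeof": "mw:Transclusion",
        "about": attrs["about"],
        "data-mw": attrs["data-mw"],
    }

    # The original element keeps everything else, including other types.
    rest = {k: v for k, v in attrs.items() if k != "data-mw"}
    remaining_types = [t for t in types if t != "mw:Transclusion"]
    if remaining_types:
        rest["typeof"] = " ".join(remaining_types)
    else:
        rest.pop("typeof", None)
    return meta, rest


def render_start_tag(tag, attrs):
    """Serialize a start tag, escaping attribute values."""
    parts = [tag] + [f'{k}="{html.escape(v, quote=True)}"' for k, v in attrs.items()]
    return "<" + " ".join(parts) + ">"
```

With this split, a consumer that ignores style tags would still see the template definition, provided the synthetic node itself survives any sanitization step.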

Change 640285 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] WIP: Support adaptation of templates with templatestyle holding definition

https://gerrit.wikimedia.org/r/640285

But, apart from that, one option for Parsoid is to add a synthetic tag as the first node that holds the transclusion information. We are evaluating the specific choice of node. @Arlolra proposed a meta tag as a potentially safe candidate. Separately, we are evaluating whether we should do that always (vs. only in special circumstances such as this). So, that is one possible solution.

Note that if we do this, the style tag will still have a typeof for TemplateStyles. But, other than that, if we pursue this solution, would it help your use cases?

Yes, it will help. CX ignores (does not parse, adapt, or translate) all styles, since they are irrelevant across wikis. The problem here is that we need to add exceptions to that "skip style tags" rule, because some style tags hold semantic information about a template. That is what we are trying to do as a short-term solution in cxserver. But it is tricky, because our strict DOMPurify setup removes style tags from the content when it goes to an external machine translation service and comes back (we strictly validate content coming from external services for security reasons).
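The exception to the "skip style tags" rule could look roughly like the sketch below. It is an illustration in Python rather than the actual cxserver change, and the names are hypothetical: style elements are dropped unless their typeof includes mw:Transclusion, and everything else is re-emitted as-is.

```python
from html.parser import HTMLParser


def carries_transclusion(attrs):
    """True if the RDFa typeof list includes mw:Transclusion."""
    return "mw:Transclusion" in (dict(attrs).get("typeof") or "").split()


class StyleStripper(HTMLParser):
    """Re-emit markup, dropping <style> elements that hold no template definition."""

    def __init__(self):
        # Leave character references untouched so the output stays valid markup.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.dropping = False

    def handle_starttag(self, tag, attrs):
        if tag == "style" and not carries_transclusion(attrs):
            self.dropping = True
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "style" and self.dropping:
            self.dropping = False
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.dropping:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_plain_styles(html_text):
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

This covers only the first half of the problem. A style tag kept this way can still be removed later by a sanitizer that rejects all style elements, which is the round-trip issue described above.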

If it is a meta tag, will it look like this?

<section data-mw-section-id="0" id="mwAQ">
<meta typeof="mw:Transclusion" about="#mwt1"
    data-mw='{"parts":[{"template":{"target":{"wt":"Quote","href":"./Template:Quote"},"params":{"text":{"wt":"To be or not to be"},"author":{"wt":"William Shakespeare"},"source":{"wt":"Hamlet"}},"i":0}}]}'
    id="mwAg"/> 
<style data-mw-deduplicate="TemplateStyles:r960796168"
    typeof="mw:Extension/templatestyles" about="#mwt1"
    id="mwAh">
.mw-parser-output .templatequote{
    overflow:hidden;
    margin:1em 0;
    padding:0 20px
} 
</style>
<blockquote class="templatequote" about="#mwt1" id="mwAw">
    <p>To be or not to be</p><div class="templatequotecite"><span typeof="mw:Entity"></span><cite>William Shakespeare, Hamlet</cite></div>
</blockquote>
</section>

If so, it is problematic, since a default DOMPurify configuration will remove the meta tag (you can try that with the above HTML at https://cure53.de/purify). Note that even if the HTML follows the Parsoid spec, it needs to be sent to external HTML machine translation systems, and if such services sanitize their input, or we sanitize the result, we should not lose important information. So please consider this angle when you decide on a solution.

If it is a meta tag, will it look like this?

Something like that, yes.

I checked some snippets there and it also removes about ids. But, that is a problem as well. How do you handle that?

Besides about ids, DOMPurify also removes link tags (which are also part of Parsoid output).

Right now, Parsoid adds transclusion info to the first output element of a template. So, if that happens to be a link or meta tag, then DOMPurify can drop valid info as well.

How do you handle that?

In any case, it looks like these issues need some clarity before we make any changes on the Parsoid end.
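One way to get that clarity is to measure how often transclusion info sits on a tag that a sanitizer would drop. The sketch below assumes the at-risk set is style, meta, and link, which is only what this thread reports for DOMPurify and has not been verified against any particular DOMPurify configuration. All names are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: tags reported in this thread as removed by the sanitizer.
AT_RISK_TAGS = frozenset({"style", "meta", "link"})


class CarrierScanner(HTMLParser):
    """Record which element carries each transclusion's data-mw."""

    def __init__(self):
        super().__init__()
        self.carriers = []  # (about id, tag name) in document order

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "mw:Transclusion" in (attr.get("typeof") or "").split():
            self.carriers.append((attr.get("about"), tag))


def at_risk_transclusions(html_text, risky=AT_RISK_TAGS):
    """Return (about, tag) pairs whose carrier element is in the risky set."""
    scanner = CarrierScanner()
    scanner.feed(html_text)
    scanner.close()
    return [(about, tag) for about, tag in scanner.carriers if tag in risky]
```

Running this over a sample of Parsoid output before and after a sanitization pass would show whether the loss is limited to TemplateStyles or affects other templates whose first output node is a link or meta tag.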

Change 640285 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Support adaptation of templates with templatestyle holding definition

https://gerrit.wikimedia.org/r/640285

Change 699089 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Add support for Elia MT to cxserver

https://gerrit.wikimedia.org/r/699089

Change 699089 merged by jenkins-bot:

[operations/deployment-charts@master] Add support for Elia MT to cxserver

https://gerrit.wikimedia.org/r/699089

Mentioned in SAL (#wikimedia-operations) [2021-06-21T05:50:31Z] <kart_> cxserver: Added support for Elia MT + Updated to 2021-06-10-074331-production (T276059, T275803, T276246, T283513, T255231, T237028)

Does anything still need to be done here?