
RESTBase (?) and core REST API (??) produce different Parsoid HTML
Closed, Resolved, Public

Description

(I've been looking closely at Parsoid HTML while working on a VisualEditor issue https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/756682)

I have this little page on the Beta Cluster: https://en.wikipedia.beta.wmflabs.org/wiki/Table_templated
I can fetch the Parsoid HTML for it here (I believe this uses RESTBase): https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/Table_templated

I have an identical copy of that page and the templates it uses on my local wiki.
I can fetch the Parsoid HTML for it there (using core REST API): http://localhost:3080/w/rest.php/v1/page/Table_templated/html

I noticed that they produce significantly different results: on Beta using RESTBase, some <style> tags in "fosterable positions" are emptied, while on localhost using core API, they are replaced by <link> tags. The core API behavior seems incorrect to me.

Beta Cluster / RESTBase (correct):

<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wikipedia.beta.wmflabs.org/wiki/Special:Redirect/revision/535679"><head prefix="mwr: https://en.wikipedia.beta.wmflabs.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="2404bff0-7fd7-11ec-983b-73a21f35e6e9"/><meta charset="utf-8"/><meta property="mw:pageId" content="270828"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/0"/><meta property="mw:revisionSHA1" content="f4603fa221771e48ecdc96acf8508d13cf20c2a9"/><meta property="dc:modified" content="2022-01-28T01:07:12.000Z"/><meta property="mw:htmlVersion" content="2.4.0"/><meta property="mw:html:version" content="2.4.0"/><link rel="dc:isVersionOf" href="https://en.wikipedia.beta.wmflabs.org/wiki/Table_templated"/><title>Table templated</title><base href="https://en.wikipedia.beta.wmflabs.org/wiki/"/><link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles&amp;only=styles&amp;skin=vector"/><meta http-equiv="content-language" content="en"/><meta http-equiv="vary" content="Accept"/></head><body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section data-mw-section-id="0" id="mwAQ"><span about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"table start","href":"./Template:Table_start"},"params":{},"i":0}},"\n",{"template":{"target":{"wt":"table row","href":"./Template:Table_row"},"params":{"1":{"wt":"Hello"},"2":{"wt":"good"}},"i":1}},"\n",{"template":{"target":{"wt":"table row","href":"./Template:Table_row"},"params":{"1":{"wt":"Goodbye"},"2":{"wt":"bad"}},"i":2}},"\n",{"template":{"target":{"wt":"table row","href":"./Template:Table_row"},"params":{"1":{"wt":"Welcome"},"2":{"wt":"good"}},"i":3}},"\n",{"template":{"target":{"wt":"table end","href":"./Template:Table_end"},"params":{},"i":4}}]}' id="mwAg">
</span><table class="wikitable" about="#mwt1">
<tbody><tr>
<style data-mw-deduplicate="TemplateStyles:r535677" typeof="mw:Extension/templatestyles" about="#mwt4" data-mw='{"name":"templatestyles","attrs":{"src":"Table row/styles.css"}}'>.mw-parser-output .good{background:#9EFF9E}.mw-parser-output .bad{background:#FFC7C7}</style>
<td class="good">Hello</td></tr>
<tr>
<style data-mw-deduplicate="TemplateStyles:r535677" typeof="mw:Extension/templatestyles" about="#mwt7" data-mw='{"name":"templatestyles","attrs":{"src":"Table row/styles.css"}}'></style>
<td class="bad">Goodbye</td></tr>
<tr>
<style data-mw-deduplicate="TemplateStyles:r535677" typeof="mw:Extension/templatestyles" about="#mwt10" data-mw='{"name":"templatestyles","attrs":{"src":"Table row/styles.css"}}'></style>
<td class="good">Welcome</td></tr>
</tbody></table></section></body></html>

localhost / core REST API (incorrect???):

<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="http://localhost:3080/wiki/Special:Redirect/revision/5437"><head prefix="mwr: http://localhost:3080/wiki/Special:Redirect/"><meta charset="utf-8"/><meta property="mw:pageId" content="1848"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/5435"/><meta property="mw:revisionSHA1" content="f4603fa221771e48ecdc96acf8508d13cf20c2a9"/><meta property="dc:modified" content="2022-01-27T23:56:43.000Z"/><meta property="mw:htmlVersion" content="2.4.0"/><meta property="mw:html:version" content="2.4.0"/><link rel="dc:isVersionOf" href="http://localhost:3080/wiki/Table_templated"/><title>Table templated</title><base href="http://localhost:3080/wiki/"/><link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles&amp;only=styles&amp;skin=vector"/><meta http-equiv="content-language" content="en"/><meta http-equiv="vary" content="Accept"/></head><body lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section data-mw-section-id="0"><span about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"table start","href":"./Template:Table_start"},"params":{},"i":0}},"\n",{"template":{"target":{"wt":"table row","href":"./Template:Table_row"},"params":{"1":{"wt":"Hello"},"2":{"wt":"good"}},"i":1}},"\n",{"template":{"target":{"wt":"table row","href":"./Template:Table_row"},"params":{"1":{"wt":"Goodbye"},"2":{"wt":"bad"}},"i":2}},"\n",{"template":{"target":{"wt":"table row","href":"./Template:Table_row"},"params":{"1":{"wt":"Welcome"},"2":{"wt":"good"}},"i":3}},"\n",{"template":{"target":{"wt":"table end","href":"./Template:Table_end"},"params":{},"i":4}}]}'>
</span><table class="wikitable" about="#mwt1">
<tbody><tr>
<style data-mw-deduplicate="TemplateStyles:r5432" typeof="mw:Extension/templatestyles" about="#mwt4" data-mw='{"name":"templatestyles","attrs":{"src":"Table row/styles.css"}}'>.mw-parser-output .good{background:#9EFF9E}.mw-parser-output .bad{background:#FFC7C7}</style>
<td class="good">Hello</td></tr>
<tr>
<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r5432"/>
<td class="bad">Goodbye</td></tr>
<tr>
<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r5432"/>
<td class="good">Welcome</td></tr>
</tbody></table></section></body></html>
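
To check quickly which of the two variants a given response contains, here is a minimal sketch that counts the three representations seen in the dumps above: a full <style data-mw-deduplicate>, an emptied one, and a <link rel="mw-deduplicated-inline-style">. The class and function names are made up for illustration; it uses only Python's standard html.parser and has not been run against real endpoints.

```python
from html.parser import HTMLParser


class DedupMarkerCounter(HTMLParser):
    """Counts how repeated TemplateStyles blocks are represented in Parsoid HTML."""

    def __init__(self):
        super().__init__()
        self.full_styles = 0   # <style data-mw-deduplicate> with CSS inside
        self.empty_styles = 0  # <style data-mw-deduplicate> with no content
        self.dedup_links = 0   # <link rel="mw-deduplicated-inline-style">
        self._in_style = False
        self._style_has_text = False

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link .../> also end up here, because the
        # default handle_startendtag() delegates to handle_starttag().
        attrs = dict(attrs)
        if tag == "style" and "data-mw-deduplicate" in attrs:
            self._in_style = True
            self._style_has_text = False
        elif tag == "link" and attrs.get("rel") == "mw-deduplicated-inline-style":
            self.dedup_links += 1

    def handle_data(self, data):
        if self._in_style and data.strip():
            self._style_has_text = True

    def handle_endtag(self, tag):
        if tag == "style" and self._in_style:
            if self._style_has_text:
                self.full_styles += 1
            else:
                self.empty_styles += 1
            self._in_style = False


def count_dedup_markers(html):
    """Return (full_styles, empty_styles, dedup_links) for an HTML string."""
    counter = DedupMarkerCounter()
    counter.feed(html)
    counter.close()
    return counter.full_styles, counter.empty_styles, counter.dedup_links
```

Fed the two dumps above, this should report one full style plus two emptied styles for the Beta Cluster output, and one full style plus two links for the localhost output.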

I am slightly confused by the abundance of APIs, and I'm not sure whether this is a bug or not. However, if VisualEditor started using the HTML produced on my localhost by the core REST API, then this would cause issues like T299767, which would be Very Bad. Therefore I am complaining.

(On my localhost, VisualEditor currently receives the same HTML as in the RESTBase example, from some different API – I'm not sure where it's even coming from, so it's alright for now, but I worry it'll get consolidated at some point and the issue will appear.)

Event Timeline

There are actually 3 ways to get Parsoid HTML right now.

Two of them, RESTBase and Parsoid's own REST API, actually agree with each other:

RESTBase actually calls https://en.wikipedia.beta.wmflabs.org/w/rest.php/en.wikipedia.beta.wmflabs.org/v3/page/pagebundle/Table%20templated, extracts the HTML blob, and stores it in RESTBase, where it is then accessible via the proxied endpoint above. In this HTML blob, data-parsoid attributes have been stripped and replaced with id attributes, and RESTBase adds a mw:TimeUuid meta tag to the head. Apart from this, the HTML in these 2 cases is identical.
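
If you want to diff the outputs without that noise, a rough sketch along these lines could normalize them first. It drops the TimeUuid meta tag, any data-parsoid attribute, and id attributes that look like Parsoid-generated ones. The "mw" prefix pattern for ids is an assumption based on the ids visible in the dump above (mwAA, mwAQ, mwAg), and regex-based stripping is only good enough for eyeballing a diff, not for real HTML processing.

```python
import re

# <meta property="mw:TimeUuid" content="..."/> as added by RESTBase
TIME_UUID_META = re.compile(r'<meta property="mw:TimeUuid" content="[^"]*"\s*/?>')

# Parsoid-private attribute, quoted with either ' or "
DATA_PARSOID_ATTR = re.compile(r"""\sdata-parsoid=(?:'[^']*'|"[^"]*")""")

# Assumption: generated ids look like id="mwAA"; hand-written ids are left alone
PARSOID_ID_ATTR = re.compile(r'\sid="mw[A-Za-z0-9_-]+"')


def normalize_parsoid_html(html):
    """Strip the parts that are expected to differ between the endpoints."""
    html = TIME_UUID_META.sub("", html)
    html = DATA_PARSOID_ATTR.sub("", html)
    html = PARSOID_ID_ATTR.sub("", html)
    return html
```

After normalizing both sides, any remaining difference (such as the emptied style tags versus link tags) is a real difference in the HTML.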

So, for local development, if you don't have RESTBase, use the Parsoid REST API. This is no different from how it was in Parsoid/JS land: it had to expose a REST API for RESTBase to call into, since they are different services.

Now, for the 3rd way to get Parsoid HTML, which is the core REST API:

This one seems to strip the data-parsoid attribute as well, but doesn't add an equivalent id attribute. And yes, there does seem to be the difference that you noted. I don't know why this is yet. I imagine some config setting is different, or the config objects are constructed differently. We'll look into it, but this is likely some edge case.

But, as I said before, for local development and debugging of VE / DiscussionTools issues, either use RESTBase or Parsoid's REST API. They should be identical (with the data-parsoid caveat above, but VE & DT should never use that attribute anyway since it is considered to be a Parsoid-private attribute that could be changed without notice) and reflect what is used in production.
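
For reference, the three URL shapes mentioned in this task can be sketched as below. The function name and dictionary keys are made up for illustration, and the /w script path and /api/rest_v1 prefix are simply copied from the URLs quoted above, so they may differ on other installations.

```python
from urllib.parse import quote


def parsoid_html_urls(base_url, domain, title):
    """Build the three URLs discussed in this task for one page title."""
    base = base_url.rstrip("/")
    # RESTBase and the core REST API URLs above use underscores in the title;
    # the Parsoid REST API URL above uses a percent-encoded space.
    underscored = quote(title.replace(" ", "_"), safe="")
    encoded = quote(title, safe="")
    return {
        "restbase": f"{base}/api/rest_v1/page/html/{underscored}",
        "parsoid": f"{base}/w/rest.php/{domain}/v3/page/pagebundle/{encoded}",
        "core": f"{base}/w/rest.php/v1/page/{underscored}/html",
    }
```

Note that the Parsoid REST API URL returns a pagebundle rather than bare HTML, so the HTML still has to be extracted from the response.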

localhost / core REST API (incorrect???):

This seems to be calling getText
https://github.com/wikimedia/mediawiki/blob/master/includes/Rest/Handler/PageHTMLHandler.php#L84

with 'deduplicateStyles' => true,
https://github.com/wikimedia/mediawiki/blob/master/includes/parser/ParserOutput.php#L476-L503

In the former case you're getting the raw output from Parsoid; the core REST API is doing some post-processing on that HTML.

Thanks. So… should that use 'deduplicateStyles' => false, instead? Because Parsoid already deduplicates styles using a slightly different implementation?
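
To illustrate why running both passes gives the output seen on localhost, here is a toy model of the two strategies. This is not the actual Parsoid or ParserOutput code, just a regex-based sketch with made-up function names: the "Parsoid-style" pass empties repeated style tags, and the "core-style" pass replaces repeats with a link tag. The regex assumes no ">" appears inside attribute values, which holds for the dumps above but not for HTML in general.

```python
import re

# Matches <style ... data-mw-deduplicate="KEY" ...>CSS</style>
STYLE_RE = re.compile(
    r'<style\b(?P<attrs>[^>]*\bdata-mw-deduplicate="(?P<key>[^"]*)"[^>]*)>'
    r'(?P<css>.*?)</style>',
    re.DOTALL,
)


def dedupe_parsoid_style(html):
    """Keep every <style> element, but empty all repeats of the same key."""
    seen = set()

    def repl(match):
        key = match.group("key")
        if key in seen:
            return f'<style{match.group("attrs")}></style>'
        seen.add(key)
        return match.group(0)

    return STYLE_RE.sub(repl, html)


def dedupe_core_style(html):
    """Replace repeats of the same key with a <link> placeholder."""
    seen = set()

    def repl(match):
        key = match.group("key")
        if key in seen:
            return f'<link rel="mw-deduplicated-inline-style" href="mw-data:{key}"/>'
        seen.add(key)
        return match.group(0)

    return STYLE_RE.sub(repl, html)
```

In this model, applying dedupe_core_style to HTML that already went through dedupe_parsoid_style turns the emptied style tags into link tags, which matches the difference between the two dumps in the description.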

matmarex claimed this task.
matmarex edited projects, added MediaWiki-REST-API; removed Parsoid.

Change 757987 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] PageHTMLHandler: Do not de-duplicate styles in Parsoid HTML

https://gerrit.wikimedia.org/r/757987

Change 757987 merged by jenkins-bot:

[mediawiki/core@master] PageHTMLHandler: Do not de-duplicate styles in Parsoid HTML

https://gerrit.wikimedia.org/r/757987