Page MenuHomePhabricator

Parsoid removes some non-optional numeric template parameters
Closed, ResolvedPublicBUG REPORT

Description

Given this wikitext:

{{1x|1=<div class="a">a</div>}}

Parsoid removes the parameter name:

{{1x|<div class="a">a</div>}}

This destroys the wikitext and makes round-trip tests fail.

The reason for this behavior is that Parsoid actively strips numeric parameter names when they are not needed. While this is expected and correct behavior, it is not when:

  1. The content contains an equal sign and requires the parameter to be named because of this. It looks like the existing str_contains( …, '=' ) checks Parsoid does in a few places are not sufficient.
  2. When the original wikitext actually contained the parameter name, it shouldn't be removed. While the wikitext is functionally identical, we usually consider this a dirty diff and try to avoid it.

I run into this when I tried to construct parser test cases in Cite.

Event Timeline

Change #1171232 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/services/parsoid@master] Fix TemplateEncapsulator dropping numeric parameter names

https://gerrit.wikimedia.org/r/1171232

Change #1171234 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Cite@master] [WIP] Add class names to make parser tests more readable

https://gerrit.wikimedia.org/r/1171234

I need to understand more how to reproduce this, because I see this on the CLI:

╭─subbu@earth ~/work/wmf/parsoid  ‹master*› 
╰─➤  echo '{{1x|1=<div class="a">a</div>}}' | php bin/parse.php --wt2wt
{{1x|1=<div class="a">a</div>}}

I cannot reproduce this via parser tests either

╰─➤  git diff tests/parser/regressions.txt | more
diff --git a/tests/parser/regressions.txt b/tests/parser/regressions.txt
index 7b13c3e43..441412da5 100644
--- a/tests/parser/regressions.txt
+++ b/tests/parser/regressions.txt
@@ -199,3 +199,12 @@ parsoid=wt2html
 <table data-parsoid='{"stx":"html"}'><tbody><tr data-parsoid='{"stx":"html"}'><td data-parsoid='{"stx":"html"}'><i>a</i></td><td title="&amp;" data-parsoid='{"stx":"html","a":{"title":"&amp;"},"sa":{"title":"&a
mp;amp;"}}'>b</td></tr></tbody></table>
 <link rel="mw:PageProp/Category" href="./Category:C" />
 !! end
+
+!! test
+T400080 test
+!! wikitext
+{{1x|1=<div class="a">a</div>}}
+!! html/parsoid
+<div class="a" about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","pi":[[{"k":"1","named":true}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"1x","href":"./Template:1x"},"params":{"1":{"wt
":"&lt;div class=\"a\">a&lt;/div>"}},"i":0}}]}'>a</div>
+!! end
+
╭─subbu@earth ~/work/wmf/parsoid  ‹master*› 
╰─➤  php bin/parserTests.php --filter T400 tests/parser/regressions.txt 
Loaded known failures from /home/subbu/work/wmf/parsoid/tests/parser/regressions-standalone-knownFailures.json
EXPECTED PASS: T400080 test (wt2html)
EXPECTED PASS: T400080 test (wt2wt)
EXPECTED PASS: T400080 test (html2html)
EXPECTED PASS: T400080 test (html2wt)

https://gerrit.wikimedia.org/r/1171234/2 demonstrates the issue. Both html2wt as well as the html2html roundtrip failed (on patchset 2).

This is the Parsoid HTML output I got:

<div class="a" about="#mwt1" typeof="mw:Transclusion" id="mwAg" data-mw="{&quot;parts&quot;:[{&quot;template&quot;:{&quot;target&quot;:{&quot;wt&quot;:&quot;1x&quot;,&quot;href&quot;:&quot;./Template:1x&quot;},&quot;params&quot;:{&quot;1&quot;:{&quot;wt&quot;:&quot;<div class=\&quot;a\&quot;>a</div>&quot;}},&quot;i&quot;:0}}]}">a</div>

When I look at this HTML there really is not enough information to decide if the 1= was part of the original wikitext or not. I was wondering if this was a mistake.

I can see the additional data-parsoid and the {"k":"1","named":true} in your example. That's obviously what I was looking for. I know the parser test runner actively strips data-parsoid because it's almost never relevant. What I don't understand is why I never see it, not even in the unstripped HTML?

I did two things now:

  • I manually added the data-parsoid from your example to my patch. However, I'm not sure if this is valid or if I'm "faking" the test now.
  • I turned https://gerrit.wikimedia.org/r/1171232 into an experiment to see if we can potentially merge the two spots in data-parsoid and data-mw that store almost the same information (that is the original wikitext of a parameter name that might not survive the roundtrip otherwise). What do you think? Would that be worth it when it potentially minimizes the complexity of the code, without changing anything else?

https://www.mediawiki.org/w/index.php?title=User%3ASSastry_%28WMF%29%2FSandbox&diff=7765339&oldid=7765335 shows that VE/Parsoid handle edits to this properly.

This is likely a tooling issue -- we'll have to see why the data-parsoid attribute is getting normalized away. I can reproduce this normalization when I run echo '{{1x|1=<div class="a">a</div>}}' | php maintenance/parse.php --parsoid in the mediawiki core repository. I am sure core parser test runner is also doing some normalization and not emitting unnormalized output.

I mean, when I look at an example page like https://en.wikipedia.beta.wmcloud.org/wiki/Conflict-title-0.27539716256828073-Iñtërnâtiônàlizætiøn?useparsoid=1 there is no data-parsoid. Is this actively stripped and only in the API output?

I mean, when I look at an example page like https://en.wikipedia.beta.wmcloud.org/wiki/Conflict-title-0.27539716256828073-Iñtërnâtiônàlizætiøn?useparsoid=1 there is no data-parsoid. Is this actively stripped and only in the API output?

Data-parsoid is parsoid-internal information. So, it is not actually shipped to clients but stashed server side on edit requests. On read requests, it is not shipped. Same with API requests.

EDIT: https://www.mediawiki.org/wiki/Parsoid/Internals/data-parsoid is where we document what data-parsoid contains and that it is internal use only.
What you instead see are element ids associated with all elements and the data-parsoid stashed server side are keyed on element ids. On edit submit, the stashed data-parsoid is send to Parsoid with the stashed original html, and the edited html.

But, as @Arlolra pointed out, there is actually a html2wt wikitext escaping bug since Parsoid should not introduce semantic diffs when HTML without data-parsoid is submitted to the API. It can introduce dirty syntactic diffs however.

So, there is an edge case bug in the html2wt wikitext escaping code.

Change #1172373 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] [WIP] Switch to serialized as named

https://gerrit.wikimedia.org/r/1172373

Change #1172421 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] [WIP] Fix breaking on equals

https://gerrit.wikimedia.org/r/1172421

Change #1172373 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Switch to named when numbered tpl arg has tag with attr

https://gerrit.wikimedia.org/r/1172373

Change #1173436 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Parser functions don't need equals escaping

https://gerrit.wikimedia.org/r/1173436

Change #1173436 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Parser functions don't need equals escaping

https://gerrit.wikimedia.org/r/1173436

Change #1175601 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a15

https://gerrit.wikimedia.org/r/1175601

Change #1175601 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a15

https://gerrit.wikimedia.org/r/1175601

Change #1175955 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] [WIP] Reapply "Switch to named when numbered tpl arg has tag with attr"

https://gerrit.wikimedia.org/r/1175955

Change #1171232 abandoned by Thiemo Kreuz (WMDE):

[mediawiki/services/parsoid@master] [POC] Replace data-parsoid→named flag with data-mw→key→wt

https://gerrit.wikimedia.org/r/1171232

Change #1175955 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Reapply "Switch to named when numbered tpl arg has tag with attr"

https://gerrit.wikimedia.org/r/1175955

Change #1185118 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a20

https://gerrit.wikimedia.org/r/1185118

Change #1185118 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a20

https://gerrit.wikimedia.org/r/1185118

Change #1186634 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Consider all pf arguments as positional

https://gerrit.wikimedia.org/r/1186634

Change #1172421 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Fix breaking tpl args on equals in html tags

https://gerrit.wikimedia.org/r/1172421

Change #1187074 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/extensions/Cite@master] Disable tests to update output

https://gerrit.wikimedia.org/r/1187074

Change #1187086 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/extensions/Cite@master] Re-enable tests after Parsoid bump in vendor

https://gerrit.wikimedia.org/r/1187086

Change #1187116 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/extensions/Scribunto@master] Disable tests to update output

https://gerrit.wikimedia.org/r/1187116

Change #1187117 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/extensions/Scribunto@master] Re-enable tests after Parsoid bump in vendor

https://gerrit.wikimedia.org/r/1187117

Change #1187121 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Disable tests to update output

https://gerrit.wikimedia.org/r/1187121

Change #1187122 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Re-enable tests after Parsoid bump in vendor

https://gerrit.wikimedia.org/r/1187122

Change #1187074 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Disable tests to update output

https://gerrit.wikimedia.org/r/1187074

Change #1187116 merged by jenkins-bot:

[mediawiki/extensions/Scribunto@master] Disable tests to update output

https://gerrit.wikimedia.org/r/1187116

Change #1187121 merged by jenkins-bot:

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Disable tests to update output

https://gerrit.wikimedia.org/r/1187121

Change #1186634 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Consider all pf arguments as positional

https://gerrit.wikimedia.org/r/1186634

Change #1187841 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a21

https://gerrit.wikimedia.org/r/1187841

Change #1187841 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a21

https://gerrit.wikimedia.org/r/1187841

Change #1187122 merged by jenkins-bot:

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Re-enable tests after Parsoid bump in vendor

https://gerrit.wikimedia.org/r/1187122

Change #1187117 merged by jenkins-bot:

[mediawiki/extensions/Scribunto@master] Re-enable tests after Parsoid bump in vendor

https://gerrit.wikimedia.org/r/1187117

Change #1187086 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Re-enable tests after Parsoid bump in vendor

https://gerrit.wikimedia.org/r/1187086