Page MenuHomePhabricator

Off-by-1 DSR computation edge case error (that leads to subsequent invalid utf-8 errors in template wrapping)
Closed, ResolvedPublic

Description

Reduced test case:

[subbu@earth:~/work/wmf/parsoid] cat /tmp/wt
*<div>{{1x|
*c
}}</div>d
[subbu@earth:~/work/wmf/parsoid] php bin/parse.php --wt2wt < /tmp/wt
*<div>{{1x|
*c
}}</div>dd

[subbu@earth:~/work/wmf/parsoid] php bin/parse.php < /tmp/wt
<ul about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"dsr":[0,24,0,0],"firstWikitextNode":"UL","pi":[[{"k":"1"}]]}' data-mw='{"parts":["*&lt;div>",{"template":{"target":{"wt":"1x","href":"./Template:1x"},"params":{"1":{"wt":"\n*c\n"}},"i":0}},"&lt;/div>d"]}'><li><div></div></li>
<li>c<meta typeof="mw:Placeholder/StrippedTag"/></li></ul><span about="#mwt1">
</span><p data-parsoid='{"dsr":[24,24,0,0]}'>d</p>

The list has got offset [0,24] whereas it should be [0,23]. If that 'd' was a multi-byte character, then template wrapping would try to take a substring in the middle of the character causing it to trigger a UTF-8 error.

See below:

[subbu@earth:~/work/wmf/parsoid] cat /tmp/wt
*<div>{{1x|
*c
}}</div>の

[subbu@earth:~/work/wmf/parsoid] php bin/parse.php < /tmp/wt
Wikimedia\Assert\InvariantException from line 224 of /home/subbu/work/wmf/parsoid/vendor/wikimedia/assert/src/Assert.php: Invariant failed: Bad UTF-8 at end of string (3 byte sequence)
#0 /home/subbu/work/wmf/parsoid/src/Utils/PHPUtils.php(218): Wikimedia\Assert\Assert::invariant()
#1 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(963): Wikimedia\Parsoid\Utils\PHPUtils::safeSubstr()
#2 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(1226): Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapTemplates::encapsulateTemplates()
#3 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(1239): Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapTemplates::wrapTemplatesInTree()
#4 /home/subbu/work/wmf/parsoid/src/Wt2Html/DOMPostProcessor.php(158): Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapTemplates->run()
#5 /home/subbu/work/wmf/parsoid/src/Wt2Html/DOMPostProcessor.php(837): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->Wikimedia\Parsoid\Wt2Html\{closure}()
#6 /home/subbu/work/wmf/parsoid/src/Wt2Html/DOMPostProcessor.php(887): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->doPostProcess()
#7 /home/subbu/work/wmf/parsoid/src/Wt2Html/DOMPostProcessor.php(904): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->process()
#8 /home/subbu/work/wmf/parsoid/src/Wt2Html/ParserPipeline.php(178): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->processChunkily()
#9 /home/subbu/work/wmf/parsoid/src/Wt2Html/ParserPipelineFactory.php(310): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseChunkily()
#10 /home/subbu/work/wmf/parsoid/src/Core/WikitextContentModelHandler.php(106): Wikimedia\Parsoid\Wt2Html\ParserPipelineFactory->parse()
#11 /home/subbu/work/wmf/parsoid/src/Parsoid.php(162): Wikimedia\Parsoid\Core\WikitextContentModelHandler->toDOM()
#12 /home/subbu/work/wmf/parsoid/src/Parsoid.php(194): Wikimedia\Parsoid\Parsoid->parseWikitext()
#13 /home/subbu/work/wmf/parsoid/bin/parse.php(327): Wikimedia\Parsoid\Parsoid->wikitext2html()
#14 /home/subbu/work/wmf/parsoid/bin/parse.php(584): Parse->wt2Html()
#15 /home/subbu/work/wmf/parsoid/tools/doMaintenance.php(53): Parse->execute()
#16 /home/subbu/work/wmf/parsoid/bin/parse.php(620): require_once('/home/subbu/wor...')
#17 {main}

This is probably another source of UTF-8 production error logged in T269749: WrapTemplates: UTF-8 errors

Event Timeline

ssastry triaged this task as Medium priority.Mar 14 2021, 5:42 PM

This looks more like a list-wrapping error where the list wrapper treats the </div> as part of the list item. This causes the newline to be swapped from inside the list to outside the list and confuses the DSR algo later. It even helpfully warns about this.

[subbu@earth:~/work/wmf/parsoid] cat /tmp/wt
*<div>
*a
</div>b
[subbu@earth:~/work/wmf/parsoid] php bin/parse.php --trace html < /tmp/wt
0-[HTML]       | {"type":"TagTk","name":"ul","attribs":[],"dataAttribs":{"tsr":[0,0],"tmp":{"tagId":1}}}
0-[HTML]       | {"type":"TagTk","name":"li","attribs":[],"dataAttribs":{"tsr":[0,1],"tmp":{"tagId":2}}}
0-[HTML]       | {"type":"TagTk","name":"div","attribs":[],"dataAttribs":{"tsr":[1,6],"stx":"html","tmp":{"tagId":3}}}
0-[HTML]       | {"type":"EndTagTk","name":"li","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[6,7],"tmp":{}}}
0-[HTML]       | {"type":"TagTk","name":"li","attribs":[],"dataAttribs":{"tsr":[7,8],"tmp":{"tagId":4}}}
0-[HTML]       | "a"
0-[HTML]       | {"type":"EndTagTk","name":"div","attribs":[],"dataAttribs":{"tsr":[10,16],"stx":"html","tmp":{}}}
0-[HTML]       | {"type":"EndTagTk","name":"li","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"EndTagTk","name":"ul","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[9,10],"tmp":{}}}
0-[HTML]       | {"type":"TagTk","name":"p","attribs":[],"dataAttribs":{"tmp":{"tagId":5}}}
0-[HTML]       | "b"
0-[HTML]       | {"type":"EndTagTk","name":"p","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[17,18],"tmp":{}}}
0-[HTML]       | {"type":"EOFTk"}
[warn/dsr/inconsistent] DSR inconsistency: cs/s mismatch for node: li s:8 ; cs:9
<ul data-parsoid='{"dsr":[0,16,0,0]}'><li data-parsoid='{"dsr":[0,6,1,0]}'><div data-parsoid='{"stx":"html","autoInsertedEnd":true,"dsr":[1,6,5,0]}'></div></li>
<li data-parsoid='{"dsr":[7,16,1,0]}'>a<meta typeof="mw:Placeholder/StrippedTag" data-parsoid='{"src":"&lt;/div>","name":"div","dsr":[10,16,null,null]}'/></li></ul>
<p data-parsoid='{"dsr":[17,17,0,0]}'>b</p>

Change 672161 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: ListHandler: when in EOL state, close lists always

https://gerrit.wikimedia.org/r/672161

Change 672161 merged by jenkins-bot:
[mediawiki/services/parsoid@master] ListHandler: when in EOL state, close lists always

https://gerrit.wikimedia.org/r/672161

Change 675310 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 had a related patch set uploaded (by C. Scott Ananian; author: Subramanya Sastry):
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

Change 675310 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 merged by jenkins-bot:
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

Another off-by-1 DSR computation error that triggers utf-8 errors in template wrapping

[subbu@earth:~/work/wmf/parsoid] cat /tmp/thwt
{{font||css=background-color: rgb(250, 245, 0)}} ''{{font||css=background-color: rgb(255, 40, 0)}} เ'' {{font||css=background-color: rgb(200, 0, 255)}} x
[subbu@earth:~/work/wmf/parsoid] php bin/parse.php --domain th.wikipedia.org --dump dom:post-dsr < /tmp/thwt
----- DOM: post-dsr -----
<body data-parsoid='{"tmp":null,"dsr":[0,156,0,0]}'><p data-parsoid='{"tmp":{"tagId":1,"bits":0},"dsr":[0,null,0,0]}'><meta typeof="mw:Transclusion" about="#mwt1" data-parsoid='{"tmp":{"tagId":2,"bits":0,"tplarginfo":{"targetWt":"font","func":null,"href":"./แม่แบบ:Font","paramInfos":{"1":{"k":"1"},"css":{"k":"css","named":true}}}},"tsr":[0,48],"src":"{{font||css=background-color: rgb(250, 245, 0)}}","dsr":[0,48,null,null]}'/><span style="font-family:sans-serif;font-size:100%;color:black;background-color:transparent;;background-color: rgb(250, 245, 0)" data-parsoid='{"tmp":{"tagId":3,"bits":256},"stx":"html"}'>{{{text}}}</span></p>
<meta typeof="mw:Transclusion/End" about="#mwt1" data-parsoid='{"tmp":{"tagId":4,"bits":256},"tsr":[null,48],"dsr":[null,48,null,null]}'/><pre data-parsoid='{"tmp":{"tagId":5,"bits":0},"dsr":[48,155,1,0]}'><i data-parsoid='{"tmp":{"tagId":6,"bits":0},"tsr":[49,51],"autoInsertedEnd":true,"dsr":[49,null,2,0]}'><meta typeof="mw:Transclusion" about="#mwt2" data-parsoid='{"tmp":{"tagId":7,"bits":0,"tplarginfo":{"targetWt":"font","func":null,"href":"./แม่แบบ:Font","paramInfos":{"1":{"k":"1"},"css":{"k":"css","named":true}}}},"tsr":[51,98],"src":"{{font||css=background-color: rgb(255, 40, 0)}}","dsr":[51,98,null,null]}'/><span style="font-family:sans-serif;font-size:100%;color:black;background-color:transparent;;background-color: rgb(255, 40, 0)" data-parsoid='{"tmp":{"tagId":8,"bits":256},"stx":"html"}'>{{{text}}}</span></i>
<meta typeof="mw:Transclusion/End" about="#mwt2" data-parsoid='{"tmp":{"tagId":9,"bits":256},"tsr":[null,98],"dsr":[null,98,null,null]}'/>เ<i data-parsoid='{"tmp":{"tagId":10,"bits":0},"tsr":[102,104],"autoInsertedEnd":true,"dsr":[101,null,2,0]}'> <meta typeof="mw:Transclusion" about="#mwt3" data-parsoid='{"tmp":{"tagId":11,"bits":0,"tplarginfo":{"targetWt":"font","func":null,"href":"./แม่แบบ:Font","paramInfos":{"1":{"k":"1"},"css":{"k":"css","named":true}}}},"tsr":[105,153],"src":"{{font||css=background-color: rgb(200, 0, 255)}}","dsr":[105,153,null,null]}'/><span style="font-family:sans-serif;font-size:100%;color:black;background-color:transparent;;background-color: rgb(200, 0, 255)" data-parsoid='{"tmp":{"tagId":12,"bits":256},"stx":"html"}'>{{{text}}}</span></i>
<meta typeof="mw:Transclusion/End" about="#mwt3" data-parsoid='{"tmp":{"tagId":13,"bits":256},"tsr":[null,153],"dsr":[null,153,null,null]}'/>x</pre>
</body>
-------------------------

Wikimedia\Assert\InvariantException from line 231 of /home/subbu/work/wmf/parsoid/vendor/wikimedia/assert/src/Assert.php: Invariant failed: Bad UTF-8 at start of string
#0 /home/subbu/work/wmf/parsoid/src/Utils/PHPUtils.php(176): Wikimedia\Assert\Assert::invariant()
#1 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/DOMRangeBuilder.php(965): Wikimedia\Parsoid\Utils\PHPUtils::safeSubstr()
#2 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/DOMRangeBuilder.php(1273): Wikimedia\Parsoid\Wt2Html\PP\Processors\DOMRangeBuilder->encapsulateTemplates()
...

The <i> tag has dsr [101,null,2,0] even though its tsr is [102,104]. So, for some reason the DSR computation started it at 101 instead of 102.

Reproducible with this:

 {{1x|
}} ''{{1x|
}} เ'' {{1x|<span></span>
}} d

The problem seems to be that indentPreDSRCorrection doesn't handle the case when the newlines are triggered by templates and so introduces this off-by-1 error on the <i> tag.

Change 768281 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: Fix off-by-1 dsr error by cleaning up PreHandler

https://gerrit.wikimedia.org/r/768281

Change 768281 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Fix off-by-1 DSR error by cleaning up and simplifying PreHandler

https://gerrit.wikimedia.org/r/768281

The fixes should go out on the next train.

Change 776984 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump parsoid to 0.16.0-a4

https://gerrit.wikimedia.org/r/776984

Change 776984 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.16.0-a4

https://gerrit.wikimedia.org/r/776984