Page MenuHomePhabricator

Off-by-1 DSR computation edge case error (that leads to subsequent invalid utf-8 errors in template wrapping)
Open, MediumPublic

Description

Reduced test case:

[subbu@earth:~/work/wmf/parsoid] cat /tmp/wt
*<div>{{1x|
*c
}}</div>d
[subbu@earth:~/work/wmf/parsoid] php bin/parse.php --wt2wt < /tmp/wt
*<div>{{1x|
*c
}}</div>dd

[subbu@earth:~/work/wmf/parsoid] php bin/parse.php < /tmp/wt
<ul about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"dsr":[0,24,0,0],"firstWikitextNode":"UL","pi":[[{"k":"1"}]]}' data-mw='{"parts":["*&lt;div>",{"template":{"target":{"wt":"1x","href":"./Template:1x"},"params":{"1":{"wt":"\n*c\n"}},"i":0}},"&lt;/div>d"]}'><li><div></div></li>
<li>c<meta typeof="mw:Placeholder/StrippedTag"/></li></ul><span about="#mwt1">
</span><p data-parsoid='{"dsr":[24,24,0,0]}'>d</p>

The list has got offset [0,24] whereas it should be [0,23]. If that 'd' was a multi-byte character, then template wrapping would try to take a substring in the middle of the character causing it to trigger a UTF-8 error.

See below:

[subbu@earth:~/work/wmf/parsoid] cat /tmp/wt
*<div>{{1x|
*c
}}</div>の

[subbu@earth:~/work/wmf/parsoid] php bin/parse.php < /tmp/wt
Wikimedia\Assert\InvariantException from line 224 of /home/subbu/work/wmf/parsoid/vendor/wikimedia/assert/src/Assert.php: Invariant failed: Bad UTF-8 at end of string (3 byte sequence)
#0 /home/subbu/work/wmf/parsoid/src/Utils/PHPUtils.php(218): Wikimedia\Assert\Assert::invariant()
#1 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(963): Wikimedia\Parsoid\Utils\PHPUtils::safeSubstr()
#2 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(1226): Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapTemplates::encapsulateTemplates()
#3 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(1239): Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapTemplates::wrapTemplatesInTree()
#4 /home/subbu/work/wmf/parsoid/src/Wt2Html/DOMPostProcessor.php(158): Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapTemplates->run()
#5 /home/subbu/work/wmf/parsoid/src/Wt2Html/DOMPostProcessor.php(837): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->Wikimedia\Parsoid\Wt2Html\{closure}()
#6 /home/subbu/work/wmf/parsoid/src/Wt2Html/DOMPostProcessor.php(887): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->doPostProcess()
#7 /home/subbu/work/wmf/parsoid/src/Wt2Html/DOMPostProcessor.php(904): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->process()
#8 /home/subbu/work/wmf/parsoid/src/Wt2Html/ParserPipeline.php(178): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->processChunkily()
#9 /home/subbu/work/wmf/parsoid/src/Wt2Html/ParserPipelineFactory.php(310): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseChunkily()
#10 /home/subbu/work/wmf/parsoid/src/Core/WikitextContentModelHandler.php(106): Wikimedia\Parsoid\Wt2Html\ParserPipelineFactory->parse()
#11 /home/subbu/work/wmf/parsoid/src/Parsoid.php(162): Wikimedia\Parsoid\Core\WikitextContentModelHandler->toDOM()
#12 /home/subbu/work/wmf/parsoid/src/Parsoid.php(194): Wikimedia\Parsoid\Parsoid->parseWikitext()
#13 /home/subbu/work/wmf/parsoid/bin/parse.php(327): Wikimedia\Parsoid\Parsoid->wikitext2html()
#14 /home/subbu/work/wmf/parsoid/bin/parse.php(584): Parse->wt2Html()
#15 /home/subbu/work/wmf/parsoid/tools/doMaintenance.php(53): Parse->execute()
#16 /home/subbu/work/wmf/parsoid/bin/parse.php(620): require_once('/home/subbu/wor...')
#17 {main}

This is probably another source of UTF-8 production error logged in T269749: WrapTemplates: UTF-8 errors

Event Timeline

ssastry triaged this task as Medium priority.Mar 14 2021, 5:42 PM

This looks more like a list-wrapping error where the list wrapper treats the </div> as part of the list item. This causes the newline to be swapped from inside the list to outside the list and confuses the DSR algo later. It even helpfully warns about this.

[subbu@earth:~/work/wmf/parsoid] cat /tmp/wt
*<div>
*a
</div>b
[subbu@earth:~/work/wmf/parsoid] php bin/parse.php --trace html < /tmp/wt
0-[HTML]       | {"type":"TagTk","name":"ul","attribs":[],"dataAttribs":{"tsr":[0,0],"tmp":{"tagId":1}}}
0-[HTML]       | {"type":"TagTk","name":"li","attribs":[],"dataAttribs":{"tsr":[0,1],"tmp":{"tagId":2}}}
0-[HTML]       | {"type":"TagTk","name":"div","attribs":[],"dataAttribs":{"tsr":[1,6],"stx":"html","tmp":{"tagId":3}}}
0-[HTML]       | {"type":"EndTagTk","name":"li","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[6,7],"tmp":{}}}
0-[HTML]       | {"type":"TagTk","name":"li","attribs":[],"dataAttribs":{"tsr":[7,8],"tmp":{"tagId":4}}}
0-[HTML]       | "a"
0-[HTML]       | {"type":"EndTagTk","name":"div","attribs":[],"dataAttribs":{"tsr":[10,16],"stx":"html","tmp":{}}}
0-[HTML]       | {"type":"EndTagTk","name":"li","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"EndTagTk","name":"ul","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[9,10],"tmp":{}}}
0-[HTML]       | {"type":"TagTk","name":"p","attribs":[],"dataAttribs":{"tmp":{"tagId":5}}}
0-[HTML]       | "b"
0-[HTML]       | {"type":"EndTagTk","name":"p","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[17,18],"tmp":{}}}
0-[HTML]       | {"type":"EOFTk"}
[warn/dsr/inconsistent] DSR inconsistency: cs/s mismatch for node: li s:8 ; cs:9
<ul data-parsoid='{"dsr":[0,16,0,0]}'><li data-parsoid='{"dsr":[0,6,1,0]}'><div data-parsoid='{"stx":"html","autoInsertedEnd":true,"dsr":[1,6,5,0]}'></div></li>
<li data-parsoid='{"dsr":[7,16,1,0]}'>a<meta typeof="mw:Placeholder/StrippedTag" data-parsoid='{"src":"&lt;/div>","name":"div","dsr":[10,16,null,null]}'/></li></ul>
<p data-parsoid='{"dsr":[17,17,0,0]}'>b</p>

Change 672161 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: ListHandler: when in EOL state, close lists always

https://gerrit.wikimedia.org/r/672161

Change 672161 merged by jenkins-bot:
[mediawiki/services/parsoid@master] ListHandler: when in EOL state, close lists always

https://gerrit.wikimedia.org/r/672161

Change 675310 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 had a related patch set uploaded (by C. Scott Ananian; author: Subramanya Sastry):
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

Change 675310 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 merged by jenkins-bot:
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738