PHP-parser + Remex combo output differs from PHP-parser + Tidy combo on some dl-dt wikitext snippets
Closed, Resolved, Public

Event Timeline

ssastry triaged this task as Medium priority. Sep 5 2017, 11:46 PM

See the transcript below. This is not a Remex bug after all. Tidy does overeager fixup of the HTML that is not mandated by the HTML5 spec. Parsoid doesn't show the problem because it emits the closing </dt> tag and opens a <dd> when it encounters the newline, which in turn matches the PHP-parser + Tidy output. So this is more of a PHP parser bug: the PHP parser output needs to be fixed to close the <dt> and open a <dd>, as Parsoid does.

----- PARSOID -----
[subbu@earth maintenance] echo ';a\n:*b' | parse.js --trace html --normalize
0-[HTML]       | {"type":"TagTk","name":"dl","attribs":[],"dataAttribs":{"tsr":[0,0],"tmp":{"tagId":1}}}
0-[HTML]       | {"type":"TagTk","name":"dt","attribs":[],"dataAttribs":{"tsr":[0,1],"tmp":{"tagId":2}}}
0-[HTML]       | "a"
0-[HTML]       | {"type":"EndTagTk","name":"dt","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[2,3],"tmp":{}}}
0-[HTML]       | {"type":"TagTk","name":"dd","attribs":[],"dataAttribs":{"tsr":[3,4],"tmp":{"tagId":3}}}
0-[HTML]       | {"type":"TagTk","name":"ul","attribs":[],"dataAttribs":{"tsr":[4,4],"tmp":{"tagId":4}}}
0-[HTML]       | {"type":"TagTk","name":"li","attribs":[],"dataAttribs":{"tsr":[4,5],"tmp":{"tagId":5}}}
0-[HTML]       | "b"
0-[HTML]       | {"type":"EndTagTk","name":"li","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"EndTagTk","name":"ul","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"EndTagTk","name":"dd","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"EndTagTk","name":"dl","attribs":[],"dataAttribs":{"tmp":{}}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[6,7],"tmp":{}}}
0-[HTML]       | {"type":"EOFTk"}

<dl>
<dt>a</dt>
<dd>
<ul>
<li>b</li>
</ul>
</dd>
</dl>

----- PHP PARSER WITHOUT TIDY -----
[subbu@earth maintenance] echo ';a\n:*b' | php parse.php 
<div class="mw-parser-output"><dl><dt>a
<ul><li>b</li></ul></dt></dl>
</div>

----- PHP PARSER WITH TIDY -----
[subbu@earth maintenance] echo ';a\n:*b' | php parse.php --tidy
<div class="mw-parser-output"><dl>
<dt>a</dt>
<dd>
<ul>
<li>b</li>
</ul>
</dd>
</dl>

</div>

----- PARSOID when run on PHP PARSER's output -----
[subbu@earth maintenance] echo '<dl><dt>a\n<ul><li>b</li></ul></dt></dl>' | parse.js --normalize
<dl>
<dt>a
<ul>
<li>b</li>
</ul>
</dt>
</dl>

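To make the difference between the two outputs concrete, here is a minimal Python sketch of the behaviour described above. It is not the MediaWiki or Parsoid code: the function `render_dl` and its `close_item_on_newline` flag are hypothetical, and the toy only understands lines that start with `;` or `:`, optionally followed by a single `*`. With the flag off it reproduces the shape of the untidied PHP parser output for this one snippet; with the flag on it closes the open item and opens the next one at the newline, as Parsoid does.

```python
def render_dl(wikitext: str, close_item_on_newline: bool = True) -> str:
    """Toy renderer for ';' (term) and ':' (definition) lines.

    Hypothetical helper for illustration only; it is not how the
    PHP parser or Parsoid is implemented.
    """
    parts = ["<dl>"]
    open_tag = None  # "dt" or "dd" currently open inside the <dl>

    for line in wikitext.splitlines():
        if not line or line[0] not in ";:":
            continue  # toy: ignore anything that is not a dl line

        wanted = "dt" if line[0] == ";" else "dd"
        body = line[1:]

        if open_tag is None:
            parts.append(f"<{wanted}>")
            open_tag = wanted
        elif close_item_on_newline:
            # Close the open item and start the next one at the newline.
            parts.append(f"</{open_tag}><{wanted}>")
            open_tag = wanted
        else:
            # Stay inside the still-open item, keeping only the newline.
            parts.append("\n")

        if body.startswith("*"):
            # One level of nested bullet list inside the current item.
            parts.append(f"<ul><li>{body[1:]}</li></ul>")
        else:
            parts.append(body)

    if open_tag is not None:
        parts.append(f"</{open_tag}>")
    parts.append("</dl>")
    return "".join(parts)


fixed = render_dl(";a\n:*b")
unfixed = render_dl(";a\n:*b", close_item_on_newline=False)
```

Here `fixed` has the same element structure as the Parsoid and PHP-parser + Tidy output above (ignoring whitespace), while `unfixed` keeps the `<ul>` nested inside the `<dt>`, as in the untidied PHP parser output.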

ssastry renamed this task from "Remex HTML5 tree building bug in dl-dt handling?" to "PHP-parser + Remex combo output differs from PHP-parser + Tidy combo on some dl-dt wikitext snippets". Sep 6 2017, 10:17 PM
ssastry updated the task description.

Change 376446 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/core@master] WIP: Fix bug in dl-dt list output generation

https://gerrit.wikimedia.org/r/376446

Change 376446 merged by jenkins-bot:
[mediawiki/core@master] Fix bug in dl-dt list output generation

https://gerrit.wikimedia.org/r/376446

ssastry claimed this task.