
Parsoid / legacy parser disagree whether to include extension content in TOC
Closed, Resolved · Public

Description

Discovered when looking at round-trip (RT) testing diffs on metawiki:Diff_(blog)

The content in question there was <rss max="3" templatename="Diff-RSS">https://diff.wikimedia.org/feed/</rss>, which produces span-wrapped literal heading tags. That output resulted in crashers that required https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/990630 for T352467.

However, the already-deployed output differs in TOC content:
https://meta.wikimedia.org/wiki/Diff_(blog)?useparsoid=0&useskin=Vector
vs
https://meta.wikimedia.org/wiki/Diff_(blog)?useparsoid=1&useskin=Vector

A simplified test case could be:

== 1 ==

== 2 ==

== 3 ==


hmmm <ref>
<h4> haha </h4>
</ref>

The unfortunate consequence of https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/989618 is that it places the TOC meta inside the extension content on the above metawiki page, since the extension content contains the first heading that findTOCInsertionPoint encounters.
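For illustration, here is a minimal sketch (not Parsoid's actual implementation) of the traversal the fix needs: find the first heading to anchor the TOC meta, while refusing to descend into encapsulated extension content. It assumes extension output is identifiable by Parsoid's typeof="mw:Extension/..." wrapper attribute; findTOCAnchor and isExtensionWrapper are hypothetical names.

<?php
// Hypothetical sketch: locate the first heading that should anchor the
// TOC meta, skipping over encapsulated extension content. Parsoid marks
// extension wrappers with typeof="mw:Extension/<name>".

function isExtensionWrapper( DOMNode $node ): bool {
	return $node instanceof DOMElement &&
		str_contains( $node->getAttribute( 'typeof' ), 'mw:Extension/' );
}

function findTOCAnchor( DOMNode $node ): ?DOMElement {
	for ( $c = $node->firstChild; $c !== null; $c = $c->nextSibling ) {
		if ( isExtensionWrapper( $c ) ) {
			continue; // headings inside extension output don't count
		}
		if ( $c instanceof DOMElement && preg_match( '/^h[1-6]$/', $c->nodeName ) ) {
			return $c;
		}
		$inDescendants = findTOCAnchor( $c );
		if ( $inDescendants !== null ) {
			return $inDescendants;
		}
	}
	return null;
}

// Usage sketch:
// $doc = new DOMDocument();
// @$doc->loadHTML( $html );
// $anchor = findTOCAnchor( $doc->documentElement );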

Event Timeline

There are two bugs here. One is the location of the TOC insertion point, and I propose to use this particular task to handle that. The other is "how to merge section data generated by extensions", and that's /probably/ related to T327429: Merging of ParserOutput::getSections() / ::getTOCData() is not well defined. I say "probably" because you could also say that all heading extraction for the TOC should be a postprocessing pass, in which case we never have to merge TOC data; we just have to ensure the postprocessor traverses the extension content. The alternative is that subparses generate section data in their ParserOutput, which we then need to merge into the parent at the proper place, as in the sketch below.
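To make that second alternative concrete, here is a hypothetical sketch of the merge: the subparse's TOC entries get spliced into the parent's list at the position where the extension appears. Plain arrays stand in for the real ParserOutput/TOCData structures, and mergeTOC is not an existing API.

<?php
// Hypothetical sketch of the "merge" alternative: entries from the
// extension's subparse are spliced into the parent's TOC entry list at
// the index where the extension sits in the parent document.

function mergeTOC( array $parentToc, array $childToc, int $insertAt ): array {
	return array_merge(
		array_slice( $parentToc, 0, $insertAt ),
		$childToc,
		array_slice( $parentToc, $insertAt )
	);
}

// e.g. mergeTOC( [ '1', '2', '3' ], [ 'haha' ], 3 )
// yields [ '1', '2', '3', 'haha' ]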

For Parsoid, T327429 is not relevant. The section-wrapping code generates sections, and the associated TOC data for the entire document, by processing sections as they are generated. So, that will be handled within Parsoid independently of what happens with T327429. Parsoid's TOC generation is indeed a postprocessing pass.
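Roughly, the shape of that postprocessing pass (a simplified sketch, not Parsoid's actual code: plain arrays instead of Parsoid's real section/TOC structures, and the same typeof="mw:Extension/..." assumption as in the earlier sketch):

<?php
// Simplified sketch of TOC generation as a DOM postprocessing pass:
// record one entry per heading while walking the document, and skip
// anything nested inside extension content.

function collectTOC( DOMNode $node, array &$toc = [] ): array {
	for ( $c = $node->firstChild; $c !== null; $c = $c->nextSibling ) {
		if ( !( $c instanceof DOMElement ) ) {
			continue;
		}
		if ( str_contains( $c->getAttribute( 'typeof' ), 'mw:Extension/' ) ) {
			continue; // no TOC entries for extension-generated headings
		}
		if ( preg_match( '/^h([1-6])$/', $c->nodeName, $m ) ) {
			$toc[] = [ 'level' => (int)$m[1], 'line' => trim( $c->textContent ) ];
		}
		collectTOC( $c, $toc );
	}
	return $toc;
}

// On the simplified test case above, this would yield entries for the
// three == headings but none for the heading inside the <ref>.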

Change 991647 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Skip over extension content while looking for TOC insertion point

https://gerrit.wikimedia.org/r/991647

Change 991650 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: Don't add TOC data for sections in extensions

https://gerrit.wikimedia.org/r/991650

Change 991647 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Skip over extension content while looking for TOC insertion point

https://gerrit.wikimedia.org/r/991647

Change 991650 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Don't add TOC data for sections in extensions

https://gerrit.wikimedia.org/r/991650

Change 992095 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a14

https://gerrit.wikimedia.org/r/992095

Change 992095 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a14

https://gerrit.wikimedia.org/r/992095

> For Parsoid, T327429 is not relevant. The section-wrapping code generates sections, and the associated TOC data for the entire document, by processing sections as they are generated. So, that will be handled within Parsoid independently of what happens with T327429. Parsoid's TOC generation is indeed a postprocessing pass.

Currently, sure. The question is what happens if we parse an extension and get back a ParserOutput with section data set -- as happens with special page transclusions, for instance. Do we throw away that data and recompute sections from scratch (which is what Parsoid is doing right now, in which case this is purely a question of postprocessing), or do we somehow "merge" the section data? If the extension were to add TOC entries which *don't* correspond to content in Parsoid's MediaWiki DOM HTML, this would become an acute question. At the moment that's not happening, though, in which case both approaches ought to be equivalent.

Just saying that the two tasks are related in that sense. As you say, if we choose to define TOC generation purely as a postprocessing stage, then the proper metadata merge behavior is "throw it away", which is a valid merge type.
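A sketch of that "throw it away" merge type, for contrast with the splice sketch earlier in the thread; mergeSubparseTOC is a hypothetical name, not an existing MediaWiki API:

<?php
// Hypothetical sketch: "discard" as the merge strategy. The subparse's
// section data is dropped; the postprocessing pass over the composed DOM
// re-derives entries for any headings that actually appear in it.

function mergeSubparseTOC( ?array $parentToc, ?array $childToc ): ?array {
	// "Throw it away" is a valid merge type: $childToc is ignored, since
	// the postprocessor regenerates TOC data from the final document.
	return $parentToc;
}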