
Parsoid / legacy parser disagree whether to include extension content in TOC
Closed, Resolved · Public

Description

Discovered when looking at round-trip (RT) testing diffs on metawiki:Diff_(blog)

The content in question there was <rss max="3" templatename="Diff-RSS">https://diff.wikimedia.org/feed/</rss>, which produces span-wrapped literal heading tags. That output resulted in crashers that required https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/990630 for T352467.

However, the already-deployed output differs in TOC content:
https://meta.wikimedia.org/wiki/Diff_(blog)?useparsoid=0&useskin=Vector
vs
https://meta.wikimedia.org/wiki/Diff_(blog)?useparsoid=1&useskin=Vector

A simplified test case could be:

== 1 ==

== 2 ==

== 3 ==


hmmm <ref>
<h4> haha </h4>
</ref>

The unfortunate consequence of https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/989618 is that it places the TOC meta inside the extension content on the above metawiki page, since the extension content contains the first heading that findTOCInsertionPoint encounters.
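For illustration, here is a minimal sketch (not Parsoid's actual implementation) of the traversal the fix needs: find the first heading to anchor the TOC meta, while refusing to descend into encapsulated extension content. It assumes extension output is identifiable by Parsoid's typeof="mw:Extension/..." wrapper attribute; findTOCAnchor and isExtensionWrapper are hypothetical names.

<?php
// Hypothetical sketch: locate the first heading that should anchor the
// TOC meta, skipping over encapsulated extension content. Parsoid marks
// extension wrappers with typeof="mw:Extension/<name>".

function isExtensionWrapper( DOMNode $node ): bool {
	return $node instanceof DOMElement &&
		str_contains( $node->getAttribute( 'typeof' ), 'mw:Extension/' );
}

function findTOCAnchor( DOMNode $node ): ?DOMElement {
	for ( $c = $node->firstChild; $c !== null; $c = $c->nextSibling ) {
		if ( isExtensionWrapper( $c ) ) {
			continue; // headings inside extension output don't count
		}
		if ( $c instanceof DOMElement && preg_match( '/^h[1-6]$/', $c->nodeName ) ) {
			return $c;
		}
		$inDescendants = findTOCAnchor( $c );
		if ( $inDescendants !== null ) {
			return $inDescendants;
		}
	}
	return null;
}

// Usage sketch:
// $doc = new DOMDocument();
// @$doc->loadHTML( $html );
// $anchor = findTOCAnchor( $doc->documentElement );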

Event Timeline

There are two bugs here. One is the location of the TOC insertion point, and I propose to use this particular task to handle that. The other is "how to merge section data generated by extensions", and that's /probably/ related to T327429: Merging of ParserOutput::getSections() / ::getTOCData() is not well defined. I say "probably" because you could also say that all heading extraction for the TOC should be a postprocessing pass, in which case we never have to merge TOC data; we just have to ensure the postprocessor traverses the extension content. The alternative is that subparses generate section data in their ParserOutput, which we then need to merge into the parent at the proper place, as in the sketch below.
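To make that second alternative concrete, here is a hypothetical sketch of the merge: the subparse's TOC entries get spliced into the parent's list at the position where the extension appears. Plain arrays stand in for the real ParserOutput/TOCData structures, and mergeTOC is not an existing API.

<?php
// Hypothetical sketch of the "merge" alternative: entries from the
// extension's subparse are spliced into the parent's TOC entry list at
// the index where the extension sits in the parent document.

function mergeTOC( array $parentToc, array $childToc, int $insertAt ): array {
	return array_merge(
		array_slice( $parentToc, 0, $insertAt ),
		$childToc,
		array_slice( $parentToc, $insertAt )
	);
}

// e.g. mergeTOC( [ '1', '2', '3' ], [ 'haha' ], 3 )
// yields [ '1', '2', '3', 'haha' ]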

For Parsoid, T327429 is not relevant. The section-wrapping code generates sections, and the associated TOC data for the entire document, by processing sections as they are generated. So, that will be handled within Parsoid independently of what happens with T327429. Parsoid's TOC generation is indeed a postprocessing pass.
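Roughly, the shape of that postprocessing pass (a simplified sketch, not Parsoid's actual code: plain arrays instead of Parsoid's real section/TOC structures, and the same typeof="mw:Extension/..." assumption as in the earlier sketch):

<?php
// Simplified sketch of TOC generation as a DOM postprocessing pass:
// record one entry per heading while walking the document, and skip
// anything nested inside extension content.

function collectTOC( DOMNode $node, array &$toc = [] ): array {
	for ( $c = $node->firstChild; $c !== null; $c = $c->nextSibling ) {
		if ( !( $c instanceof DOMElement ) ) {
			continue;
		}
		if ( str_contains( $c->getAttribute( 'typeof' ), 'mw:Extension/' ) ) {
			continue; // no TOC entries for extension-generated headings
		}
		if ( preg_match( '/^h([1-6])$/', $c->nodeName, $m ) ) {
			$toc[] = [ 'level' => (int)$m[1], 'line' => trim( $c->textContent ) ];
		}
		collectTOC( $c, $toc );
	}
	return $toc;
}

// On the simplified test case above, this would yield entries for the
// three == headings but none for the heading inside the <ref>.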

Change 991647 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Skip over extension content while looking for TOC insertion point

https://gerrit.wikimedia.org/r/991647

Change 991650 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: Don't add TOC data for sections in extensions

https://gerrit.wikimedia.org/r/991650

Change 991647 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Skip over extension content while looking for TOC insertion point

https://gerrit.wikimedia.org/r/991647

Change 991650 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Don't add TOC data for sections in extensions

https://gerrit.wikimedia.org/r/991650

Change 992095 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a14

https://gerrit.wikimedia.org/r/992095

Change 992095 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a14

https://gerrit.wikimedia.org/r/992095

> For Parsoid, T327429 is not relevant. The section-wrapping code generates sections, and the associated TOC data for the entire document, by processing sections as they are generated. So, that will be handled within Parsoid independently of what happens with T327429. Parsoid's TOC generation is indeed a postprocessing pass.

Currently, sure. The question is what happens if we parse an extension and get back a ParserOutput with section data set -- as happens with special page transclusions, for instance. Do we throw away that data and recompute sections from scratch (which is what Parsoid is doing right now, in which case this is purely a question of postprocessing), or do we somehow "merge" the section data? If the extension were to add TOC entries which *don't* correspond to content in Parsoid's MediaWiki DOM HTML, this would become an acute question. At the moment that's not happening, though, in which case both approaches ought to be equivalent.

Just saying that the two tasks are related in that sense. As you say, if we choose to define TOC generation purely as a postprocessing stage, then the proper metadata merge behavior is "throw it away", which is a valid merge type.
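A sketch of that "throw it away" merge type, for contrast with the splice sketch earlier in the thread; mergeSubparseTOC is a hypothetical name, not an existing MediaWiki API:

<?php
// Hypothetical sketch: "discard" as the merge strategy. The subparse's
// section data is dropped; the postprocessing pass over the composed DOM
// re-derives entries for any headings that actually appear in it.

function mergeSubparseTOC( ?array $parentToc, ?array $childToc ): ?array {
	// "Throw it away" is a valid merge type: $childToc is ignored, since
	// the postprocessor regenerates TOC data from the final document.
	return $parentToc;
}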