Parsoid assigns wrong anchor in TOCData when there are duplicate IDs
Closed, Resolved (Public)

Assigned To

Authored By

	cscott
	Feb 24 2024, 12:13 AM

Description

Test case:
https://en.wikipedia.org/wiki/German-suited_playing_cards?useparsoid=1
where it manifests as floating section links. The wikitext looks like:

{{anchor|Ruimpf cards|Ruimpfkaten|Rümpffkarte|Rümpfkarte|Rumpfkarte|Rümpfspiel|Ruimpf|Ruimpfspiel}}
===== Ruimpf cards =====

which is generating:

<span id="Ruimpf_cards"></span> ....
<h2 id="Ruimpf_cards_2">

and we're wrapping the <span> not the <h2>.

Note that the legacy parser generates the following HTML, which is *technically* invalid because of the duplicate IDs:

<span id="Ruimpf_cards"></span> ....
<h2 id="Ruimpf_cards">...

Parsoid is doing the right thing by deduplicating IDs and giving the deduplicated ID Ruimpf_cards_2 to the <h2>, but apparently we're creating the TOCData before the deduplication occurs (or not updating the TOCData when the heading is deduplicated) and so our TOCData ends up incorrect.
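To make the suspected ordering problem concrete, here is a minimal Python sketch. It is a toy model, not Parsoid's code: the `(tag, id)` tuples, `dedupe`, `toc_anchors`, and `element_for_anchor` are all hypothetical names invented for illustration, and the suffixing rule is simplified to "first element keeps the ID, later ones get `_2`, `_3`, ...".

```python
from typing import List, Optional, Tuple

Node = Tuple[str, str]  # (tag name, requested id); a toy stand-in for DOM elements
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def dedupe(nodes: List[Node]) -> List[Node]:
    """Give later duplicates a numeric suffix, in document order (simplified rule)."""
    seen = set()
    result = []
    for tag, wanted in nodes:
        final, n = wanted, 2
        while final in seen:
            final = f"{wanted}_{n}"
            n += 1
        seen.add(final)
        result.append((tag, final))
    return result


def toc_anchors(nodes: List[Node]) -> List[str]:
    """Collect the anchors a TOC would record for the headings in `nodes`."""
    return [id_ for tag, id_ in nodes if tag in HEADINGS]


def element_for_anchor(nodes: List[Node], anchor: str) -> Optional[str]:
    """Return the tag of the first element carrying `anchor`, or None."""
    for tag, id_ in nodes:
        if id_ == anchor:
            return tag
    return None


# The test case from the description: an {{anchor}} span followed by a heading.
doc = [("span", "Ruimpf_cards"), ("h2", "Ruimpf_cards")]
final_doc = dedupe(doc)

stale = toc_anchors(doc)        # anchors captured before deduplication
fresh = toc_anchors(final_doc)  # anchors captured after deduplication

# The stale anchor resolves to the <span>; the fresh one resolves to the <h2>.
stale_target = element_for_anchor(final_doc, stale[0])
fresh_target = element_for_anchor(final_doc, fresh[0])
```

In this model, looking up the pre-deduplication anchor in the final document finds the `<span>`, which matches the "wrapping the `<span>` not the `<h2>`" symptom described above.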

Another useful test case is:

==Foo==
==Foo==

Even the legacy parser deduplicates this case correctly (the second heading gets id Foo_2). I'm not sure what TOCData Parsoid generates for this case, but I strongly suspect we have the same bug, i.e., we correctly deduplicate the HTML but have anchor=Foo for both SectionMetadata objects in our TOCData.
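A consistency check along the following lines would flag the suspected mismatch for this test case. This is a sketch under assumptions: `Section` is a hypothetical stand-in for a SectionMetadata entry (only `line` and `anchor` are modelled), `check_toc` is an invented helper, and the "suspected" data is the guess from the paragraph above, not observed Parsoid output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    """Hypothetical stand-in for one SectionMetadata entry."""
    line: str    # heading text
    anchor: str  # anchor recorded in the TOC data


def check_toc(sections: List[Section], heading_ids: List[str]) -> List[str]:
    """Compare TOC anchors against the heading ids in the final HTML, in order."""
    problems = []
    if len(sections) != len(heading_ids):
        problems.append(
            f"{len(sections)} sections but {len(heading_ids)} headings"
        )
    for i, (section, hid) in enumerate(zip(sections, heading_ids)):
        if section.anchor != hid:
            problems.append(
                f"section {i}: anchor {section.anchor!r} != heading id {hid!r}"
            )
    return problems


# Heading ids after deduplication of "==Foo==" followed by "==Foo==".
heading_ids = ["Foo", "Foo_2"]

# Suspected (buggy) TOC data versus what the deduplicated HTML requires.
suspected = [Section("Foo", "Foo"), Section("Foo", "Foo")]
expected = [Section("Foo", "Foo"), Section("Foo", "Foo_2")]

suspected_problems = check_toc(suspected, heading_ids)
expected_problems = check_toc(expected, heading_ids)
```

Here `suspected_problems` contains one entry for the second section, while `expected_problems` is empty.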

We could also consider emitting lint errors when we deduplicate IDs.

Details

	Subject	Repo	Branch	Lines +/-
	Bump wikimedia/parsoid to 0.19.0-a22	mediawiki/vendor	master	+2 K -1 K
	Combine the anchor processing from WrapSectionState and Headings	mediawiki/services/parsoid	master	+122 -103


Related Objects

Mentioned In: T359896: Parsoid EditSectionLink pass adding edit section links in the middle of lists in some cases
Mentioned Here: T200517: Emit lint error or category when a page uses duplicate HTML IDs

Event Timeline

cscott created this task. Feb 24 2024, 12:13 AM

Restricted Application added a project: Parsoid. Feb 24 2024, 12:13 AM

Restricted Application added a subscriber: Aklapper.

cscott updated the task description. Feb 24 2024, 12:14 AM

(Worth noting that I did semi-deliberately make the "match by the ID listed in SectionMetadata" fragile, thinking that this would be a good way to smoke out errors in TOCData -- and in fact that turned out to be the case.)

We could also consider emitting lint errors when we deduplicate IDs.

T200517: Emit lint error or category when a page uses duplicate HTML IDs

Arlolra claimed this task. Feb 26 2024, 6:51 PM

Arlolra added a project: Content-Transform-Team-WIP.

ABreault-WMF triaged this task as Medium priority. Mar 7 2024, 7:40 PM

ABreault-WMF moved this task from Backlog to In Progress on the Content-Transform-Team-WIP board.

Change 1009618 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] [WIP] Duplicate heading ids

https://gerrit.wikimedia.org/r/1009618

gerritbot added a project: Patch-For-Review. Mar 7 2024, 10:52 PM

WrapSectionsState::computeSectionMetadata seems blissfully unaware of cases where the reusedId flag is set. For example,

<h3 id="asdf">odd</h3>

produces TOCData,

 Sections:
- h3 index: toclevel:1 number:1 title:NULL off:NULL anchor/linkAnchor:asdf line:odd
+ h3 index: toclevel:1 number:1 title:NULL off:NULL anchor/linkAnchor:odd line:odd
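Reading the diff above as expected (`-`) versus actual (`+`), the TOC anchor should follow the heading's explicit `id` rather than an anchor derived from its text. A minimal sketch of that rule, with assumptions stated: `derived_anchor` and `toc_anchor` are hypothetical helpers, and the text-to-anchor rule is reduced to "collapse whitespace into underscores", which is far simpler than the real anchor encoding.

```python
from typing import Optional


def derived_anchor(text: str) -> str:
    """Simplified anchor derivation: whitespace runs become single underscores."""
    return "_".join(text.split())


def toc_anchor(text: str, explicit_id: Optional[str] = None) -> str:
    """Prefer an id the heading already carries over one derived from its text."""
    if explicit_id:
        return explicit_id
    return derived_anchor(text)


# <h3 id="asdf">odd</h3>: the explicit id wins.
with_explicit = toc_anchor("odd", explicit_id="asdf")

# A heading without an explicit id falls back to the derived anchor.
without_explicit = toc_anchor("Ruimpf cards")
```

With this rule, the `<h3 id="asdf">odd</h3>` example yields the anchor `asdf`, matching the `-` line of the diff.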

ssastry moved this task from Needs Triage to Bugs & Crashers on the Parsoid board. Mar 8 2024, 9:46 PM

Change 1009618 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Combine the anchor processing from WrapSectionState and Headings

https://gerrit.wikimedia.org/r/1009618

Arlolra moved this task from In Progress to To Deploy on the Content-Transform-Team-WIP board. Mar 9 2024, 3:43 PM

Maintenance_bot removed a project: Patch-For-Review. Mar 9 2024, 4:30 PM

Change 1010238 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a22

https://gerrit.wikimedia.org/r/1010238

gerritbot added a project: Patch-For-Review. Mar 11 2024, 4:52 PM

Change 1010238 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a22

https://gerrit.wikimedia.org/r/1010238

Maintenance_bot removed a project: Patch-For-Review. Mar 11 2024, 8:30 PM

cscott mentioned this in T359896: Parsoid EditSectionLink pass adding edit section links in the middle of lists in some cases. Mar 12 2024, 4:14 PM

ssastry merged a task: T359896: Parsoid EditSectionLink pass adding edit section links in the middle of lists in some cases. Mar 14 2024, 3:14 PM

ssastry moved this task from To Deploy to To Verify on the Content-Transform-Team-WIP board.

ssastry added subscribers: ssastry, Arlolra.

After the deploy and an ?action=purge, this seems fine now:
https://en.wikipedia.org/wiki/German-suited_playing_cards?useparsoid=1#Ruimpf_cards_2
