Page MenuHomePhabricator

Parsoid assigns wrong anchor in TOCData when there are duplicate IDs
Closed, ResolvedPublic

Description

Test case:
https://en.wikipedia.org/wiki/German-suited_playing_cards?useparsoid=1
where it manifests as floating section links. The wikitext looks like:

{{anchor|Ruimpf cards|Ruimpfkaten|Rümpffkarte|Rümpfkarte|Rumpfkarte|Rümpfspiel|Ruimpf|Ruimpfspiel}}
===== Ruimpf cards =====

which is generating:

<span id="Ruimpf_cards"></span> ....
<h2 id="Ruimpf_cards_2">

and we're wrapping the <span> not the <h2>.

Note that the legacy parser generates the follow HTML, which is *technically* invalid because of the duplicate IDs:

<span id="Ruimpf_cards"></span> ....
<h2 id="Ruimpf_cards">...

Parsoid is doing the right thing by deduplicating IDs and giving the deduplicated ID Ruimpf_cards_2 to the <h2>, but apparently we're creating the TOCData before the deduplication occurs (or not updating the TOCData when the heading is deduplicated) and so our TOCData ends up incorrect.

Another useful test case is:

==Foo==
==Foo==

Even the legacy parser deduplicates this case correctly (the second heading gets id Foo_2). I'm not sure what TOCData Parsoid generates for this case, but I strongly suspect we have the same bug -- ie, correctly deduplicate the HTML but have anchor=Foo for both SectionMetadata objects in our TOCData.

We could also consider emitting lint errors when we deduplicate IDs.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

(Worth noting that I did semi-deliberately make the "match by the ID listed in SectionMetadata" fragile, thinking that this would be a good way to smoke out errors in TOCData -- and in fact that turned out to be the case.)

ABreault-WMF moved this task from Backlog to In Progress on the Content-Transform-Team-WIP board.

Change 1009618 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] [WIP] Duplicate heading ids

https://gerrit.wikimedia.org/r/1009618

WrapSectionsState::computeSectionMetadata seems blissfully unaware of cases where the reusedId flag is set. For example,

<h3 id="asdf">odd</h3>

produces TOCData,

 Sections:
- h3 index: toclevel:1 number:1 title:NULL off:NULL anchor/linkAnchor:asdf line:odd
+ h3 index: toclevel:1 number:1 title:NULL off:NULL anchor/linkAnchor:odd line:odd

Change 1009618 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Combine the anchor processing from WrapSectionState and Headings

https://gerrit.wikimedia.org/r/1009618

Change 1010238 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a22

https://gerrit.wikimedia.org/r/1010238

Change 1010238 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a22

https://gerrit.wikimedia.org/r/1010238