Derive heading ids from heading name, the same way MW core does
Closed, ResolvedPublic
Assigned To

Authored By

	• GWicke
	Jun 12 2015, 12:41 AM

Description

MediaWiki core creates ids per heading based on the content of that heading. This is used to link to the given section from the table of contents, and is also often used by users to reference specific sections.

It would be great if Parsoid implemented the same ids, possibly in addition to the current random ids (using a meta tag?).

See also the email thread starting with https://lists.wikimedia.org/pipermail/mobile-l/2015-October/009886.html. As @Tgr points out the ids need to be unique, too.

Related Objects

Mentioned In: T154279: Cannot load a saved translation, JS error from Sizzle at $section.data( 'source' )
T150213: Unknown contentmodels
T149209: Parsoid serialised an edit to a wikitext table adding a /n without stripping the double-pipes, breaking the table format (`\n|| align="right" | …`)
T151570: Create Wikivoyage Finnish
T150112: Internal links pointing to interwikis are not encoded at all
T94949: Interwiki links to other MediaWiki wikis in the same cluster don't encode section fragment
T110910: Implement <gallery> extension natively inside Parsoid
T148645: Content service doesn't handle URL fragments when redirecting
T116876: Provide same anchor ids for sections as Core does
Mentioned Here: T154279: Cannot load a saved translation, JS error from Sizzle at $section.data( 'source' )
T152540: Migrate to HTML5 section ids
rGPAR3cf19c6be98b: Place typeof="mw:Image" on gallery image span wrappers
T94949: Interwiki links to other MediaWiki wikis in the same cluster don't encode section fragment
T110910: Implement <gallery> extension natively inside Parsoid
T149209: Parsoid serialised an edit to a wikitext table adding a /n without stripping the double-pipes, breaking the table format (`\n|| align="right" | …`)
T150112: Internal links pointing to interwikis are not encoded at all
T150213: Unknown contentmodels
T151570: Create Wikivoyage Finnish
T59252: Cite: Reference extension outputs unescaped fragments in parsoid

Event Timeline

• GWicke created this task. Jun 12 2015, 12:41 AM

• GWicke raised the priority of this task from to Needs Triage.

• GWicke updated the task description.

• GWicke added a project: Parsoid.

• GWicke subscribed.

Restricted Application added a subscriber: Aklapper. Jun 12 2015, 12:41 AM

Arlolra triaged this task as Medium priority. Jul 7 2015, 2:56 AM

Arlolra subscribed.

• ssastry mentioned this in T116876: Provide same anchor ids for sections as Core does. Oct 28 2015, 4:20 PM

• ssastry merged a task: T116876: Provide same anchor ids for sections as Core does.

• ssastry updated the task description.

• ssastry set Security to None.

• ssastry added subscribers: ssastry, Tgr, bearND.

• GWicke mentioned this in T148645: Content service doesn't handle URL fragments when redirecting. Oct 26 2016, 3:38 PM

See also T59252 for a discussion of the historic fragment ID encoding vs. HTML5.

In MCS we added an anchorencode function[1] that is used when it builds section anchors[2]. Maybe that could be useful as a starting point?

[1] https://phabricator.wikimedia.org/diffusion/GMOA/browse/master/lib/anchorencode.js
[2] https://phabricator.wikimedia.org/diffusion/GMOA/browse/master/lib/parsoid-access.js;f872894780f8388e472a0eb3e6aa46da1f8a6269$102

So, heading anchors are generated from wikitext of a heading, not the innerHTML of the heading. So, how is MCS doing this right now?

I ask because Parsoid doesn't have easy access to the wikitext of individual nodes that come from template content. It only has this for top-level content. So, I am trying to figure out how to deal with this. Of course, most headings are plain text, and the innerHTML of the heading will be equivalent to the wikitext, especially since the formatHeadings code strips HTML tags and normalizes whitespace. But, this fails for headings that have quotes, links, or templates.

In T102209#2786711, @ssastry wrote:

So, heading anchors are generated from wikitext of a heading, not the innerHTML of the heading. So, how is MCS doing this right now?

I ask because Parsoid doesn't have easy access to the wikitext of individual nodes that come from template content. It only has this for top-level content. So, I am trying to figure out how to deal with this. Of course, most headings are plain text, and the innerHTML of the heading will be equivalent to the wikitext, especially since the formatHeadings code strips HTML tags and normalizes whitespace. But, this fails for headings that have quotes, links, or templates.

Scratch all that. I was wrong. It looks like the heading anchors are generated from the HTML, not the wikitext. I misinterpreted $text (param to formatHeadings) as being wikitext, but it is actually html.

IIRC sometimes the {{}} of the template is visible in the ID; it depends on which of several anchor encoding code paths are called. (I think there is one for the actual page HTML, one for the links in page history / recentchanges, and one for returning you where you were after clicking on a section edit link and saving? I might be confusing things, it was a long time ago I looked at this.)

In T102209#2787222, @Tgr wrote:

IIRC sometimes the {{}} of the template is visible in the ID; it depends on which of several anchor encoding code paths are called. (I think there is one for the actual page HTML, one for the links in page history / recentchanges, and one for returning you where you were after clicking on a section edit link and saving? I might be confusing things, it was a long time ago I looked at this.)

In the enwp sandbox, I tested == {{1x|1=moo and ''boo'' and [[gah]] and {{1x|1=wtf}} x}} == and the heading anchor is <span class="mw-headline" id="moo_and_boo_and_gah_and_wtf_x"> which doesn't have any of the wikitext chars in it.
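
For illustration, here is a minimal sketch of deriving an anchor from a heading's rendered HTML rather than its wikitext, consistent with the sandbox result above. It assumes the derivation amounts to stripping tags, normalizing whitespace, and replacing spaces with underscores; the function name heading_id is hypothetical, and this is not core's or Parsoid's actual code.

```python
import html
import re


def heading_id(heading_html):
    """Sketch: derive an anchor from a heading's rendered HTML.

    Assumed steps (not taken from core's source): drop tags, decode
    entities, collapse whitespace, join words with underscores.
    """
    text = re.sub(r"<[^>]+>", "", heading_html)  # strip HTML tags
    text = html.unescape(text)                   # decode entities such as &amp;
    return "_".join(text.split())                # normalize whitespace, then underscores


rendered = ' moo and <i>boo</i> and <a href="/wiki/Gah" title="Gah">gah</a> and wtf x '
print(heading_id(rendered))
```

Because the input is the expanded HTML, template braces and quote markup never reach the id, which is what the sandbox test shows.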

So, it turns out that inserting a <meta id="foo" /> does not help. https://.../title#foo does not take you to the anchor. You are still left at the top of the page. <h2 id="foo"> works as expected, but we cannot add ids to heading tags since we are using ids for data-parsoid. So, realistic options are using <span id="foo"></span> or <a name="foo"></a> both of which work as expected.

Named anchors solution seems like a "clean" solution. However, this can potentially impact clients that inspect a-tags. So, needs some discussion with them.

We could use empty spans, but it feels like a hack.

Thoughts?

In T102209#2787310, @ssastry wrote:

So, it turns out that inserting a <meta id="foo" /> does not help. https://.../title#foo does not take you to the anchor. You are still left at the top of the page. <h2 id="foo"> works as expected, but we cannot add ids to heading tags since we are using ids for data-parsoid. So, realistic options are using <span id="foo"></span> or <a name="foo"></a> both of which work as expected.

Named anchors solution seems like a "clean" solution. However, this can potentially impact clients that inspect a-tags. So, needs some discussion with them.

We could use empty spans, but it feels like a hack.

Thoughts?

Or, Parsoid could generate a <h2><span id="..">heading here </span></h2> like the PHP parser does. In any case, no matter which solution we go with, Parsoid's serialization code needs to be fixed up to ignore these new elements.

One other thing I discovered is that the core code does not deduplicate ids if the heading ids are present elsewhere on some other element. For example:

<div id='x'>foo</div>
==x==

assigns id='x' to the heading as well, which is broken. Since we are going to dedupe ids for headings, we will dedupe ids across the board in Parsoid HTML.
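
A rough sketch of what deduping "across the board" could look like: every id in document order goes through the same pass, whether it comes from a heading or from any other element. The numeric suffix scheme (_2, _3, ...) and the function name dedupe_ids are assumptions for illustration, not something confirmed in this thread.

```python
def dedupe_ids(ids):
    """Sketch: make ids unique in document order.

    The first occurrence keeps its id; later collisions get a numeric
    suffix (an assumed scheme: _2, _3, ...).
    """
    seen = set()
    result = []
    for base in ids:
        candidate = base
        n = 1
        while candidate in seen:
            n += 1
            candidate = f"{base}_{n}"
        seen.add(candidate)
        result.append(candidate)
    return result


# The div's id comes first in the document, so the heading is the one renamed.
print(dedupe_ids(["x", "x"]))
```

With this approach the `<div id='x'>` in the example keeps its id and the `==x==` heading receives a distinct one, instead of both ending up with id='x'.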

In T102209#2787310, @ssastry wrote:

<h2 id="foo"> works as expected, but we cannot add ids to heading tags since we are using ids for data-parsoid.

Could you elaborate on why you can not reuse those human-readable ids, provided that they are unique? This case seems to be equivalent to reusing other user-supplied ids.

In T102209#2787419, @GWicke wrote:

In T102209#2787310, @ssastry wrote:

<h2 id="foo"> works as expected, but we cannot add ids to heading tags since we are using ids for data-parsoid.

Could you elaborate on why you can not reuse those human-readable ids, provided that they are unique? This case seems to be equivalent to reusing other user-supplied ids.

No reason .. I just didn't think about that option. ;-) But, yes, using these ids in headings can also work for data-parsoid.

In T102209#2787431, @ssastry wrote:

In T102209#2787419, @GWicke wrote:

In T102209#2787310, @ssastry wrote:

<h2 id="foo"> works as expected, but we cannot add ids to heading tags since we are using ids for data-parsoid.

Could you elaborate on why you can not reuse those human-readable ids, provided that they are unique? This case seems to be equivalent to reusing other user-supplied ids.

No reason .. I just didn't think about that option. ;-) But, yes, using these ids in headings can also work for data-parsoid.

But, the reason I started with the meta tag is the discussion we had about multiple ids at one point and the mention of the meta tag in the description.

In T102209#2787433, @ssastry wrote:

But, the reason I started with the meta tag is the discussion we had about multiple ids at one point and the mention of the meta tag in the description.

Makes sense. Do you plan to keep the old escaping, or go to cleaner utf8 fragments?

In T102209#2787464, @GWicke wrote:

In T102209#2787433, @ssastry wrote:

But, the reason I started with the meta tag is the discussion we had about multiple ids at one point and the mention of the meta tag in the description.

Makes sense. Do you plan to keep the old escaping, or go to cleaner utf8 fragments?

In https://gerrit.wikimedia.org/r/#/c/96892/, I am already generating html5 compliant ids which are the cleaner ones. But, in T59252: Cite: Reference extension outputs unescaped fragments in parsoid, there is discussion about older html4 style fragment ids ... and that there might be links to those out in the wild. I don't know whether it is worth generating additional empty-span tags (cannot be meta as we discovered) to keep those links unbroken.
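
To make the difference between the two fragment styles concrete, here is a small sketch. It assumes the older style is percent-encoding of the UTF-8 bytes with "%" replaced by "." (and ":" left readable), and that the html5 style only replaces whitespace with underscores. Both function names are hypothetical, and the legacy rules are an approximation rather than a copy of core's escaping.

```python
from urllib.parse import quote


def html5_id(text):
    """Sketch of the cleaner style: keep UTF-8, turn whitespace into underscores."""
    return "_".join(text.split())


def legacy_id(text):
    """Sketch of the older style (assumed): percent-encode, then use '.' for '%'."""
    escaped = quote(html5_id(text), safe="")
    # Keep ':' readable, then swap the percent sign for a dot.
    return escaped.replace("%3A", ":").replace("%", ".")


print(html5_id("Café au lait"))
print(legacy_id("Café au lait"))
```

Under these assumptions the two ids are identical for plain ASCII headings and differ only where non-ASCII or reserved characters appear, which is where existing links to old-style fragments would be affected.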

Change 320929 had a related patch set uploaded (by Subramanya Sastry):
WIP: T102209: Assign ids to headings to match core's section anchors

https://gerrit.wikimedia.org/r/320929

gerritbot added a project: Patch-For-Review. Nov 10 2016, 10:59 PM

I don't know whether it is worth generating additional empty-span tags (cannot be meta as we discovered) to keep those links unbroken.

My gut feeling would be no. Direct links to sections are likely less used in projects where the encoding would actually change for most titles, and they'd probably be fixed up fairly quickly. In any case, the experience would degrade somewhat gracefully, by still reaching the linked page.

The way to find out would be to drop the old-style escaping from IDs generated by the PHP parser.

In T102209#2787326, @ssastry wrote:

In T102209#2787310, @ssastry wrote:

So, it turns out that inserting a <meta id="foo" /> does not help. https://.../title#foo does not take you to the anchor. You are still left at the top of the page. <h2 id="foo"> works as expected, but we cannot add ids to heading tags since we are using ids for data-parsoid. So, realistic options are using <span id="foo"></span> or <a name="foo"></a> both of which work as expected.

Named anchors solution seems like a "clean" solution. However, this can potentially impact clients that inspect a-tags. So, needs some discussion with them.

We could use empty spans, but it feels like a hack.

Thoughts?

Or, Parsoid could generate a <h2><span id="..">heading here </span></h2> like the PHP parser does.

Since we want to get Parsoid HTML to read views, and given that there might be gadgets, bots, scripts, etc. that might rely on output of the PHP parser, what is the argument for not generating <h*><span class="mw-headline" id="..">..</span></h*> like the PHP parser generates?

The id can go on the <h*> tag or the <span> tag, but I am talking about adding back the inner <span> tag ... and at that point, we might as well put the id attribute on the <span>.

The main reason for adding those spans was displaying section edit links next to the headings with 2004 browser technology. These days, it seems likely that section edit links could be added without those span wrappers, and in any case we wouldn't want to add those links in Parsoid output.

Considering that Parsoid HTML is aimed more at a clean structural representation of content, I think it would be preferable to avoid legacy UI artifacts leaking into it without a concrete need. While you are right that there will likely be some code expecting those span wrappers, the cost of migration should be more than offset by avoiding the cost inflicted on every new consumer who would be expecting headings to contain regular heading content, rather than a mix of edit UI & content.

In T102209#2793520, @GWicke wrote:

The main reason for adding those spans was displaying section edit links next to the headings with 2004 browser technology. ...

I see ..

While you are right that there will likely be some code expecting those span wrappers, the cost of migration should be more than offset by avoiding the cost inflicted on every new consumer who would be expecting headings to contain regular heading content, rather than a mix of edit UI & content.

I am going to maintain https://www.mediawiki.org/wiki/Parsoid/Known_Differences_With_PHP_Parser_Output and start filling it up with proposed resolutions for each scenario (somewhat like https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy#Test_result_notes). In this case, the proposal is to have any bots and scripts that rely on the specific HTML fixed up.

• ssastry claimed this task. Nov 17 2016, 5:59 PM

• ssastry moved this task from Needs Triage to In Progress on the Parsoid board.

Change 320929 merged by jenkins-bot:
T102209: Assign ids to headings to match core's section anchors

https://gerrit.wikimedia.org/r/320929

This code has been merged and rt-tested. This will be live on the next deploy (probably next week).

Mentioned in SAL (#wikimedia-operations) [2016-12-07T21:36:38Z] <arlolra> updated Parsoid to version 3cf19c6b (T110910, T102209, T94949, T150112, T151570, T149209, T150213)

Stashbot mentioned this in T94949: Interwiki links to other MediaWiki wikis in the same cluster don't encode section fragment. Dec 7 2016, 9:36 PM

Stashbot mentioned this in T150112: Internal links pointing to interwikis are not encoded at all.

Stashbot mentioned this in T151570: Create Wikivoyage Finnish.

Stashbot mentioned this in T149209: Parsoid serialised an edit to a wikitext table adding a /n without stripping the double-pipes, breaking the table format (`\n|| align="right" | …`).

Stashbot mentioned this in T150213: Unknown contentmodels.

santhosh mentioned this in T154279: Cannot load a saved translation, JS error from Sizzle at $section.data( 'source' ). Jan 5 2017, 6:32 AM

Amire80 subscribed. Jan 5 2017, 7:53 AM

@ssastry, this caused some issues for ContentTranslation. See T154279: Cannot load a saved translation, JS error from Sizzle at $section.data( 'source' )

Arlolra merged a task: T138753: Spaces in section names in internal links are encoded incorrectly by Parsoid. Jul 24 2017, 6:10 PM

Arlolra added subscribers: He7d3r, Catrope, Zppix.
