Derive heading ids from heading name, the same way MW core does
Closed, Resolved (Public)

Description

MediaWiki core creates ids per heading based on the content of that heading. This is used to link to the given section from the table of contents, and is also often used by users to reference specific sections.

It would be great if Parsoid implemented the same ids, possibly in addition to the current random ids (using a meta tag?).

See also the email thread starting with https://lists.wikimedia.org/pipermail/mobile-l/2015-October/009886.html. As @Tgr points out the ids need to be unique, too.

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke added a project: Parsoid.
GWicke subscribed.
Arlolra triaged this task as Medium priority. (Jul 7 2015, 2:56 AM)
Arlolra subscribed.
GWicke added subscribers: cscott, Jdforrester-WMF.

See also T59252 for a discussion of the historic fragment ID encoding vs. HTML5.

So, heading anchors are generated from wikitext of a heading, not the innerHTML of the heading. So, how is MCS doing this right now?

I ask because Parsoid doesn't have easy access to the wikitext of individual nodes that come from template content. It only has this for top-level content. So, I am trying to figure out how to deal with this. Of course, most headings are plain text, and the innerHTML of the heading will be equivalent to the wikitext, especially since the formatHeadings code strips HTML tags and normalizes whitespace. But, this fails for headings that have quotes, links, or templates.

Scratch all that. I was wrong. It looks like the heading anchors are generated from the HTML, not the wikitext. I misinterpreted $text (the parameter to formatHeadings) as wikitext, but it is actually HTML.

IIRC sometimes the {{}} of the template is visible in the ID; it depends on which of several anchor encoding code paths are called. (I think there is one for the actual page HTML, one for the links in page history / recentchanges, and one for returning you where you were after clicking on a section edit link and saving? I might be confusing things, it was a long time ago I looked at this.)

In the enwp sandbox, I tested == {{1x|1=moo and ''boo'' and [[gah]] and {{1x|1=wtf}} x}} == and the heading anchor is <span class="mw-headline" id="moo_and_boo_and_gah_and_wtf_x"> which doesn't have any of the wikitext chars in it.
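
To make that concrete, here is a minimal Python sketch of deriving an anchor from the heading's HTML rather than its wikitext. It is not core's formatHeadings code: the `heading_anchor` helper and the expanded HTML in `sample` are hypothetical, and the sketch only assumes the behaviour described above (strip tags, normalize whitespace, turn spaces into underscores).

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def heading_anchor(heading_html: str) -> str:
    """Hypothetical helper: derive an anchor id from a heading's inner HTML."""
    parser = _TextOnly()
    parser.feed(heading_html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace, trim, then use underscores for spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text.replace(" ", "_")


# Assumed (not verified) expansion of the sandbox heading tested above.
sample = 'moo and <i>boo</i> and <a href="./Gah" title="Gah">gah</a> and wtf x'
print(heading_anchor(sample))
```

With that assumed input the helper yields the same id the sandbox test showed, with no wikitext characters in it.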

So, it turns out that inserting a <meta id="foo" /> does not help. https://.../title#foo does not take you to the anchor. You are still left at the top of the page. <h2 id="foo"> works as expected, but we cannot add ids to heading tags since we are using ids for data-parsoid. So, the realistic options are <span id="foo"></span> or <a name="foo"></a>, both of which work as expected.

The named-anchor solution seems like the "clean" one. However, it could affect clients that inspect a-tags, so it needs some discussion with them.

We could use empty spans, but it feels like a hack.

Thoughts?

Or, Parsoid could generate an <h2><span id="..">heading here </span></h2> like the PHP parser does. In any case, no matter which solution we go with, Parsoid's serialization code needs to be fixed up to ignore these new elements.

One other thing I discovered is that the core code does not deduplicate ids if the heading ids are present elsewhere on some other element. For example:

<div id='x'>foo</div>
==x==

assigns id='x' to the heading as well, which is broken because the document then contains two elements with the same id. Since we are going to dedupe ids for headings, we will dedupe them across the board in Parsoid HTML.
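
A rough sketch of what deduping "across the board" could look like, in Python for illustration only. The `dedupe_id` helper is hypothetical and is not Parsoid's code; the `_2`, `_3` suffix scheme is an assumption modelled on how core numbers repeated headings.

```python
def dedupe_id(candidate: str, seen: set) -> str:
    """Hypothetical helper: return an id not yet in `seen`, and record it."""
    unique = candidate
    n = 2
    while unique in seen:
        # Assumed suffix scheme: x, x_2, x_3, ...
        unique = f"{candidate}_{n}"
        n += 1
    seen.add(unique)
    return unique


# Ids already present on other elements, e.g. <div id='x'>foo</div>.
seen = {"x"}
print(dedupe_id("x", seen))  # heading ==x== collides with the div
print(dedupe_id("x", seen))  # a second ==x== heading collides with both
```

The point of seeding `seen` with every id in the document, not just heading ids, is that the heading can no longer collide with the user-supplied id on the div.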

<h2 id="foo"> works as expected, but we cannot add ids to heading tags since we are using ids for data-parsoid.

Could you elaborate on why you can not reuse those human-readable ids, provided that they are unique? This case seems to be equivalent to reusing other user-supplied ids.

No reason .. I just didn't think about that option. ;-) But, yes, using these ids in headings can also work for data-parsoid.

But, the reason I started with the meta tag is the discussion we had about multiple ids at one point and the mention of the meta tag in the description.

Makes sense. Do you plan to keep the old escaping, or go to cleaner utf8 fragments?

In https://gerrit.wikimedia.org/r/#/c/96892/, I am already generating html5 compliant ids which are the cleaner ones. But, in T59252: Cite: Reference extension outputs unescaped fragments in parsoid, there is discussion about older html4 style fragment ids ... and that there might be links to those out in the wild. I don't know whether it is worth generating additional empty-span tags (cannot be meta as we discovered) to keep those links unbroken.
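
For anyone unfamiliar with the two styles, here is a rough Python approximation of the difference. Both helper names are hypothetical, and neither function is a faithful port of core's Sanitizer: the legacy variant only assumes the general scheme of percent-encoding the UTF-8 bytes and writing `.` in place of `%`.

```python
from urllib.parse import quote


def html5_fragment(text: str) -> str:
    """Hypothetical helper: cleaner id that keeps non-ASCII characters."""
    return text.strip().replace(" ", "_")


def legacy_fragment(text: str) -> str:
    """Hypothetical helper: approximate HTML4-style escaped id."""
    # Percent-encode the UTF-8 bytes (leaving ':' alone), then swap '%' for '.'.
    encoded = quote(text.strip().replace(" ", "_"), safe=":")
    return encoded.replace("%", ".")


title = "Café au lait"
print(html5_fragment(title))
print(legacy_fragment(title))
```

Existing links in the wild would carry the second form, which is why dropping it changes the fragment for any heading with non-ASCII characters.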

Change 320929 had a related patch set uploaded (by Subramanya Sastry):
WIP: T102209: Assign ids to headings to match core's section anchors

https://gerrit.wikimedia.org/r/320929

I don't know whether it is worth generating additional empty-span tags (cannot be meta as we discovered) to keep those links unbroken.

My gut feeling would be no. Direct links to sections are likely less used in projects where the encoding would actually change for most titles, and they'd probably be fixed up fairly quickly. In any case, the experience would degrade somewhat gracefully, by still reaching the linked page.

The way to find out would be to drop the old-style escaping from IDs generated by the PHP parser.

Or, Parsoid could generate a <h2><span id="..">heading here </span></h2> like the PHP parser does.

Since we want to get Parsoid HTML to read views, and given that there might be gadgets, bots, scripts, etc. that might rely on output of the PHP parser, what is the argument for not generating <h*><span class="mw-headline" id="..">..</span></h*> like the PHP parser generates?

The id can go on the <h*> tag or the <span> tag, but I am talking about adding back the inner <span> tag ... and at that point, we might as well put the id attribute on the <span>.

The main reason for adding those spans was displaying section edit links next to the headings with 2004 browser technology. These days, it seems likely that section edit links could be added without those span wrappers, and in any case we wouldn't want to add those links in Parsoid output.

Considering that Parsoid HTML is aimed more at a clean structural representation of content, I think it would be preferable to avoid legacy UI artifacts leaking into it without a concrete need. While you are right that there will likely be some code expecting those span wrappers, the cost of migration should be more than offset by avoiding the cost inflicted on every new consumer who would be expecting headings to contain regular heading content, rather than a mix of edit UI & content.

The main reason for adding those spans was displaying section edit links next to the headings with 2004 browser technology. ...

I see ..

While you are right that there will likely be some code expecting those span wrappers, the cost of migration should be more than offset by avoiding the cost inflicted on every new consumer who would be expecting headings to contain regular heading content, rather than a mix of edit UI & content.

I am going to maintain https://www.mediawiki.org/wiki/Parsoid/Known_Differences_With_PHP_Parser_Output and start filling it up with proposed resolutions for each scenario (somewhat like https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy#Test_result_notes). In this case, the proposal is that any bots and scripts that reference the specific HTML be fixed up.

Change 320929 merged by jenkins-bot:
T102209: Assign ids to headings to match core's section anchors

https://gerrit.wikimedia.org/r/320929

This code has been merged and rt-tested. This will be live on the next deploy (probably next week).