Maniphest T238385

Links in table of contents don't work for certain characters
Closed, ResolvedPublic
Actions

Description

Steps to reproduce

Visit https://en.wikipedia.beta.wmflabs.org/wiki/Test_ascii_toc
Click on the 😭abc123_~"`!^*<>(){}#;:@&=+-$,\/?#[]%F0%9F%98%AD存在%😭 link in the Table of Contents

Expected results

Browser scrolls to top of section that the TOC link corresponds to

Actual results

Browser does nothing

Different browsers display different behavior depending on the link clicked in the aforementioned TOC:

TOC link	chrome 78	safari 13.0.3	firefox 65.0
😭abc123_~"`!^*<>(){}#;:@&=+-$,\/?#[]%F0%9F%98%AD存在%😭	doesn't work	doesn't work	doesn't work
`%F0	doesn't work	works	doesn't work

Please Note: This bug is closely related to T238364 where we are trying to register Overlay routes that match the .mw-headline id attribute but are having a hard time because of the way different browsers encode different characters in the location address. This may likely impact T18691 as well.

Details

Subject	Repo	Branch	Lines +/-
Make id attributes not include ascii whitespace per spec	mediawiki/core	master	+12 -1
Fix unit tests for whitespace change in Html5 fragments	mediawiki/extensions/Cite	master	+1 -2
Make id attributes not include ascii whitespace per spec	mediawiki/services/parsoid	master	+5 -1
Escape % sign if form valid percent-encoding in fragment identifiers	mediawiki/core	master	+32 -13

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	MaxSem	T75092 Anchors to section names for non-ASCII letters are encoded in the URL
Resolved	matmarex	T172493 mw.util.escapeId should log deprecation warning
Resolved	MaxSem	T152540 Migrate to HTML5 section ids
Resolved	Bawolff	T238385 Links in table of contents don't work for certain characters

Event Timeline

• nray created this task.Nov 14 2019, 10:48 PM

Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptNov 14 2019, 10:48 PM

• nray mentioned this in T238364: [Bug] Simplified Talk Page Does Not Support Sections With Non-ascii Characters/Several ascii characters aren't supported either.Nov 14 2019, 10:49 PM

Aklapper removed projects: TechCom-RFC (TechCom-RFC-Closed), User-notice, Mobile-Content-Service, Community-Wishlist-Survey-2016, Community-Tech, ContentTranslation.Nov 14 2019, 11:36 PM

Lucas_Werkmeister_WMDE unsubscribed.Nov 15 2019, 11:29 AM

• Jhernandez removed a project: Product-Infrastructure-Team-Backlog-Deprecated.Nov 20 2019, 4:49 PM

MaxSem removed MaxSem as the assignee of this task.Nov 22 2019, 11:07 PM

Krinkle unsubscribed.Nov 22 2019, 11:35 PM

Krinkle removed subscribers: • GWicke, santhosh, Amire80 and 2 others.Nov 22 2019, 11:37 PM

Snaevar unsubscribed.Nov 24 2019, 8:42 PM

Hey @Mooeypoo
This is the bug I mentioned in the office. It also seems to be what's causing T238364

The issue that brought this about seems to be T152540. I saw that Max worked on that as part of the wishlist so I think it's in Community Tech's court - but maybe core platform is a better fit.
Either way I am keen to work out where to direct this bug to but it seems out of scope for my team.

Note this issue is also likely to impact T18691 so added Tech Comm for more visibility.

At a glance, "certain characters" is the percent sign, which MediaWiki fails to escape in IDs. And T238364 seems unrelated, that's the routing logic's failure to decode (and then probably denormalize) hash fragments.

And T238364 seems unrelated, that's the routing logic's failure to decode (and then probably denormalize) hash fragments.

We are decoding them. The problem is the browser seems to be accounting for the % sign.

You can see this for yourself

var test = '😭abc123_~"`!^*<>(){}#;:@&=+-$,\/?#[]%F0%9F%98%AD存在%😭';
encodeURIComponent(test) === $( '<a>' ).attr( 'href', '#' + test )[ 0 ].hash.slice( 1 )

which should return true but returns false

It's tripped up by the following characters (and others):
^#:;@&={}

The difference is encodeURIComponent will encode a { to %7B whereas in a hash fragment this does not get encoded.

My inkling is that the table of contents link is not incorrect, but the id it refers to doesn't match it.

Thus, it seems sensible to limit these known problematic characters at the parser level.

There are multiple valid ways to encode an URL, you should compare the decoded string (and possibly denormalize for good measure). Also, might be safer to use location.href - location.hash should also be encoded, but in the past this wasn't uniformly done across all browsers.

The two TOC links that don't work *are* incorrect - they contain an unencoded percent sign, which is a spec violation and results in ambiguity (as you can see when comparing how the long section title looks on the page and in the URL).

For extra confusion, Chrome at least refuses to decode the backtick when rendering it in the URL. Not sure what's up with that (might be a bug, or some weird security feature to make it harder to fool users into copying dangerous things) but it doesn't affect navigation, it's only a display thing.

Anomie moved this task from Inbox to Triage Meeting Inbox on the Platform Engineering board.Nov 26 2019, 9:43 PM

pmiazga subscribed.Nov 27 2019, 6:20 PM

We do not think this is Platform Engineering work. Please retag if you need review.

@Jdlrobson hm, this seems to have been worked on in the 2016 wishlist and seemed to have worked since; is this a new / recent failure? I remember there was conversation about problems with UTF characters when we migrated to PHP7.2, could that be the problem as well?

Directly scope-wise, this isn't really in our realm anymore, as Comm Tech does tactical changes and it is not realistic for us to own every product we've ever touched. That said, I will try to see if anyone in the team has any historical knowledge about, perhaps, the challenges that were involved here. @kaldari might be able to share some knowledge since he seemed to have been involved in the previous ticket.

Krinkle renamed this task from Table of Contents Links With Certain Characters Don't Work to Links in table of contents don't work for certain characters.Nov 27 2019, 8:59 PM

daniel moved this task from Inbox to Watching on the TechCom board.Nov 27 2019, 9:22 PM

Thanks for the input. I'm guessing this lies with the parsing team (possibly related to php7). The fix for this will lie in the parser since the table of contents and headings are generated by both.

• ssastry removed projects: Parsoid, Parsing-Team--ARCHIVED.Feb 14 2020, 11:53 PM

So this is kind of an interesting bug.

The two TOC links that don't work *are* incorrect - they contain an unencoded percent sign, which is a spec violation and results in ambiguity (as you can see when comparing how the long section title looks on the page and in the URL).

The spec is a little ambigious on that. rfc3986 would probably say this entire effort (non-unicode characters in fragments) is a spec violation. HTML5 on the other hand seems to say they are opaque identifiers that are not to be interpreted at all. (If I'm reading that right).

Anyways, in practice it seems like the issue is combining percent encoding with characters that would normally be percent encoded. You can have one or the other (Oddly enough, a percent sign by itself seems to be always ok in my testing). So for example, 😭abc123_~"!^*<>(){}#;:@&=+-$,\/?#[]存在%😭 with no percent encoded characters works fine and so does abc123_~!^*(){}#;:@&=+-$,\/?#[]%F0%9F%98%AD%` with no uri reserved characters (Note, the lone percent works fine in both of them).

Anyways, best fix seems to be anything that can form a valid percent encoding should be escaped as %25.

In my other testing, it seems like carriage returns aren't handled properly either, although i suppose that's a pretty extreme edge case.

Change 572403 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Escape % sign if form valid percent-encoding in fragment identifiers

https://gerrit.wikimedia.org/r/572403

gerritbot added a project: Patch-For-Review.Feb 15 2020, 9:35 AM

Change 572405 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Make id attributes not include ascii whitespace per spec

https://gerrit.wikimedia.org/r/572405

Change 572403 merged by jenkins-bot:
[mediawiki/core@master] Escape % sign if form valid percent-encoding in fragment identifiers

https://gerrit.wikimedia.org/r/572403

ReleaseTaggerBot added a project: MW-1.35-notes (1.35.0-wmf.21; 2020-02-25).Feb 25 2020, 12:00 PM

Change 574736 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/extensions/Cite@master] Fix unit tests for whitespace change in Html5 fragments

https://gerrit.wikimedia.org/r/574736

Change 577610 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Make id attributes not include ascii whitespace per spec

https://gerrit.wikimedia.org/r/577610

Change 577610 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Make id attributes not include ascii whitespace per spec

https://gerrit.wikimedia.org/r/577610

Change 572405 merged by jenkins-bot:
[mediawiki/core@master] Make id attributes not include ascii whitespace per spec

https://gerrit.wikimedia.org/r/572405

Change 574736 merged by jenkins-bot:
[mediawiki/extensions/Cite@master] Fix unit tests for whitespace change in Html5 fragments

https://gerrit.wikimedia.org/r/574736

ReleaseTaggerBot edited projects, added MW-1.35-notes (1.35.0-wmf.23; 2020-03-10); removed MW-1.35-notes (1.35.0-wmf.21; 2020-02-25).Mar 7 2020, 2:42 PM

Does this need testing/resolving?

Jdlrobson mentioned this in T230293: [Chrome] On mobile web, links to sections whose names include double quote " work incorrectly.Mar 31 2020, 4:31 PM

Jdlrobson added a project: Desktop Improvements (Vector 2022).Mar 30 2021, 7:06 PM

Seems to have been resolved by the patches a year ago.

Lahwaacz mentioned this in T201909: $wgFragmentMode='html5' breaks links to titles with tabs.Aug 30 2021, 8:03 PM

Jdrewniak mentioned this in T297614: ToC: Allow the ToC to bold the correct section and subsection of the Table of Contents.Jan 11 2022, 7:11 PM

Links in table of contents don't work for certain charactersClosed, ResolvedPublicActions