Page MenuHomePhabricator

Links in table of contents don't work for certain characters
Closed, ResolvedPublic

Description

Steps to reproduce

  1. Visit https://en.wikipedia.beta.wmflabs.org/wiki/Test_ascii_toc
  2. Click on the 😭abc123_~"`!^*<>(){}#;:@&=+-$,\/?#[]%F0%9F%98%AD存在%😭 link in the Table of Contents

Expected results

  • Browser scrolls to top of section that the TOC link corresponds to

Actual results

  • Browser does nothing

Different browsers display different behavior depending on the link clicked in the aforementioned TOC:

TOC linkchrome 78safari 13.0.3firefox 65.0
😭abc123_~"`!^*<>(){}#;:@&=+-$,\/?#[]%F0%9F%98%AD存在%😭doesn't workdoesn't workdoesn't work
`%F0doesn't workworksdoesn't work

Please Note: This bug is closely related to T238364 where we are trying to register Overlay routes that match the .mw-headline id attribute but are having a hard time because of the way different browsers encode different characters in the location address. This may likely impact T18691 as well.

Event Timeline

Jdlrobson added a subscriber: Mooeypoo.

Hey @Mooeypoo
This is the bug I mentioned in the office. It also seems to be what's causing T238364

The issue that brought this about seems to be T152540. I saw that Max worked on that as part of the wishlist so I think it's in Community Tech's court - but maybe core platform is a better fit.
Either way I am keen to work out where to direct this bug to but it seems out of scope for my team.

Note this issue is also likely to impact T18691 so added Tech Comm for more visibility.

At a glance, "certain characters" is the percent sign, which MediaWiki fails to escape in IDs. And T238364 seems unrelated, that's the routing logic's failure to decode (and then probably denormalize) hash fragments.

And T238364 seems unrelated, that's the routing logic's failure to decode (and then probably denormalize) hash fragments.

We are decoding them. The problem is the browser seems to be accounting for the % sign.

You can see this for yourself

var test = '😭abc123_~"`!^*<>(){}#;:@&=+-$,\/?#[]%F0%9F%98%AD存在%😭';
encodeURIComponent(test) === $( '<a>' ).attr( 'href', '#' + test )[ 0 ].hash.slice( 1 )

which should return true but returns false

It's tripped up by the following characters (and others):
^#:;@&={}

The difference is encodeURIComponent will encode a { to %7B whereas in a hash fragment this does not get encoded.

My inkling is that the table of contents link is not incorrect, but the id it refers to doesn't match it.

Thus, it seems sensible to limit these known problematic characters at the parser level.

There are multiple valid ways to encode an URL, you should compare the decoded string (and possibly denormalize for good measure). Also, might be safer to use location.href - location.hash should also be encoded, but in the past this wasn't uniformly done across all browsers.

The two TOC links that don't work *are* incorrect - they contain an unencoded percent sign, which is a spec violation and results in ambiguity (as you can see when comparing how the long section title looks on the page and in the URL).

For extra confusion, Chrome at least refuses to decode the backtick when rendering it in the URL. Not sure what's up with that (might be a bug, or some weird security feature to make it harder to fool users into copying dangerous things) but it doesn't affect navigation, it's only a display thing.

CCicalese_WMF subscribed.

We do not think this is Platform Engineering work. Please retag if you need review.

@Jdlrobson hm, this seems to have been worked on in the 2016 wishlist and seemed to have worked since; is this a new / recent failure? I remember there was conversation about problems with UTF characters when we migrated to PHP7.2, could that be the problem as well?

Directly scope-wise, this isn't really in our realm anymore, as Comm Tech does tactical changes and it is not realistic for us to own every product we've ever touched. That said, I will try to see if anyone in the team has any historical knowledge about, perhaps, the challenges that were involved here. @kaldari might be able to share some knowledge since he seemed to have been involved in the previous ticket.

Krinkle renamed this task from Table of Contents Links With Certain Characters Don't Work to Links in table of contents don't work for certain characters.Nov 27 2019, 8:59 PM

Thanks for the input. I'm guessing this lies with the parsing team (possibly related to php7). The fix for this will lie in the parser since the table of contents and headings are generated by both.

So this is kind of an interesting bug.

The two TOC links that don't work *are* incorrect - they contain an unencoded percent sign, which is a spec violation and results in ambiguity (as you can see when comparing how the long section title looks on the page and in the URL).

The spec is a little ambigious on that. rfc3986 would probably say this entire effort (non-unicode characters in fragments) is a spec violation. HTML5 on the other hand seems to say they are opaque identifiers that are not to be interpreted at all. (If I'm reading that right).

Anyways, in practice it seems like the issue is combining percent encoding with characters that would normally be percent encoded. You can have one or the other (Oddly enough, a percent sign by itself seems to be always ok in my testing). So for example, 😭abc123_~"!^*<>(){}#;:@&=+-$,\/?#[]存在%😭 with no percent encoded characters works fine and so does abc123_~!^*(){}#;:@&=+-$,\/?#[]%F0%9F%98%AD%` with no uri reserved characters (Note, the lone percent works fine in both of them).

Anyways, best fix seems to be anything that can form a valid percent encoding should be escaped as %25.

In my other testing, it seems like carriage returns aren't handled properly either, although i suppose that's a pretty extreme edge case.

Change 572403 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Escape % sign if form valid percent-encoding in fragment identifiers

https://gerrit.wikimedia.org/r/572403

Change 572405 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Make id attributes not include ascii whitespace per spec

https://gerrit.wikimedia.org/r/572405

Change 572403 merged by jenkins-bot:
[mediawiki/core@master] Escape % sign if form valid percent-encoding in fragment identifiers

https://gerrit.wikimedia.org/r/572403

Change 574736 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/extensions/Cite@master] Fix unit tests for whitespace change in Html5 fragments

https://gerrit.wikimedia.org/r/574736

Change 577610 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Make id attributes not include ascii whitespace per spec

https://gerrit.wikimedia.org/r/577610

Change 577610 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Make id attributes not include ascii whitespace per spec

https://gerrit.wikimedia.org/r/577610

Change 572405 merged by jenkins-bot:
[mediawiki/core@master] Make id attributes not include ascii whitespace per spec

https://gerrit.wikimedia.org/r/572405

Change 574736 merged by jenkins-bot:
[mediawiki/extensions/Cite@master] Fix unit tests for whitespace change in Html5 fragments

https://gerrit.wikimedia.org/r/574736

Does this need testing/resolving?

matmarex assigned this task to Bawolff.

Seems to have been resolved by the patches a year ago.