
Investigation: Section headings for non-Latin languages
Closed, ResolvedPublic3 Estimated Story Points

Description

Gergo researched this problem in T75092: Anchors to section names for non-ASCII letters are encoded in the URL, as part of his work on MediaViewer.

Investigation: Is there a workable solution to the problem? What still needs to be done?

Event Timeline

IMO we should reach out to browser vendors. If they are willing to follow the path taken by Firefox and handle percent-encoding in the fragment part of the URL just as they do in the path/query part, we should go with that. (It's not an entirely trivial decision for them, even apart from committing resources, since the URI standard defines a URL with a percent-encoded path/query string as equal to the original, but defines fragments as different unless they are bytewise equal, which brings up some usability problems when users copy/paste URLs.)

If that does not work, then it's probably best to come up with our own fragment encoding scheme which leaves alphanumeric Unicode characters unchanged but encodes those which could cause security issues or break autolinking (which would require some research into how various tools do autolinking).
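A minimal sketch of what such an encoding scheme could look like, assuming Python; the function name and the exact safe-character set are my assumptions, not a worked-out proposal:

```python
import unicodedata
from urllib.parse import quote

# Hypothetical sketch: keep Unicode letters and digits readable, and
# percent-encode everything that could break autolinkers or enable
# injection (spaces, quotes, angle brackets, '%', and so on).
SAFE_PUNCT = set("-_.")

def encode_fragment(title: str) -> str:
    out = []
    for ch in title:
        # Unicode categories starting with L (letters) or N (numbers)
        # stay unescaped; everything else gets percent-encoded.
        if unicodedata.category(ch)[0] in ("L", "N") or ch in SAFE_PUNCT:
            out.append(ch)
        else:
            out.append(quote(ch, safe=""))
    return "".join(out)
```

The research question the comment raises is exactly which characters beyond letters and digits can safely stay raw.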

kaldari set the point value for this task to 3.Dec 20 2016, 10:52 PM
kaldari moved this task from Needs Discussion to Up Next on the Community-Tech board.

See T152540#2932496 for a good summary of the current situation with browser support.

This ticket has taught me so much. :)

Here's a brief introduction to the problem we're trying to solve:

From T152540#2932645, test 1 and test 2, the best solution seems to be switching to Unicode IDs and links. Although percent-encoded links are the standard, Chrome unfortunately displays them as percent-encoded in the address bar. Firefox, Chrome and Safari all handle Unicode links to Unicode IDs perfectly.

For the example above, if I change the link and anchor ID to the Unicode word, I get the section link https://ru.wikipedia.org/wiki/Вторая_мировая_война#Территории
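For comparison, the percent-encoded form of that fragment (what Chrome shows in the address bar) is easy to compute, here with Python's standard library:

```python
from urllib.parse import quote, unquote

section = "Территории"
encoded = quote(section, safe="")  # percent-encode the UTF-8 bytes
print(encoded)  # %D0%A2%D0%B5%D1%80%D1%80%D0%B8%D1%82%D0%BE%D1%80%D0%B8%D0%B8

# The two forms carry the same bytes; only the presentation differs.
assert unquote(encoded) == section
```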

Next steps:
We need to make sure that the Unicode IDs and links solution works in IE, Opera, etc. As for the migration options:

I'm leaning towards having an empty span to hold the backwards-compatible ID, since I'm not convinced that a JS mapping would be sufficiently robust. We can keep the backwards-compatible spans for many years; I don't think there's any rush to migrate links away from them.
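A rough sketch of the empty-span approach in Python. The dot-style legacy ID (UTF-8 bytes escaped as `.XX`) reflects my understanding of MediaWiki's old anchor encoding, so treat both the scheme and the function names as assumptions:

```python
import html
from urllib.parse import quote

def legacy_id(title: str) -> str:
    # Old-style anchor: spaces become underscores, then UTF-8 bytes are
    # escaped as ".XX" (percent-encoding with '.' instead of '%').
    return quote(title.replace(" ", "_"), safe="_").replace("%", ".")

def heading_html(title: str) -> str:
    # New-style raw Unicode ID on the heading itself; an empty span keeps
    # the legacy ID alive so old external links still resolve.
    new_id = title.replace(" ", "_")
    old_id = legacy_id(title)
    span = "" if old_id == new_id else f'<span id="{html.escape(old_id)}"></span>'
    return f'{span}<h2 id="{html.escape(new_id)}">{html.escape(title)}</h2>'
```

For plain ASCII titles the two IDs coincide and no extra span is needed, so the cost is limited to headings that actually contain non-ASCII characters.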

This would need more discussion but the idea for an automatic migration script (after exhaustive testing) for migrating internal references along with a span for the backwards-compatible ID seems like a good place to start.


Feel free to point out if I've missed something.

  • This is because the IETF standards require the fragment part to have Unicode characters encoded as %hh

The IETF standard (RFC 3986 / URI) requires all parts to be encoded that way. Most platforms (including MediaWiki) do it silently behind your back - you don't notice because browsers have gotten very good at hiding it. That's actually a relatively recent development: in 2010, Russian Wikipedia URLs looked like percent soup in anything except Firefox.

There is another IETF standard (RFC 3987 / IRI) which does allow unescaped Unicode characters, in all parts of the address. I haven't found any good information on what exactly it would mean to be compatible with RFC 3987 but not RFC 3986. Fragments are easier than the rest of the address in that they are not sent to the server, so we don't need to worry about compatibility with old web servers, only old browsers, but that's still something to worry about.
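To make the difference concrete, here is a simplified character-repertoire check (a sketch, not a full URI parser): RFC 3986 only admits ASCII characters plus percent-escapes, which is exactly what raw Unicode fragments violate, while RFC 3987 would allow most non-ASCII code points directly.

```python
import re

# RFC 3986 restricts a URI to percent-escape triplets plus this ASCII
# repertoire (unreserved + reserved characters). Anything outside it,
# such as raw Cyrillic, must be percent-encoded to stay a valid URI.
URI_CHARS = re.compile(r"(?:%[0-9A-Fa-f]{2}|[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=])*")

def is_rfc3986(url: str) -> bool:
    return URI_CHARS.fullmatch(url) is not None
```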

More problematically, there are various issues with how links with raw unicode fragments behave when used outside the browser. For example, you want to share the article and copy the text from the address bar to an email - will the recipient's email client recognize where that link ends? Say we want to add a feature that shares section links to Twitter - will Twitter correctly identify where the link ends or break it at the first comma or such? What happens when that link goes through some system that is not Unicode-aware? What happens if the fragment contains a <script> tag - are all the places where we put links ready for handling that securely?

Also (while this is unlikely to happen in practice) you get double-encoding problems if the section title happens to look like a percent-encoded string. E.g. if you have a section called %25 and use it unencoded as the fragment and ID, it will not work in Firefox, which decodes it to % and looks for an ID with that name.
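A minimal illustration of that trap:

```python
from urllib.parse import unquote

# A section literally titled "%25", placed raw in the fragment.
fragment = "%25"           # meant to match id="%25"
looked_up = unquote(fragment)
# Firefox decodes the fragment before matching, so it searches for
# id="%" and finds nothing.
assert looked_up == "%"
```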

This would need more discussion but the idea for an automatic migration script (after exhaustive testing) for migrating internal references along with a span for the backwards-compatible ID seems like a good place to start.

Why bother with a migration? We need to keep the old IDs anyway for backwards compatibility with links on external sites.

More problematically, there are various issues with how links with raw unicode fragments behave when used outside the browser. For example, you want to share the article and copy the text from the address bar to an email - will the recipient's email client recognize where that link ends? Say we want to add a feature that shares section links to Twitter - will Twitter correctly identify where the link ends or break it at the first comma or such? What happens when that link goes through some system that is not Unicode-aware? What happens if the fragment contains a <script> tag - are all the places where we put links ready for handling that securely?

I didn't even think of all these issues! I guess the best we can do is to test how some websites/apps behave with Unicode fragments. I kinda feel like by doing this we'll be setting a precedent for other websites and it'll serve as an encouragement for wider adoption of Unicode fragments in URLs. (Twitter would, I hope, feel foolish if Wikipedia links broke because it couldn't handle them correctly, and would work to fix it.)

This would need more discussion but the idea for an automatic migration script (after exhaustive testing) for migrating internal references along with a span for the backwards-compatible ID seems like a good place to start.

Why bother with a migration? We need to keep the old IDs anyway for backwards compatibility with links on external sites.

Hmm, I thought we'd do away with the spans eventually but yeah we can push that for later.

So it sounds like the consensus is to use unencoded raw Unicode section IDs and percent-encoded fragments in URLs (to maximize compatibility with other software), and also to file a bug against Chrome to have it decode the fragment in the URL bar like Firefox and Safari do. The only unresolved question is how we will handle backwards compatibility. The two options are: add an empty span with the old encoded ID, or have JavaScript handle scrolling the user to the right place. I agree with Tim that using an empty span is probably the best solution, as it utilizes the browser's built-in anchor handling. The JavaScript solution wouldn't be 100% effective, as I don't think it would be able to handle the case of a person focusing the address bar on a page that was already loaded and hitting return to jump back to the section (since there is no way in JavaScript to detect the address bar having focus and this action doesn't re-trigger the onload event). Any last thoughts or opinions before we close this?
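That consensus could be sketched as follows (function and parameter names are hypothetical): raw Unicode stays in the page's id attributes, while any URL we emit carries a percent-encoded fragment.

```python
from urllib.parse import quote

def section_url(base: str, title: str, section: str) -> str:
    # The HTML keeps id="Территории" as raw Unicode; the URL we hand out
    # percent-encodes both path and fragment for maximum compatibility
    # with mail clients, autolinkers, and URI validators.
    path = quote(title.replace(" ", "_"), safe="")
    frag = quote(section.replace(" ", "_"), safe="")
    return f"{base}/wiki/{path}#{frag}"
```

Browsers that decode for display (Firefox, Safari) will still show the readable form in the address bar.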

I kinda feel like by doing this we'll be setting a precedent for other websites and it'll serve as an encouragement for wider adoption of Unicode fragments in URLs.

I would like it more if we set a precedent by following web standards than the opposite :)

Also, figuring out where a Unicode string ends is tricky. https://hu.wikipedia.org/wiki/1848%E2%80%9349-es_forradalom_%C3%A9s_szabads%C3%A1gharc#1848._március_15. is a link to a section ID which ends in a dot (date formats in some languages do that so it's not uncommon). Other date formats don't so maybe I'm just closing my sentence with https://en.wikipedia.org/wiki/Timeline_of_the_Fukushima_Daiichi_nuclear_disaster#March_2011. How do you tell which is the case?
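The common autolinker heuristic of stripping trailing sentence punctuation shows why this is ambiguous; the URL below is hypothetical and the helper is a sketch:

```python
# Naive autolinker heuristic: trim trailing sentence punctuation from a
# token that looks like a URL. This mangles fragments that legitimately
# end in a dot, such as Hungarian date-style section titles.
def extract_link(token: str) -> str:
    return token.rstrip(".,;:!?")

url = "https://example.org/wiki/X#1848._március_15."
# The trailing dot is part of the anchor, but the heuristic chops it off:
assert extract_link(url) == url[:-1]
```

No local heuristic can distinguish this case from a link that merely ends a sentence; the tool would need to know the page's actual anchors.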

Then there are applications which have a form field for a URL and use the URI standard to validate it (say, PHP's FILTER_VALIDATE_URL built-in filter). These would start rejecting Wikipedia section links.

Using IRIs would open us up to a world of hurt, IMO.

The JavaScript solution wouldn't be 100% effective, as I don't think it would be able to handle the case of a person focusing the address bar on a page that was already loaded and hitting return to jump back to the section (since there is no way in JavaScript to detect the address bar having focus and this action doesn't re-trigger the onload event).

JavaScript in-page navigation usually feels less natural due to different timing (over which we wouldn't have great control, as it depends on JS loading order), especially with things that change the position of the section, like big CentralNotice banners or collapsible elements. (Those tend to result in poor UX with normal browser positioning as well, but it's even worse with JS-based positioning.) Plus we don't allow JavaScript for IE 6-8, and I'm pretty sure those would not get section titles right.

Also, JS-based navigation is tricky to get right. What if I click on the TOC link, scroll up, and click on the same link again? What if I click on a different link, then press the back button? What if I navigate back to the previous page (where I had a section title in the URL but wasn't actually looking at that section)? It's not impossible to get all of these right with JS, but it's a PITA. Reimplementing browser behavior should be avoided whenever possible.