Migrate to HTML5 section ids
Open, Normal, Public

Description

Current PHP parser section ids are derived from heading content using an old, ad-hoc escaping scheme invented by yours truly to satisfy the unreasonable demands of XHTML4. This scheme percent-encodes the section id, and then replaces each percent sign with a full stop (.).
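As a rough illustration of the scheme described above, here is a minimal Python sketch. It is not the PHP parser's code: the function name legacy_section_id is hypothetical, the space-to-underscore step is an assumption based on the ids visible in wiki URLs, and the real implementation may special-case more characters.

```python
from urllib.parse import quote


def legacy_section_id(heading: str) -> str:
    """Approximate the old-style id: percent-encode, then swap '%' for '.'."""
    # Assumption: spaces become underscores before encoding.
    with_underscores = heading.replace(" ", "_")
    # Percent-encode the UTF-8 bytes of everything except letters, digits and _ . - ~
    encoded = quote(with_underscores, safe="")
    # Replace every percent sign with a full stop.
    return encoded.replace("%", ".")


print(legacy_section_id("can't be reached"))  # can.27t_be_reached
print(legacy_section_id("-{foo}-"))           # -.7Bfoo.7D-
```

Non-ASCII characters expand to one ".XX" group per UTF-8 byte, which is why non-Latin headings become unreadable in the address bar.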

The purpose of making this change is to satisfy the #3 wish on this year's Community Wishlist:
"Non-Latin section headings are displayed terribly in URL anchors and can't be reached directly":
https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Miscellaneous#Non-Latin_section_headings_are_displayed_terribly_in_URL_anchors_and_can.27t_be_reached_directly

HTML5 supports UTF-8 ids with only minimal escaping (of the hash sign in particular). Switching to HTML5 section anchors would have several advantages:

  • More readable section links, especially in non-ASCII languages.
  • Simplified editing clients without a need to implement a legacy algorithm for deriving section ids from headings.
  • Simplified / more accurate DOM spec documentation.

Disadvantages include:

  • Breaks incoming links to sections from elsewhere on the internet.
  • Potentially breaks internal links to sections.

Migration options

Client-side fall-back

Client-side JS can look at the URL hash, and check if the section was found. If it was not found, it can encode headings using the old-style escaping algorithm, and check if the hash matches any of those. If it does, rewrite the hash to the matching new-style section.

As a result, users would be encouraged to fix links to new-style section ids. Existing links (both internal and external) would continue to work as long as the fall-back JS is active.
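The matching step of this fall-back can be sketched as follows. The real fall-back would be client-side JS working on the live DOM; this Python version only shows the logic, and the names legacy_encode and resolve_fragment are hypothetical. The encoder is a simplified stand-in for the old-style escaping, not the parser's actual algorithm.

```python
from typing import Iterable, Optional
from urllib.parse import quote


def legacy_encode(section_id: str) -> str:
    # Simplified stand-in for the old-style escaping: percent-encode, then '%' -> '.'.
    return quote(section_id, safe="").replace("%", ".")


def resolve_fragment(fragment: str, page_ids: Iterable[str]) -> Optional[str]:
    """Return the id to scroll to, or None if nothing on the page matches."""
    ids = list(page_ids)
    # 1. The fragment already names an element on the page: nothing to do.
    if fragment in ids:
        return fragment
    # 2. Otherwise encode each new-style id the old way and compare.
    for new_id in ids:
        if legacy_encode(new_id) == fragment:
            return new_id  # the hash would be rewritten to this new-style id
    return None


print(resolve_fragment("Mot.C3.B6rhead", ["History", "Motörhead"]))  # Motörhead
```

Encoding the page's headings and comparing avoids having to decode the incoming hash, which is the ambiguous direction.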

Automatic migration of internal references

Idea: Recognize old style escape pattern (/\.[0-9A-F]{2}/), and rewrite the full stop to %, then decode.

Issues:

  • Chance of false positives and conversion failures.
  • Only helps with internal links.
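A minimal sketch of the rewrite idea and both failure modes, assuming the old escaping produced uppercase hex of UTF-8 bytes. The function name decode_legacy_fragment is hypothetical.

```python
import re
from typing import Optional
from urllib.parse import unquote

# The old-style escape pattern from the idea above: a full stop plus two hex digits.
LEGACY_ESCAPE = re.compile(r"\.([0-9A-F]{2})")


def decode_legacy_fragment(fragment: str) -> Optional[str]:
    """Rewrite '.XX' to '%XX' and percent-decode; None if the bytes are not valid UTF-8."""
    percent_form = LEGACY_ESCAPE.sub(r"%\1", fragment)
    try:
        return unquote(percent_form, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return None  # conversion failure


print(decode_legacy_fragment("can.27t"))       # can't
print(decode_legacy_fragment("Version_1.2B"))  # Version_1+
print(decode_legacy_fragment("Foo.C3"))        # None
```

The second call is a false positive if the heading really was "Version 1.2B": the old scheme left full stops alone, so a literal ".2B" and an escaped "+" are indistinguishable. The third is a conversion failure, because a lone 0xC3 byte is not valid UTF-8.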

Add old style & new style section ids for a while

Pros:

  • Keeps links working during transition period

Cons:

  • Complicates HTML
  • Does not encourage a migration / surface correct section ID
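For illustration, a sketch of what emitting both ids could look like, loosely modelled on the parser output quoted later in this thread (an empty span carrying the old id next to the mw-headline span). The function heading_html is hypothetical, the new-style escaping is assumed to be just space-to-underscore, and skipping the extra span when both ids coincide is an assumption too.

```python
from html import escape
from urllib.parse import quote


def heading_html(text: str, level: int = 2) -> str:
    """Render a heading carrying both a new-style and an old-style id."""
    new_id = text.replace(" ", "_")                    # assumed minimal HTML5 escaping
    old_id = quote(new_id, safe="").replace("%", ".")  # simplified old-style escaping
    # An empty span carries the old id so existing links keep resolving.
    legacy_span = "" if old_id == new_id else f'<span id="{escape(old_id)}"></span>'
    return (
        f"<h{level}>{legacy_span}"
        f'<span class="mw-headline" id="{escape(new_id)}">{escape(text)}</span>'
        f"</h{level}>"
    )


print(heading_html("-{foo}-"))
# <h2><span id="-.7Bfoo.7D-"></span><span class="mw-headline" id="-{foo}-">-{foo}-</span></h2>
```

The extra element per heading is the "complicates HTML" cost listed above.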

@Krinkle The problem statement is the #3 wish on this year's Community Wishlist:
"Non-Latin section headings are displayed terribly in URL anchors and can't be reached directly":
https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Miscellaneous#Non-Latin_section_headings_are_displayed_terribly_in_URL_anchors_and_can.27t_be_reached_directly

I added that to the top of the ticket, sorry that wasn't clear.

The patch implementing this has been merged, so at this point I'm confused about the point of continuing to go through this formal process.

  1. No, you understood this incorrectly: the primary ID's location remains the same, no matter what encoding it uses. Primary is primary, secondary is secondary.
  2. There's a patchset implementing this, but its current status is "buried in bikeshedding".

The first thing is solved only for Firefox...

Actually, it's solved for every browser except Chrome.

Sorry, where did you get that info from? It isn't solved for Opera; @MaxSem stated at T157729#3462443 that it isn't solved for IE/Edge; and @Tgr stated in 2016 at T75092#2812110 that it's solved only for Firefox. I think I saw a stats page somewhere, but I can't find it.

@Krinkle The problem statement is the #3 wish on this year's Community Wishlist:
"Non-Latin section headings are displayed terribly in URL anchors and can't be reached directly":
https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Miscellaneous#Non-Latin_section_headings_are_displayed_terribly_in_URL_anchors_and_can.27t_be_reached_directly

Yeah, I wrote it, which is why I'm now confused by where solving this problem has ended up for now: the current solution addresses neither the "displayed terribly in URL anchors" part nor the "can't be reached directly" part.

@MaxSem has raised the following question at https://lists.wikimedia.org/pipermail/wikitech-l/2017-August/088559.html:

"do we really need to percent-encode the IDs? There is extensive discussion of that in the aforementioned task, concluding that percent-encoding is probably more "correct". However, not escaping it gives much better browser compatibility (close to 100%). We can change this at any time because no links will be broken due to the way browsers handle percent-encoded fragments."

Is there a solid argument why it is "more "correct""? Links that may end up broken, maybe? Was it investigated? To what extent does it justify having fragments percent-encoded?

Joe added a subscriber: Joe. Aug 30 2017, 10:23 AM

IRC meeting on this happening NOW in the #wikimedia-office channel

Change 375099 had a related patch set uploaded (by Krinkle; owner: MaxSem):
[mediawiki/core@master] Don't percent-encode HTML5 IDs

https://gerrit.wikimedia.org/r/375099

daniel added a comment. Edited Sep 1 2017, 5:53 PM

The IRC meeting happened on August 30. Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-08-30-21.02.log.html

Thank you Tim for chairing!

Summary (according to me):

  • move element with legacy ID into the header (<h...>) tag for DOM compatibility.
  • do not percent-encode, since not all browsers show this nicely in the address bar.

"move element with legacy ID into the header tag for DOM compatibility."

When I first read this I thought you meant the <head> or <header> tags. After reading the transcript I think you mean the heading tags (<h1>, <h2>, ...) instead.

Oh, I see, sorry. I updated my comment.

Change 377659 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] [WIP] Update Parsoid to generate modern IDs w/ legacy fallback

https://gerrit.wikimedia.org/r/377659

Change 378357 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Start migration to Unicode sections everywhere

https://gerrit.wikimedia.org/r/378357

Seb35 added a subscriber: Seb35. Sep 17 2017, 11:16 AM

I was curious to see the result and didn’t find live examples with MediaWiki, so I created one on https://test.wikipedia.org/wiki/HTML_5_section_IDs.

As a French speaker, thanks a lot for this feature!

Change 378357 merged by jenkins-bot:
[operations/mediawiki-config@master] Start migration to Unicode sections everywhere

https://gerrit.wikimedia.org/r/378357

There seems to be an interaction with LanguageConverter when the new HTML5 ids end up containing -{...}- markup.

Previously we could rely on the fact that LanguageConverter markup was *not* expanded when creating the section IDs, i.e.:

$ (echo '==-{foo}-==' ) | php maintenance/parse.php --quiet
<div class="mw-parser-output"><h2><span id="-.7Bfoo.7D-"></span><span class="mw-headline" id="-{foo}-">-{foo}-</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/~cananian/mediawiki/index.php?title=CLIParser&amp;veaction=edit&amp;section=1" class="mw-editsection-visualeditor" title="Edit section: -{foo}-">edit</a><span class="mw-editsection-divider"> | </span><a href="/~cananian/mediawiki/index.php?title=CLIParser&amp;action=edit&amp;section=1" title="Edit section: -{foo}-">edit source</a><span class="mw-editsection-bracket">]</span></span></h2>
</div>

But in some situations the id="-{foo}-" is getting fed back to language converter again, resulting in the ID being munged: T176176: HTML5 ids seems to change how wikilink fragments are parsed (when LanguageConverter is enabled)

Change 383473 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Switch test wikis to HTML5 fragment mode in links

https://gerrit.wikimedia.org/r/383473

Change 383473 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch test wikis to HTML5 fragment mode in links

https://gerrit.wikimedia.org/r/383473

Mentioned in SAL (#wikimedia-operations) [2017-10-12T23:13:12Z] <dereckson@tin> Synchronized wmf-config/InitialiseSettings.php: Switch test wikis to HTML5 fragment mode in links (T152540) (duration: 00m 47s)

@MaxSem: Any bug reports from the test roll-out? Seems to work pretty well: https://test.wikipedia.org/wiki/Motörhead#The_rise_of_Motörhead.

When should we expect a roll-out to Russian Wikipedia?

Today was all packed, but I grabbed a window tomorrow.

ssastry moved this task from Backlog to In Progress on the Parsoid board. Nov 6 2017, 6:51 PM

I see this was already deployed to ru-wiki, what about other languages? Is there a schedule?

As soon as we fix all the problems discovered on Russian projects.

@MaxSem I'm curious. Do you have any links explaining these problems?

Updated the dependency tree with these.

Change 394104 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394104

Krinkle edited projects, added TechCom-RFC (TechCom-Approved); removed TechCom-RFC.
Krinkle moved this task from Backlog to In progress on the TechCom-RFC (TechCom-Approved) board.

Change 394104 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394104

I didn't actually deploy the above change because the wikis are still running wmf.8, it'll have to wait.

Change 394460 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394460

Change 394460 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394460

Change 377659 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Update Parsoid to generate modern HTML5 IDs w/ legacy fallback

https://gerrit.wikimedia.org/r/377659

Change 395831 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/extensions/VisualEditor@master] Strip legacy section IDs from inside headings

https://gerrit.wikimedia.org/r/395831

@MaxSem The latest Google Chrome has itself started to encode IDs. Is there any news on this?

I wonder if that's a permanent change, or if it's just a temporary fix while they update their list of spoofable characters? Unfortunately, the details of their security bug(s) are still embargoed, so it doesn't look like I can find out more details yet.

I wonder if the technology here (invisible <span> tags) could also help with T160952: Generate Language-converted section anchors/ids? Strawman: $wgFragmentMode could include one or more language variant codes along with the special 'html5' and 'legacy' values; similarly, a setting like $wgExternalInterwikiFragmentMode (but for internal wikilinks) could select whether the "canonical" ID (matching the source text) or a language-converted variant is preferred for a wikilink.

See https://bugs.chromium.org/p/chromium/issues/detail?id=789163. A change in Google Chrome's behaviour is possible.

Thanks for following up with the Chrome team on that one! The relevant patch looks to have been merged in Chrome a few hours ago, and (recapping from the link above) it's expected to be shipped in Chrome "65 or even 64", which will be the stable branch of Chrome the week of Jan 23rd, 2018 or Mar 6th, 2018.

Fito added a subscriber: Fito. Dec 15 2017, 8:18 PM

Change 405391 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Set default fragment mode to [ 'legacy', 'html5' ]

https://gerrit.wikimedia.org/r/405391