Migrate to HTML5 section ids
Closed, Resolved (Public)

Description

Current PHP parser section ids are derived from heading content using an old, ad-hoc escaping scheme invented by yours truly to satisfy the unreasonable demands of XHTML4. This scheme percent-encodes the section id, and then replaces each percent sign with a full stop (.).
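
To make the old scheme concrete, here is a minimal Python sketch of it. It is an approximation, not the parser's actual code: the function name is hypothetical, the space-to-underscore step is an assumption inferred from the wishlist URL quoted below (`can.27t_be_reached_directly`), and the exact set of characters left unencoded may differ from what the PHP parser does.

```python
from urllib.parse import quote


def legacy_section_id(heading: str) -> str:
    """Approximate the old-style id: percent-encode, then swap '%' for '.'."""
    # Assumption: spaces become underscores before encoding.
    with_underscores = heading.strip().replace(" ", "_")
    # Percent-encode the UTF-8 bytes; safe="" so that "/" is encoded too.
    percent_encoded = quote(with_underscores, safe="")
    # The legacy scheme then replaces every percent sign with a full stop.
    return percent_encoded.replace("%", ".")


print(legacy_section_id("can't be reached directly"))
print(legacy_section_id("The rise of Motörhead"))
```

Under these assumptions the second heading becomes `The_rise_of_Mot.C3.B6rhead`, which illustrates why non-Latin headings turn into unreadable anchors.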

The purpose of making this change is to satisfy the #3 wish on this year's Community Wishlist:
"Non-Latin section headings are displayed terribly in URL anchors and can't be reached directly":
https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Miscellaneous#Non-Latin_section_headings_are_displayed_terribly_in_URL_anchors_and_can.27t_be_reached_directly

HTML5 supports UTF-8 ids with only minimal escaping (of the hash sign in particular). Switching to HTML5 section anchors would have several advantages:

  • More readable section links, especially in non-ASCII languages.
  • Simplified editing clients without a need to implement a legacy algorithm for deriving section ids from headings.
  • Simplified / more accurate DOM spec documentation.

Disadvantages include:

  • Breaks incoming links to sections from the internet.
  • Potentially breaks internal links to sections.

Migration options

Client-side fall-back

Client-side JS can look at the URL hash and check whether a matching section was found. If not, it can encode the page's headings using the old-style escaping algorithm and check whether the hash matches any of them. If it does, it can rewrite the hash to the matching new-style section id.

As a result, users would be encouraged to fix links to new-style section ids. Existing links (both internal and external) would continue to work as long as the fall-back JS is active.
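
The lookup logic of this fall-back can be sketched as a pure function. The sketch is written in Python for brevity, although the real fall-back would run as client-side JS. All names are hypothetical, and both id functions are simplified assumptions: the new-style id only replaces spaces with underscores, and the old-style id follows the percent-encode-then-full-stop description above.

```python
from typing import List, Optional
from urllib.parse import quote


def html5_id(heading: str) -> str:
    # Assumption: the new-style id only replaces spaces with underscores.
    return heading.strip().replace(" ", "_")


def legacy_id(heading: str) -> str:
    # Assumption: old-style id = percent-encoding with "%" replaced by ".".
    return quote(html5_id(heading), safe="").replace("%", ".")


def resolve_fragment(fragment: str, headings: List[str]) -> Optional[str]:
    """Return the new-style id the fragment should point at, or None."""
    # 1. The fragment already matches a new-style id: nothing to rewrite.
    if fragment in (html5_id(h) for h in headings):
        return fragment
    # 2. Otherwise re-encode each heading the old way and compare.
    for heading in headings:
        if legacy_id(heading) == fragment:
            return html5_id(heading)
    # 3. No match under either scheme.
    return None


headings = ["History", "The rise of Motörhead"]
print(resolve_fragment("The_rise_of_Mot.C3.B6rhead", headings))
```

Encoding the headings forward and comparing avoids having to decode the old-style fragment, which is ambiguous because a literal full stop and an escaped byte look the same.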

Automatic migration of internal references

Idea: Recognize old style escape pattern (/\.[0-9A-F]{2}/), and rewrite the full stop to %, then decode.

Issues:

  • Chance of false positives and conversion failures.
  • Only helps with internal links.
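
A minimal Python sketch of this idea, showing both a successful conversion and the two failure modes listed above. The function name is hypothetical, and falling back to the unchanged fragment on a decoding failure is an assumption of this sketch, not a decided behaviour.

```python
import re
from urllib.parse import unquote

# The old-style escape pattern from the idea above: a full stop
# followed by two uppercase hex digits.
LEGACY_ESCAPE = re.compile(r"\.([0-9A-F]{2})")


def migrate_legacy_fragment(fragment: str) -> str:
    """Rewrite '.XX' to '%XX' and percent-decode the result as UTF-8."""
    candidate = LEGACY_ESCAPE.sub(r"%\1", fragment)
    try:
        # errors="strict" makes invalid UTF-8 raise instead of being replaced.
        return unquote(candidate, errors="strict")
    except UnicodeDecodeError:
        # Conversion failure: keep the fragment unchanged (an assumption).
        return fragment


print(migrate_legacy_fragment("Mot.C3.B6rhead"))            # intended case
print(repr(migrate_legacy_fragment("Version_2.0A_notes")))  # false positive
print(migrate_legacy_fragment("Rate.C3"))                   # conversion failure
```

In the second call, the literal text `.0A` in a heading is mistaken for an escaped newline byte, so the rewritten link would no longer match the section. In the third, `.C3` decodes to an incomplete UTF-8 sequence and the fragment is left as-is.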

Add old style & new style section ids for a while

Pros:

  • Keeps links working during transition period

Cons:

  • Complicates HTML
  • Does not encourage a migration / surface correct section ID
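
One way to picture this option is a heading that carries the new-style id itself and an empty, invisible element carrying the old-style id. The Python sketch below is purely illustrative: the markup shape, the function names, and both id algorithms are assumptions, not the parser's actual output.

```python
from html import escape
from urllib.parse import quote


def html5_id(heading: str) -> str:
    # Assumption: the new-style id only replaces spaces with underscores.
    return heading.strip().replace(" ", "_")


def legacy_id(heading: str) -> str:
    # Assumption: old-style id = percent-encoding with "%" replaced by ".".
    return quote(html5_id(heading), safe="").replace("%", ".")


def render_heading(heading: str, level: int = 2) -> str:
    """Render a heading that is reachable via both id styles."""
    new_id, old_id = html5_id(heading), legacy_id(heading)
    # Only emit the extra anchor when the two ids actually differ.
    fallback = "" if old_id == new_id else f'<span id="{escape(old_id)}"></span>'
    return f'<h{level} id="{escape(new_id)}">{fallback}{escape(heading)}</h{level}>'


print(render_heading("History"))
print(render_heading("Motörhead"))
```

The extra element is the "complicates HTML" cost noted above; plain-ASCII headings such as "History" need no second id at all.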

Event Timeline

Change 383473 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Switch test wikis to HTML5 fragment mode in links

https://gerrit.wikimedia.org/r/383473

Change 383473 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch test wikis to HTML5 fragment mode in links

https://gerrit.wikimedia.org/r/383473

Mentioned in SAL (#wikimedia-operations) [2017-10-12T23:13:12Z] <dereckson@tin> Synchronized wmf-config/InitialiseSettings.php: Switch test wikis to HTML5 fragment mode in links (T152540) (duration: 00m 47s)

@MaxSem: Any bug reports from the test roll-out? Seems to work pretty well: https://test.wikipedia.org/wiki/Motörhead#The_rise_of_Motörhead.

When should we expect a roll-out to Russian Wikipedia?

Today was all packed, but I grabbed a window tomorrow.

I see this was already deployed to ru-wiki, what about other languages? Is there a schedule?

As soon as we fix all the problems discovered on Russian projects.

@MaxSem I'm curious. Do you have any links explaining these problems?

Updated the dependency tree with these.

Change 394104 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394104

Change 394104 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394104

I didn't actually deploy the above change because the wikis are still running wmf.8, it'll have to wait.

Change 394460 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394460

Change 394460 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394460

Change 377659 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Update Parsoid to generate modern HTML5 IDs w/ legacy fallback

https://gerrit.wikimedia.org/r/377659

Change 395831 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/extensions/VisualEditor@master] Strip legacy section IDs from inside headings

https://gerrit.wikimedia.org/r/395831

@MaxSem The latest version of Google Chrome has itself started to encode IDs. Is there any news on this?

I wonder if that's a permanent change, or if it's just a temporary fix while they update their list of spoofable characters? Unfortunately, the details of their security bug(s) are still embargoed, so it doesn't look like I can find out more details yet.

I wonder if the technology here (invisible <span> tags) could also help with T160952: Generate Language-converted section anchors/ids? Strawman: [$wgFragmentMode](https://www.mediawiki.org/wiki/Manual:$wgFragmentMode) could include one or more language variant codes along with the special 'html5' and 'legacy' values; similarly, a setting like [$wgExternalInterwikiFragmentMode](https://www.mediawiki.org/wiki/Manual:$wgExternalInterwikiFragmentMode) (but for internal wikilinks) could select whether a wikilink uses the "canonical" ID (matching the source text) or a language-converted variant.

See https://bugs.chromium.org/p/chromium/issues/detail?id=789163. A change in Google Chrome's behaviour is possible.

Thanks for following up with the Chrome team on that one! The relevant patch looks to have been merged in Chrome a few hours ago, and (recapping from the link above) it's expected to be shipped in Chrome "65 or even 64", which will be the stable branch of Chrome the week of Jan 23rd, 2018 or Mar 6th, 2018.

Change 405391 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Set default fragment mode to [ 'legacy', 'html5' ]

https://gerrit.wikimedia.org/r/405391

Change 405391 merged by jenkins-bot:
[mediawiki/core@master] Set default fragment mode to [ 'legacy', 'html5' ]

https://gerrit.wikimedia.org/r/405391

@Niharika: The migration isn't totally complete as we still want to phase out the legacy IDs at some point. Probably need to make a new subtask for that.

HTML5 ids are added everywhere. So, it appears this task itself is complete. Maybe create a separate task (vs subtask) to phase out legacy IDs?

More followup: T186272: Convert all parser tests to html5 ids

Also, checking up on the chromium bug above: it turns out that the original fix for the regression was backed out, fixed, and then relanded. The fix appears in Chrome 65 (to be released March 6, 2018), so Chrome 63 (released Dec 5, 2017) and Chrome 64 display "broken" html5 anchors (sigh).

I know I'm a little late asking this, but is there any plan on how to deal with section IDs that could interfere with fragments used by code? There is some of that already (MobileFrontend, MediaViewer) and I'm sure there will be more in the future. Generally fragments handled by client-side code can be differentiated by starting with /.

Change 548906 had a related patch set uploaded (by Nray; owner: Nray):
[mediawiki/extensions/MobileFrontend@master] Add support for non-ascii characters in OverlayManager#_matchRoute

https://gerrit.wikimedia.org/r/548906

Change 551304 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Set $wgFragmentMode to [ 'html5', 'legacy' ] by default

https://gerrit.wikimedia.org/r/551304

Change 551304 abandoned by Winston Sung:

[mediawiki/core@master] Set $wgFragmentMode to [ 'html5', 'legacy' ] by default

Reason:

Replaced-by: I8780bb589002a4f836ba90bd18093a56cddc3ddf

https://gerrit.wikimedia.org/r/551304

Change 551304 restored by Winston Sung:

[mediawiki/core@master] Set $wgFragmentMode to [ 'html5', 'legacy' ] by default

https://gerrit.wikimedia.org/r/551304

Change 551304 abandoned by Winston Sung:

[mediawiki/core@master] Set $wgFragmentMode to [ 'html5', 'legacy' ] by default

Reason:

Replaced-by: If6696fb33ef95cbd29c944b48588918e8077e9f9

https://gerrit.wikimedia.org/r/551304