Migrate to HTML5 section ids
Closed, ResolvedPublic

Description

Current PHP parser section ids are derived from heading content using an old, ad-hoc escaping scheme invented by yours truly to satisfy the unreasonable demands of XHTML4. This scheme percent-encodes the section id, and then replaces percent signs with a full stop (.). For example, the apostrophe in "can't" becomes .27, giving "can.27t".
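
A minimal Python sketch of the legacy scheme described above, for illustration only. The function name legacy_section_id is hypothetical, and the space-to-underscore step and the exact set of characters left unescaped are assumptions; the real parser code may differ in details.

```python
from urllib.parse import quote


def legacy_section_id(heading: str) -> str:
    """Approximate the old-style section id for a heading (illustrative only)."""
    # Assumption: spaces become underscores before escaping.
    with_underscores = heading.replace(" ", "_")
    # Percent-encode the UTF-8 bytes of every character except
    # ASCII letters, digits and "_.-~" (what urllib leaves untouched).
    percent_encoded = quote(with_underscores, safe="")
    # Replace each percent sign with a full stop.
    return percent_encoded.replace("%", ".")


print(legacy_section_id("can't be reached"))  # can.27t_be_reached
print(legacy_section_id("-{foo}-"))           # -.7Bfoo.7D-
```

Non-ASCII headings are where this hurts most: every byte of the UTF-8 encoding turns into a three-character .XX escape.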

The purpose of making this change is to satisfy the #3 wish on this year's Community Wishlist:
"Non-Latin section headings are displayed terribly in URL anchors and can't be reached directly":
https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Miscellaneous#Non-Latin_section_headings_are_displayed_terribly_in_URL_anchors_and_can.27t_be_reached_directly

HTML5 allows ids to contain arbitrary UTF-8 text, with only minimal escaping (of the hash sign in particular). Switching to HTML5 section anchors would have several advantages:

  • More readable section links, especially in non-ASCII languages.
  • Simplified editing clients without a need to implement a legacy algorithm for deriving section ids from headings.
  • Simplified / more accurate DOM spec documentation.

Disadvantages include:

  • Break incoming links to sections from the internet.
  • Potentially break internal links to sections.

Migration options

Client-side fall-back

Client-side JS can look at the URL hash, and check if the section was found. If it was not found, it can encode headings using the old-style escaping algorithm, and check if the hash matches any of those. If it does, rewrite the hash to the matching new-style section.

As a result, users would be encouraged to update links to use new-style section ids. Existing links (both internal and external) would continue to work as long as the fall-back JS is active.
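
The matching logic of this fall-back can be sketched as follows. This is Python rather than browser JS and models only the lookup, not the DOM or the hash rewrite itself. All function names are hypothetical, and both id schemes are simplified assumptions (spaces become underscores; legacy ids percent-encode and then swap % for a full stop).

```python
from typing import Iterable, Optional
from urllib.parse import quote


def html5_id(heading: str) -> str:
    # Assumption: the new-style id is the heading with spaces as underscores.
    return heading.replace(" ", "_")


def legacy_id(heading: str) -> str:
    # Assumption: old-style ids percent-encode, then replace "%" with ".".
    return quote(heading.replace(" ", "_"), safe="").replace("%", ".")


def resolve_fragment(fragment: str, headings: Iterable[str]) -> Optional[str]:
    """Return the new-style id the fragment should point at, or None."""
    headings = list(headings)
    # Step 1: the fragment already matches a new-style id, nothing to do.
    if any(html5_id(h) == fragment for h in headings):
        return fragment
    # Step 2: re-encode each heading the old way and compare.
    for h in headings:
        if legacy_id(h) == fragment:
            return html5_id(h)  # the hash would be rewritten to this value
    return None


sections = ["History", "The rise of Motörhead"]
print(resolve_fragment("The_rise_of_Mot.C3.B6rhead", sections))
```

Encoding the page's own headings and comparing avoids having to decode the incoming fragment, which sidesteps ambiguous inputs.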

Automatic migration of internal references

Idea: Recognize the old-style escape pattern (/\.[0-9A-F]{2}/), rewrite each matching full stop to %, then percent-decode the result.

Issues:

  • Chance of false positives and conversion failures.
  • Only helps with internal links.
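
A rough Python sketch of this rewrite, which also shows both issues listed above. The function name migrate_fragment is hypothetical, and falling back to the unchanged fragment when decoding fails is an assumption about how a conversion failure might be handled.

```python
import re
from urllib.parse import unquote

# The old-style escape pattern from the idea above.
LEGACY_ESCAPE = re.compile(r"\.([0-9A-F]{2})")


def migrate_fragment(fragment: str) -> str:
    """Try to convert an old-style fragment to its decoded form."""
    if not LEGACY_ESCAPE.search(fragment):
        return fragment  # nothing that looks like a legacy escape
    # Rewrite ".XX" to "%XX" ...
    candidate = LEGACY_ESCAPE.sub(r"%\1", fragment)
    try:
        # ... then percent-decode, insisting on valid UTF-8.
        return unquote(candidate, errors="strict")
    except UnicodeDecodeError:
        # Conversion failure: the bytes are not valid UTF-8, keep the input.
        return fragment


print(migrate_fragment("can.27t"))       # can't
print(migrate_fragment("Release_2.5B"))  # Release_2[  (false positive)
print(migrate_fragment("x.FF"))          # x.FF        (conversion failure)
```

The second call is the false-positive case: a heading that legitimately contains ".5B" is indistinguishable from an escaped "[".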

Add old style & new style section ids for a while

Pros:

  • Keeps links working during transition period

Cons:

  • Complicates HTML
  • Does not encourage a migration / surface correct section ID
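
As a rough illustration of this option, the Python sketch below renders a heading with the new-style id on the visible headline and the old-style id on an empty span in front of it, loosely following the shape of the parser output quoted later in this task. The function names are hypothetical, the id schemes are simplified, and omitting the extra span when both ids are identical is an assumption.

```python
from html import escape
from urllib.parse import quote


def html5_id(heading: str) -> str:
    # Assumption: new-style id is the heading with spaces as underscores.
    return heading.replace(" ", "_")


def legacy_id(heading: str) -> str:
    # Assumption: old-style id percent-encodes, then swaps "%" for ".".
    return quote(heading.replace(" ", "_"), safe="").replace("%", ".")


def render_heading(heading: str, level: int = 2) -> str:
    """Render a heading carrying both the new-style and old-style ids."""
    new_id = html5_id(heading)
    old_id = legacy_id(heading)
    # Emit the invisible fallback anchor only when the two ids differ.
    fallback = f'<span id="{escape(old_id)}"></span>' if old_id != new_id else ""
    headline = (
        f'<span class="mw-headline" id="{escape(new_id)}">'
        f"{escape(heading)}</span>"
    )
    return f"<h{level}>{fallback}{headline}</h{level}>"


print(render_heading("-{foo}-"))
```

The empty span is what "complicates HTML" refers to: every heading with characters that the legacy scheme escapes gains an extra element.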

Change 377659 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] [WIP] Update Parsoid to generate modern IDs w/ legacy fallback

https://gerrit.wikimedia.org/r/377659

Change 378357 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Start migration to Unicode sections everywhere

https://gerrit.wikimedia.org/r/378357

Seb35 added a subscriber: Seb35. Sep 17 2017, 11:16 AM

I was curious to see the result and didn’t find live examples with MediaWiki, so I created one on https://test.wikipedia.org/wiki/HTML_5_section_IDs.

As a French speaker, thanks a lot for this feature!

Change 378357 merged by jenkins-bot:
[operations/mediawiki-config@master] Start migration to Unicode sections everywhere

https://gerrit.wikimedia.org/r/378357

There seems to be an interaction with LanguageConverter when the new HTML5 ids end up containing -{...}- markup.

Previously we could rely on the fact that LanguageConverter markup was *not* expanded when creating the section IDs, i.e.:

$ (echo '==-{foo}-==' ) | php maintenance/parse.php --quiet
<div class="mw-parser-output"><h2><span id="-.7Bfoo.7D-"></span><span class="mw-headline" id="-{foo}-">-{foo}-</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/~cananian/mediawiki/index.php?title=CLIParser&amp;veaction=edit&amp;section=1" class="mw-editsection-visualeditor" title="Edit section: -{foo}-">edit</a><span class="mw-editsection-divider"> | </span><a href="/~cananian/mediawiki/index.php?title=CLIParser&amp;action=edit&amp;section=1" title="Edit section: -{foo}-">edit source</a><span class="mw-editsection-bracket">]</span></span></h2>
</div>

But in some situations the id="-{foo}-" is getting fed back to language converter again, resulting in the ID being munged: T176176: HTML5 ids seems to change how wikilink fragments are parsed (when LanguageConverter is enabled)

Change 383473 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Switch test wikis to HTML5 fragment mode in links

https://gerrit.wikimedia.org/r/383473

Change 383473 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch test wikis to HTML5 fragment mode in links

https://gerrit.wikimedia.org/r/383473

Mentioned in SAL (#wikimedia-operations) [2017-10-12T23:13:12Z] <dereckson@tin> Synchronized wmf-config/InitialiseSettings.php: Switch test wikis to HTML5 fragment mode in links (T152540) (duration: 00m 47s)

@MaxSem: Any bug reports from the test roll-out? Seems to work pretty well: https://test.wikipedia.org/wiki/Motörhead#The_rise_of_Motörhead.

When should we expect a roll-out to Russian Wikipedia?

Today was all packed, but I grabbed a window tomorrow.

ssastry moved this task from Backlog to In Progress on the Parsoid board. Nov 6 2017, 6:51 PM

I see this was already deployed to ru-wiki, what about other languages? Is there a schedule?

As soon as we fix all the problems discovered on Russian projects.

@MaxSem I'm curious. Do you have any links explaining these problems?

Updated the dependency tree with these.

Change 394104 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394104

Change 394104 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394104

I didn't actually deploy the above change because the wikis are still running wmf.8; it'll have to wait.

Change 394460 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394460

Change 394460 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch all wikis to HTML5 section IDs

https://gerrit.wikimedia.org/r/394460

Change 377659 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Update Parsoid to generate modern HTML5 IDs w/ legacy fallback

https://gerrit.wikimedia.org/r/377659

Change 395831 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/extensions/VisualEditor@master] Strip legacy section IDs from inside headings

https://gerrit.wikimedia.org/r/395831

@MaxSem The latest version of Google Chrome has itself started to encode IDs. Is there any news on this?

I wonder if that's a permanent change, or if it's just a temporary fix while they update their list of spoofable characters? Unfortunately, the details of their security bug(s) are still embargoed, so it doesn't look like I can find out more details yet.

I wonder if the technology here (invisible <span> tags) could also help with T160952: Generate Language-converted section anchors/ids? Strawman: $wgFragmentMode could include one or more language variant codes along with the special 'html5' and 'legacy' values; likewise, a setting analogous to $wgExternalInterwikiFragmentMode (but for internal wikilinks) could select whether a wikilink uses the "canonical" ID (matching the source text) or prefers a language-converted variant.

See https://bugs.chromium.org/p/chromium/issues/detail?id=789163. A change in Google Chrome's behaviour is possible.

Thanks for following up with the Chrome team on that one! The relevant patch looks to have been merged in Chrome a few hours ago, and (recapping from the link above) it's expected to ship in Chrome "65 or even 64", which become the stable branch of Chrome the week of Mar 6th, 2018 and Jan 23rd, 2018 respectively.

Fito added a subscriber: Fito. Dec 15 2017, 8:18 PM

Change 405391 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Set default fragment mode to [ 'legacy', 'html5' ]

https://gerrit.wikimedia.org/r/405391

Change 405391 merged by jenkins-bot:
[mediawiki/core@master] Set default fragment mode to [ 'legacy', 'html5' ]

https://gerrit.wikimedia.org/r/405391

@kaldari @MaxSem Can we close this task now?

@Niharika: The migration isn't totally complete as we still want to phase out the legacy IDs at some point. Probably need to make a new subtask for that.

HTML5 ids are added everywhere. So, it appears this task itself is complete. Maybe create a separate task (vs subtask) to phase out legacy IDs?

cscott added a comment. Feb 1 2018, 9:53 PM

More followup: T186272: Convert all parser tests to html5 ids

Also, checking up on the chromium bug above: it turns out that the original fix for the regression was backed out, fixed, and then relanded. The fix appears in Chrome 65 (to be released March 6, 2018), so Chrome 63 (released Dec 5, 2017) and Chrome 64 display "broken" html5 anchors (sigh).

Tgr added a comment. Feb 2 2018, 1:34 AM

I know I'm a little late asking this, but is there any plan on how to deal with section IDs that could interfere with fragments used by code? There is some of that already (MobileFrontend, MediaViewer) and I'm sure there will be more in the future. Generally fragments handled by client-side code can be differentiated by starting with /.