Non-breaking space in header ID breaks anchor
Closed, Resolved · Public · 3 Story Points

Description

Header and link with spaces:

== Nbsp test ==
[[#Nbsp test]]
id="Nbsp_test"
href="#Nbsp_test"

Header and link with non-breaking spaces:

== Nbsp test ==
[[#Nbsp test]]
id="Nbsp.C2.A0test"
href="#Nbsp_test"

Links have replaced non-breaking spaces with ordinary spaces for a long time, since T17248. Headers need exactly the same replacement.
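
The mismatch above can be reproduced with a minimal sketch. It assumes the legacy anchor encoding is roughly "spaces to underscores, percent-encode the rest, then write `%` as `.`"; the helper names `legacy_id` and `normalize_nbsp` are hypothetical and this is not MediaWiki's actual code.

```python
from urllib.parse import quote

NBSP = "\u00a0"  # non-breaking space, UTF-8 bytes C2 A0


def legacy_id(text: str) -> str:
    """Approximate the legacy anchor encoding (an assumption, see lead-in)."""
    # Ordinary spaces become underscores; everything else that is not an
    # unreserved character is percent-encoded as UTF-8 bytes.
    encoded = quote(text.replace(" ", "_"), safe="")
    # The legacy format writes the escape character as '.' instead of '%'.
    return encoded.replace("%", ".")


def normalize_nbsp(text: str) -> str:
    """Replace non-breaking spaces with ordinary spaces, as links already do."""
    return text.replace(NBSP, " ")


heading = f"Nbsp{NBSP}test"
print(legacy_id(heading))                  # id generated without normalization
print(legacy_id(normalize_nbsp(heading)))  # id generated with normalization
```

Under these assumptions the first call yields `Nbsp.C2.A0test` and the second yields `Nbsp_test`, which is the value the link already points at.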

Event Timeline

putnik created this task. · Feb 26 2015, 6:37 PM
putnik raised the priority of this task to Needs Triage.
putnik updated the task description.
putnik added a project: Parsoid.
putnik added a subscriber: putnik.
Restricted Application added a subscriber: Aklapper. · Feb 26 2015, 6:37 PM
putnik updated the task description. · Feb 26 2015, 6:42 PM
putnik set Security to None.
ssastry triaged this task as Medium priority. · Mar 3 2015, 8:52 PM
Vort added a subscriber: Vort. · Oct 2 2016, 12:28 PM

This problem occurs regularly on Russian Wikipedia, which makes heavy use of non-breaking spaces before dashes and other entities. Many topics on the technical forum cover it, so it should be fixed.

Arbnos added a subscriber: Arbnos. · Oct 8 2016, 1:21 PM
cscott added a subscriber: cscott. · Nov 1 2017, 7:46 PM

This change would have to be made in both Parsoid and PHP core. We can leave the legacy IDs alone; we should probably only normalize the HTML5 ids. With $wgFragmentMode=['html5','legacy'];:

$ (echo '==Foo bar==' ; echo '[[#Foo bar]] [[#Ж]]' ) | php maintenance/parse.php
<div class="mw-parser-output"><h2><span id="Foo.C2.A0bar"></span><span class="mw-headline" id="Foo bar">Foo&#160;bar</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/~cananian/mediawiki/index.php?title=CLIParser&amp;veaction=edit&amp;section=1" class="mw-editsection-visualeditor" title="Edit section: Foo bar">edit</a><span class="mw-editsection-divider"> | </span><a href="/~cananian/mediawiki/index.php?title=CLIParser&amp;action=edit&amp;section=1" title="Edit section: Foo bar">edit source</a><span class="mw-editsection-bracket">]</span></span></h2>
<p><a href="#Foo_bar">#Foo&#160;bar</a> <a href="#Ж">#Ж</a>

Note that there's an underscore substitution happening in the [[#...]] processing, but it's not happening in the generated HTML5 id. (The thing that looks like a space in the HTML5 id is actually the non-breaking space &#160;.)
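
A minimal sketch of that inconsistency, assuming a simplified model in which the HTML5 id only turns ordinary spaces into underscores while link processing first converts non-breaking spaces to ordinary ones. The function names `html5_id` and `link_fragment` are hypothetical, not MediaWiki or Parsoid APIs.

```python
NBSP = "\u00a0"  # non-breaking space


def html5_id(heading: str) -> str:
    # Assumed simplification: HTML5 ids keep non-ASCII characters as-is and
    # only replace ordinary spaces, so a non-breaking space passes through.
    return heading.replace(" ", "_")


def link_fragment(target: str) -> str:
    # Link processing normalizes nbsp to an ordinary space before the
    # space-to-underscore substitution.
    return target.replace(NBSP, " ").replace(" ", "_")


heading = f"Foo{NBSP}bar"
anchor = html5_id(heading)
href = link_fragment(heading)

# repr() makes the otherwise invisible non-breaking space visible.
print(repr(anchor), repr(href), anchor == href)

# Applying the link normalization to the heading too makes the two agree.
print(link_fragment(heading) == html5_id(heading.replace(NBSP, " ")))
```

In this model the id and the href differ until the heading gets the same normalization as the link, which is the change proposed for both PHP core and Parsoid.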

cscott added a subscriber: MaxSem. · Nov 1 2017, 7:48 PM

Added @MaxSem, as the above inconsistency in links and headings will make html5 ids break.

kaldari set the point value for this task to 3. · Nov 1 2017, 10:42 PM

Change 388365 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Remove nbsp and similar characters from section IDs

https://gerrit.wikimedia.org/r/388365

MaxSem claimed this task. · Nov 3 2017, 2:36 AM

Change 388365 merged by jenkins-bot:
[mediawiki/core@master] Remove nbsp and similar characters from section IDs

https://gerrit.wikimedia.org/r/388365

kaldari closed this task as Resolved. · Nov 8 2017, 12:10 AM
kaldari moved this task from Needs Review/Feedback to Q1 2018-19 on the Community-Tech-Sprint board.
DannyH moved this task from Estimated to Archive on the Community-Tech board. · Dec 19 2017, 1:13 AM

Change 416890 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Remove &nbsp; and similar characters from section IDs

https://gerrit.wikimedia.org/r/416890

Change 416890 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Remove &nbsp; and similar characters from section IDs

https://gerrit.wikimedia.org/r/416890

Pols12 added a subscriber: Pols12. · Apr 25 2019, 6:49 PM

"We can leave the legacy IDs alone; we should probably only normalize the HTML5 ids."

On fr.wp, « X » is automatically replaced with «&nbsp;X&nbsp;».

Legacy IDs seem to have been turned from .C2.AB.C2.A0X.C2.A0.C2.BB into .C2.AB_X_.C2.BB. Can you confirm that? Is that reversible?

@Pols12 that seems like it could be an interaction with French space armoring (T197902).

I can confirm that in legacy mode the .C2.A0 are being converted to underscore:

$ (echo '« X »' ; echo; echo '[[#« X »]]' ; echo; echo '==« X »==') | php maintenance/parse.php 
parse.php: warning: reading wikitext from STDIN. Press CTRL+D to parse.

<p>«&#160;X&#160;»
</p><p><a href="#.C2.AB_X_.C2.BB">#«&#160;X&#160;»</a>
</p>
<h2><span class="mw-headline" id=".C2.AB_X_.C2.BB">«&#160;X&#160;»</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/~cananian/mediawiki/index.php?title=CLIParser&amp;veaction=edit&amp;section=1" class="mw-editsection-visualeditor" title="Edit section: « X »">edit</a><span class="mw-editsection-divider"> | </span><a href="/~cananian/mediawiki/index.php?title=CLIParser&amp;action=edit&amp;section=1" title="Edit section: « X »">edit source</a><span class="mw-editsection-bracket">]</span></span></h2>

But if I am understanding correctly, prior to this patch we would generate id=".C2.AB.C2.A0..." but href=".C2.AB_.... That is, the tag ids didn't match the href attributes and all the links were broken. This patch fixed the situation, by changing the ids to match the hrefs we'd been generating all along.
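
One way to check for this kind of breakage is to collect every `id` and every same-page `href` from the rendered HTML and report the fragments that have no matching id. The sketch below uses Python's standard `html.parser`; the `before` string is reconstructed from the description in this thread rather than captured parser output, and the comparison is on raw strings with no percent-decoding, which is an assumption that suits legacy-style ids.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect element ids and same-page link fragments from an HTML string."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.fragments = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "id":
                self.ids.add(value)
            elif tag == "a" and name == "href" and value.startswith("#"):
                self.fragments.add(value[1:])


def dangling_fragments(html: str) -> set:
    """Return link fragments that do not match any id in the document."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.fragments - collector.ids


# Reconstructed illustration: id still contains the encoded nbsp (.C2.A0).
before = ('<a href="#.C2.AB_X_.C2.BB">x</a>'
          '<h2><span id=".C2.AB.C2.A0X.C2.A0.C2.BB">x</span></h2>')
# After the patch: id uses underscores, matching the href.
after = ('<a href="#.C2.AB_X_.C2.BB">x</a>'
         '<h2><span id=".C2.AB_X_.C2.BB">x</span></h2>')

print(dangling_fragments(before))
print(dangling_fragments(after))
```

With these inputs the first call reports the one fragment that points nowhere and the second reports none.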

OK, thanks!

The link I saw was created by a gadget that reads the ID directly from the source code, which let it generate correct URLs. But the gadget is currently used by only 2025 users (242 of them active), so I don't think there are many issues of this kind.