Page MenuHomePhabricator

Broken links between marker and footnote when name contains certain HTML entities
Closed, ResolvedPublicBUG REPORT

Description

The label (e.g. "[9]") of a <ref> tag normally (on en.wiki anyway) if clicked will scroll to the reference text, and if hovered over will either pop up a box with the reference text (if the reference is offscreen) or highlight the reference text (if the reference is onscreen). However if the <ref> tag has a name parameter with any of various character sequences, such as %20, then none of these responses will happen. This is a problem for a reader wanting to see the reference information.

Known problem sequences

Some of these are more frequent than you might expect. There likely are other character sequences that fail in the same way. For example, the %c8%98 represents a Latin letter with diacritical marks, and there are lots of such letters. The basic problem seems to be that the html generated uses different identifiers for the link to the reference and the target. Apparently the name value is used to build the identifier, but the identifier is built in two different ways, one for the link and one for the target. Seems like a sure recipe for bugs.

Note that if one has accidentally used such a name value for a ref tag, it will not be apparent that there is a problem. No one goes around hovering over every reference label in an article. Also, some automated or "semi-automated" edits may change a ref name in a way that starts the ref label failing in this way. All the more reason the software should generate the html identifier in one place and not two so that mismatches won't occur.

List of steps to reproduce:

  • Add lines like these to a page:
The mouse.<ref name="A&nbsp;- Z"> Animals A to Z </ref>
== References ==
* <references />
  • Save the page
  • Hover the cursor over the reference label which follows "The mouse.".

What happens?:

  • Nothing

What should have happened instead?:

  • The reference line with "Animals A to Z" should be highlighted (or a popup with that text shown).

Event Timeline

Aklapper removed a subscriber: Cite.

Cite itself shows no popups. On English Wikipedia that is the "Reference Tooltips" gadget; see https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-gadgets . The problem also happens with the "Reference Previews" extension from https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures .
The problem does not happen when using "Navigation popups" on https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-gadgets

Aklapper renamed this task from Dead ref label (no link, popup or highlight) depending on ref name to No reference popup shown when ref name contains certain HTML entities.Dec 25 2021, 11:40 AM

"Navigation popups" has problems with this also, but in its own way. When hovering over the ref label (in the %2E case, at least), it does give a popup, but the popup does not contain the normal contents it is supposed to deliver, namely the text displayed in the reference note. (Instead it just shows the plain popup for the page, which is the text of the intro.) You can see this by hovering over the %2E example label [8], contrasted with hovering over [7], which does not have problem characters.

thiemowmde subscribed.

TL;DR: This bug is caused by a change in core, but needs to be fixed in the Cite codebase. It affects the Reference Previews feature of the Popups extension and all other kinds of reference tooltips, but can't be fixed in these codebases.

Here is a minimized version of the examples from the task description:

<ref name="nature%20phylo">A</ref>
<ref name="Mininova%2E26%2E11%2E2009">B</ref>
<ref name="%c8%98tiri_2019">C</ref>
<ref name="play%21">D</ref>
<ref name="Terry+O%26rsquo%3BN…</ref">E</ref>
<ref name="The Bill moves to 9&nbsp;pm">F</ref>
<references />

Cite generates bogus HTML for these examples. For example, href="#cite_note-nature_phylo-1" and id="cite_note-nature%20phylo-1" don't match,

Cite is hardly to blame. There is only a single method that generates this id. It's used in two places. Both are wikitext, one <li id="…"> and one [[#…]]. Both wikitext snippets are parsed via Parser::recursiveTagParse.

The problem is that the parser behaves very different depending on the $wgFragmentMode, see https://www.mediawiki.org/wiki/Manual:$wgFragmentMode. The fragment in [[#…]] is url-decoded (this happens in Parser::handleInternalLinks2), but the id in <li id="…"> is not. The $wgFragmentMode default changed with MediaWiki 1.37, just a few months ago in September 2021, see https://gerrit.wikimedia.org/r/720427, tracked via T186267.

Note this issue is not related to anything the WMDE-TechWish team did, even if it affects one of our projects. Anybody should feel free to pick this up. Just let us know.

Warning: Simply url-decoding the input doesn't work when it contains something like %2522 that appears double-encoded. Decoding this to %22 to early would still trigger the same issue.

Change 751119 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Cite@master] Add parser & unit test cases for different $wgFragmentMode's

https://gerrit.wikimedia.org/r/751119

Change 751131 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Cite@master] [POC] Use different id/fragment encoding

https://gerrit.wikimedia.org/r/751131

Change 751149 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Cite@master] Use correct Sanitizer method for id/fragment escaping

https://gerrit.wikimedia.org/r/751149

Change 751119 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Add parser & unit test cases for different $wgFragmentMode's

https://gerrit.wikimedia.org/r/751119

Change 751149 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Use correct Sanitizer method for id/fragment escaping

https://gerrit.wikimedia.org/r/751149

Change 977091 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Cite@master] Correctly encode non-breaking spaces in reference names

https://gerrit.wikimedia.org/r/977091

We dug a bit deeper into this last week and found:

  • The remaining issue with non-breaking spaces might be surprisingly easy to fix, see https://gerrit.wikimedia.org/r/977091.
  • Turns out the web browsers require slightly different encodings in the href="#…" attribute vs. the target id="…".
    • The id attribute is pretty much plain text. Obviously " must be encoded as &quot;, and any other character can be encoded as an &…; HTML entity as well. But that's normal HTML behavior and mostly unrelated. What matters is that % don't have a special meaning in the id.
    • This is different in the href where % do have a special meaning. The browser decodes %-sequences in the href once and only then tries to find the matching target id.
  • This can be extra confusing because not every % does have a special meaning. Only % that are followed by a valid hex sequence.
  • The Cite code generates the id once and uses it in both places, the href an the target id. What's notable is that Cite doesn't do any encoding. Instead it uses the MediaWiki parser to parse the two wikitext snippets [[#…]] and <li id="…">.
  • The behavior of the wikitext parser is quite different when it comes to %-encoded entities. This is not wrong. It's just not the same behavior as in a pure HTML <a> tag.
    • Again, an id attribute is plain text.
    • MediaWiki tries to actively decode some %-encoded characters in a [[…]] wikitext link. The relevant code seems to be a self::normalizeUrlComponent() call in Parser.php. Ideally we can ignore this. The well-intended purpose of this code is to decode %-sequences that are irrelevant for the browser anyway and don't make a difference. The question is: Is the assumption MediaWiki makes here correct?
  • We probably need to add an extra urlencode somewhere to fight one or both MediaWiki's as well as the browser's %-decoding.

Change 977678 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Cite@master] [POC] Skip URL encoding in id="…" attributes that aren't URLs

https://gerrit.wikimedia.org/r/977678

thiemowmde renamed this task from No reference popup shown when ref name contains certain HTML entities to Broken links between marker and footnote when name contains certain HTML entities.Nov 28 2023, 2:26 PM

Change 978642 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Cite@master] Split off separate key normalization function

https://gerrit.wikimedia.org/r/978642

Change 979672 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Cite@master] Re-arrange code in preparation for T298278

https://gerrit.wikimedia.org/r/979672

Change 978642 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Split off separate key normalization function

https://gerrit.wikimedia.org/r/978642

Change 751131 abandoned by Thiemo Kreuz (WMDE):

[mediawiki/extensions/Cite@master] [POC] Use different id/fragment encoding

Reason:

This hack shouldn't be needed any more with I5a64ac4.

https://gerrit.wikimedia.org/r/751131

Change 979672 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Re-arrange code in preparation for T298278

https://gerrit.wikimedia.org/r/979672

Change 977091 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Correctly encode non-breaking spaces in reference names

https://gerrit.wikimedia.org/r/977091

Change 977678 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Skip URL encoding in id="…" attributes that aren't URLs

https://gerrit.wikimedia.org/r/977678

Change 983445 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/services/parsoid@master] Use correct Sanitizer method for id/fragment escaping

https://gerrit.wikimedia.org/r/983445

Change 983445 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Use correct Sanitizer method for id/fragment escaping

https://gerrit.wikimedia.org/r/983445

WMDE-Fisch removed thiemowmde as the assignee of this task.

Change 985169 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a10

https://gerrit.wikimedia.org/r/985169

Change 985169 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a10

https://gerrit.wikimedia.org/r/985169