
Enforce whitespace normalization in reference names much earlier
Open, Needs Triage, Public

Description

Legacy

Cite has a very old quirk: the name="…" of a <ref> is not necessarily unique. This comes from how spaces and/or underscores are normalized in the legacy parser. We started documenting this unintentional behavior better in https://gerrit.wikimedia.org/r/1139785.

As of now, <ref name="a b"> and <ref name="a__b"> are two different references, but both become the identical #cite_note-a_b-… in the browser's address bar when you jump back and forth between the marker in the text and the footnote at the bottom of the document. To work around this problem, the legacy Cite code had to invent and expose an additional unique number (now called "global id" in the code).

The fundamental reason for this odd behavior is that legacy Cite abuses the wikitext parser in unexpected ways. The system messages cite_reference_link, cite_references_link_one, and cite_references_link_many_format contain wikitext snippets like [[#$1]]. Cite uses this syntax for something it was not designed for: the [[#…]] wikitext syntax is meant to link to headlines in the text. To make such links work more reliably, the target anchors for headlines get additional normalization: sequences of whitespace and underscores are normalized to a single underscore.
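As a rough illustration of the normalization described above, here is a minimal Python sketch. The function name normalize_ref_name is hypothetical and is not part of Cite or MediaWiki. It models only the one rule stated here (runs of whitespace and underscores collapse to a single underscore) and none of the other steps the real anchor handling may perform.

```python
import re


def normalize_ref_name(name: str) -> str:
    """Collapse every run of whitespace and/or underscores into one underscore.

    Hypothetical helper for illustration only; it is not the actual
    MediaWiki or Cite implementation.
    """
    return re.sub(r"[\s_]+", "_", name)


# Both spellings from the example end up with the same normalized form.
print(normalize_ref_name("a b"))   # a_b
print(normalize_ref_name("a__b"))  # a_b
```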

Cite technically doesn't need this normalization but can't disable it because it is a hard-coded feature of the [[#…]] syntax. The only thing Cite can do is make sure the same normalization happens on both ends, so that the wikitext [[#…]] and the HTML id="…" match. However, this was unaccounted for and broken for years because the original authors of legacy Cite were not aware of it. The TechWish team fixed this much later, see T352179 and related tasks.

Parsoid

Parsoid doesn't use the mentioned messages and would – in theory – not have the same problem. But Parsoid needed to generate the same HTML and therefore had to re-implement the behavior.

The proposal

We would like to enforce the whitespace/underscore normalization much earlier (in both parsers) so that <ref name="a b"> and <ref name="a__b"> become the same reference and there is no longer any discrepancy with the (normalized) HTML ids.
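The difference between the current and the proposed behavior can be sketched as follows. This is a simplified Python model, not the Cite code: group_refs and its normalize_early flag are hypothetical names, and the normalization is reduced to the single rule that runs of whitespace and underscores become one underscore.

```python
import re


def normalize(name: str) -> str:
    # Assumed rule: runs of whitespace/underscores become one underscore.
    return re.sub(r"[\s_]+", "_", name)


def group_refs(names, normalize_early):
    """Group ref names by the key used to decide whether two refs are the same."""
    groups = {}
    for name in names:
        key = normalize(name) if normalize_early else name
        groups.setdefault(key, []).append(name)
    return groups


names = ["a b", "a__b", "a b"]

# Current behavior: the raw name is the key, so there are two references,
# yet both normalize to the same anchor fragment.
current = group_refs(names, normalize_early=False)
print(len(current), {normalize(key) for key in current})  # 2 {'a_b'}

# Proposed behavior: normalizing first yields a single reference.
proposed = group_refs(names, normalize_early=True)
print(len(proposed), list(proposed))  # 1 ['a_b']
```

With the raw name as key, something extra (the global id) is needed to tell the two references apart in the anchor. With the normalized name as key, the name alone identifies the reference.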

Effects
  • Names are unique and the additional global id is not needed any more.
    • This allows us to do T406858, which is a requirement for several fundamental refactoring efforts.
    • The behavior of the code is more predictable and the logic more consistent between the two parsers.
  • We believe this is also beneficial for users:
    • Just from looking at names like "Meyer_2011" vs. "Meyer 2011" we think it makes much more sense to make them behave as one reference instead of two.
    • It should be extremely rare that users ever run into this. One would need to have two refs with almost identical names on one page. Such a situation would already be confusing, as shown in the example above. We tried to use GlobalSearch to find real-world examples but have not found any so far.
    • Even if such pages exist, the problem would be easy to fix, and fixing it would generally improve the quality of the article. The moment we make this change, refs with conflicting names get merged, but are marked with an error when they have different content. This is easily visible in the article and fixable by giving the two refs clearly distinct names. Most communities actively track such errors and fix them in a timely manner.
    • Users are aware of the whitespace/underscore normalization from how headlines work.
    • The anchors in the browser's address bar are more compact and more predictable when they are shown as e.g. a_b instead of a%20b.
    • We don't need to expose the internal global id any more in anchors and the browser's address bar (at least for named refs).
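The merge-and-flag behavior mentioned in the list above (refs with conflicting names get merged, but are marked with an error when their content differs) could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the Ref class, the merge_refs function, the error text, and the treatment of an empty body as a plain reuse are hypothetical and do not reflect the actual Cite data structures or messages.

```python
import re
from dataclasses import dataclass
from typing import Optional


def normalize(name: str) -> str:
    # Assumed rule: runs of whitespace/underscores become one underscore.
    return re.sub(r"[\s_]+", "_", name)


@dataclass
class Ref:
    name: str                    # normalized name, unique per page
    content: str                 # footnote text; "" means "not defined yet"
    error: Optional[str] = None  # set when two definitions disagree


def merge_refs(refs):
    """Merge (raw_name, content) pairs by normalized name and flag conflicts."""
    merged = {}
    for raw_name, content in refs:
        key = normalize(raw_name)
        existing = merged.get(key)
        if existing is None:
            merged[key] = Ref(key, content)
        elif not existing.content:
            existing.content = content  # an earlier reuse gets its definition
        elif content and content != existing.content:
            existing.error = "conflicting content"  # hypothetical error text
    return merged


merged = merge_refs([
    ("Meyer 2011", "Meyer, Book A, 2011"),
    ("Meyer_2011", "Meyer, Book B, 2011"),  # same name after normalization
    ("Smith", "Smith, 2020"),
    ("Smith", ""),                          # plain reuse, no conflict
])
for ref in merged.values():
    print(ref.name, ref.error)
```

In this model, an editor would resolve the flagged entry by giving the two Meyer refs distinct names.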