
MediaWiki fails to parse &#150; (en dash)
Closed, Declined (Public)

Description

Steps to reproduce: enter &#150; in the wikitext of the page.
Expected result: an en dash (–) is displayed.
Actual result: &#150; is displayed (on the HTML level, & gets encoded into &amp;).

The relevant spec is "consume a character reference" in HTML5, which has a compatibility table for frequently used character codes that do not correspond to the intended Unicode code points. (En dash is U+2013, so the straightforward representation is &#8211;, but the HTML5 standard also acknowledges Windows-1250 codes for a number of characters which, when interpreted as Unicode, would result in unprintable characters: U+0096 is "start of guarded area".) MediaWiki should respect that compatibility table when decoding references.
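To illustrate the compatibility table, here is a minimal sketch using Python's standard `html.unescape`, which follows the HTML5 character reference rules. It is only a demonstration of the remapping described above, not MediaWiki code; the variable names are made up for the example.

```python
import html

# Numeric reference in the C1 control range (decimal 150 = 0x96).
legacy_reference = "&#150;"
# Numeric reference to the actual code point of the en dash (U+2013).
unicode_reference = "&#8211;"

# html.unescape applies the HTML5 rules, including the remapping
# of the 0x80-0x9F range to the printable characters of the legacy code page.
decoded_legacy = html.unescape(legacy_reference)
decoded_unicode = html.unescape(unicode_reference)

print(f"U+{ord(decoded_legacy):04X}")      # U+2013
print(decoded_legacy == decoded_unicode)   # True
```

An HTML5-conforming decoder therefore treats both references as the same en dash, which is the behaviour this task asks MediaWiki to match.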

Event Timeline

Tgr raised the priority of this task to Needs Triage.
Tgr updated the task description.
Tgr added a project: MediaWiki-Parser.
Tgr subscribed.

This behaviour is obviously caused by T106578.

It is implemented by the function validateCodepoint in rMW/includes/Sanitizer.php.
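For readers who do not want to open the PHP source, here is a rough Python sketch of a check of this kind. It is not the actual validateCodepoint code: the function names and the exact ranges are assumptions chosen for illustration. The point is only that a code point in the C1 control range is rejected, so the reference stays visible as text.

```python
import html
import re


def is_allowed_codepoint(codepoint: int) -> bool:
    """Hypothetical validity check; the ranges are assumed, not copied from Sanitizer.php."""
    return (
        codepoint in (0x09, 0x0A, 0x0D)        # tab, line feed, carriage return
        or 0x20 <= codepoint <= 0x7E           # printable ASCII
        or 0xA0 <= codepoint <= 0xD7FF         # skips DEL and the C1 controls (0x7F-0x9F)
        or 0xE000 <= codepoint <= 0xFFFD       # skips the surrogate range
        or 0x10000 <= codepoint <= 0x10FFFF    # supplementary planes
    )


def render_numeric_references(text: str) -> str:
    """Decode valid decimal references; escape invalid ones so they show up literally."""
    def replace(match: re.Match) -> str:
        codepoint = int(match.group(1))
        if is_allowed_codepoint(codepoint):
            return chr(codepoint)
        # Invalid reference: escape the ampersand, so "&#150;" becomes "&amp;#150;".
        return html.escape(match.group(0))

    return re.sub(r"&#(\d+);", replace, text)


print(render_numeric_references("a &#150; b"))   # a &amp;#150; b
print(render_numeric_references("a &#8211; b"))  # a – b
```

Under these assumptions, &#150; ends up as &amp;#150; in the HTML output, which matches the "actual result" in the task description.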

My personal view: undefined Unicode values, whether raw or as entities, should be removed from wiki project source texts rather than being supported and encouraged forever. German WP assumes CP1252, others CP1250; a Cyrillic author intended something entirely different, which was always displayed on Russian WP in the local code page, but this task would change its meaning.

German WP assumes CP1252, others CP1250

The HTML5 spec clearly specifies CP1250. We are not obliged to go with the standard, but doing so requires the least effort.

Our authors in Greece inserted such a code and saw it rendered as CP1253 for the last decade; colleagues in Moscow got CP1251 displayed, and someone in Istanbul was happy with the presentation as CP1254.

This has gone on for 15 years now, and our texts had become accustomed to it long before HTML 5.1 invented that new behaviour.

German WP is currently eliminating all occurrences; we regard them as CP1252.

The behaviour implemented by Tim is exactly how conversion should happen: an ambiguous entity is presented literally in the article, leaving it to the authors to replace it with valid UCS. It must not display a CP1250 Š when the author intended, and always saw, a Љ (8A/1251).
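The ambiguity can be reproduced with Python's built-in codecs. This is a small sketch with names made up for the example; it decodes the single byte 0x8A under three of the Windows code pages mentioned in this thread.

```python
# The same byte value, as it might arrive via copy and paste.
raw_byte = bytes([0x8A])

# Code pages mentioned in the discussion; all three define 0x8A.
code_pages = ["cp1250", "cp1251", "cp1252"]

# Map each code page name to the character it assigns to the byte.
decoded = {name: raw_byte.decode(name) for name in code_pages}

for name, char in decoded.items():
    print(f"{name}: {char} (U+{ord(char):04X})")
# cp1250: Š (U+0160)
# cp1251: Љ (U+0409)
# cp1252: Š (U+0160)
```

A fixed remapping to one code page silently picks the Š for every author, including the one who typed and saw a Љ.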

MediaWiki should not maintain dubious codes; this is not an HTML editor. If bad codes are inserted by c&p, they should attract attention immediately, not be silently converted from Vietnamese to East European.

Tgr claimed this task.

Fair point.

Thank you for withdrawing.

One more aspect: if I am sitting in Minsk and insert a string into the source text by c&p, I see a Љ in the edit area as expected. On preview I am looking at a Š. Quite spooky, and it remains a mystery even when re-editing the page. A &#138; displayed will give an IT freak a pretty good hint.

Greetings

Change 504621 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Ensure PHP and JS are consistent wrt allowed entities

https://gerrit.wikimedia.org/r/504621

Change 504621 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Ensure PHP and JS are consistent wrt allowed entities

https://gerrit.wikimedia.org/r/504621