
Interwiki from UTF-8 to Latin-1 wiki is broken for U+0161
Closed, Resolved (Public)

Description

There is an article called [[Beneš decrees]] on the English Wikipedia. Its URL is
http://en.wikipedia.org/wiki/Bene%9A_decrees. The Czech Wikipedia contains the
article called [[Benešovy dekrety]]; I have tried to insert an interwiki link to
en: using [[en:Beneš decrees]]. The link pointed to
http://en.wikipedia.org/wiki/Bene%C5%A1_decrees,
which is IMHO a proper UTF-8 encoding of the title, which should be decoded on
the en: side. But, it is not -- instead, the link goes to [[Bene]]!
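
The two URL forms above can be reproduced with a short Python sketch. It only illustrates the byte-level difference between the encodings; it is not MediaWiki code, and the variable names are made up for this example.

```python
from urllib.parse import quote

# Title as it would appear in a URL (MediaWiki uses "_" for spaces).
title = "Bene\u0161_decrees"  # U+0161 is LATIN SMALL LETTER S WITH CARON

# UTF-8 encodes U+0161 as the two bytes C5 A1.
utf8_url = quote(title, encoding="utf-8")

# windows-1252 encodes U+0161 as the single byte 9A.
cp1252_url = quote(title, encoding="cp1252")

# Strict ISO 8859-1 has no code point for U+0161 at all.
try:
    title.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(utf8_url)    # Bene%C5%A1_decrees
print(cp1252_url)  # Bene%9A_decrees
print(latin1_ok)   # False
```

The last result is the core of the problem: a wiki that converts incoming titles to strict ISO 8859-1 has nothing to map U+0161 to.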

Afterwards, I checked de:, which uses [[en:Bene<U+009A> decrees]], where the
control character U+009A (SINGLE CHARACTER INTRODUCER) is literally (!) included
in place of š; this gets encoded to
http://en.wikipedia.org/wiki/Bene%C2%9A_decrees, which is "correctly" decoded
to [[Bene%9A_decrees]]. So I tried [[en:Bene<U+009A> decrees]] on cs:, which
seems to work correctly for now. But I don't think this is correct behavior.
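
A sketch of why the U+009A workaround happens to reach the right article. It assumes the receiving wiki decodes the URL as UTF-8 and then converts the title to ISO 8859-1; that is inferred from the behavior described here, not taken from the MediaWiki source.

```python
from urllib.parse import quote, unquote_to_bytes

# Link text containing the literal control character U+009A.
link = "Bene\u009a_decrees"

# A UTF-8 wiki encodes U+009A as the two bytes C2 9A.
url = quote(link, encoding="utf-8")

# Assumed receiving side: decode as UTF-8, then convert to ISO 8859-1.
received = unquote_to_bytes(url).decode("utf-8")
stored_bytes = received.encode("iso-8859-1")

# A browser that treats ISO 8859-1 as windows-1252 shows byte 9A as U+0161.
displayed = stored_bytes.decode("cp1252")

print(url)           # Bene%C2%9A_decrees
print(stored_bytes)  # b'Bene\x9a_decrees'
print(displayed == "Bene\u0161_decrees")  # True
```

The link works only because U+009A and the windows-1252 "š" share the byte value 9A, which is why it looks like an accident rather than intended behavior.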

A probable cause could be that, IIANM, the byte %9A is not a printable character
in proper ISO 8859-1, only in the windows-1252 extensions, so maybe the article
on en: should not have that name at all. But in that case MediaWiki should
probably guard against such titles.
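
To illustrate the kind of guard meant here, a minimal sketch follows. The function name `c1_bytes` is hypothetical and nothing like it is claimed to exist in MediaWiki; it simply flags bytes in the range 0x80-0x9F, which ISO 8859-1 treats as control codes and windows-1252 mostly uses for printable characters.

```python
import unicodedata


def c1_bytes(raw_title: bytes) -> list:
    """Return the byte values in 0x80-0x9F found in a stored title."""
    return [b for b in raw_title if 0x80 <= b <= 0x9F]


stored = b"Bene\x9a_decrees"

flagged = c1_bytes(stored)

# The same byte means different things in the two encodings.
as_latin1 = stored.decode("iso-8859-1")[4]
as_cp1252 = stored.decode("cp1252")[4]

print([hex(b) for b in flagged])        # ['0x9a']
print(unicodedata.category(as_latin1))  # Cc (control character)
print(as_cp1252 == "\u0161")            # True
```

A title check along these lines could reject or warn about such names on an ISO 8859-1 wiki before they are saved.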


Version: 1.4.x
Severity: normal
URL: http://en.wikipedia.org/wiki/Bene%C5%A1_decrees

Details

Reference
bz1679

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:15 PM
bzimport set Reference to bz1679.
bzimport added a subscriber: Unknown Object (MLST).

plugwash wrote:

As you say, that page has a char in its title that is not in iso-8859-1 but is in
windows-1252, and most browsers treat iso-8859-1 as windows-1252.

However, mediawiki can't handle your inbound interwiki because it can't convert
U+0161 to iso-8859-1.

there are three possible fixes to this

1: convert those incoming interwikis to windows-1252
2: eliminate windows-1252 chars from en (they shouldn't really be there anyway
especially not in article titles)
3: convert en to unicode taking account of the windows-1252 chars

I suspect the reason there was a literal control code in de was a conversion
from iso-8859-1 to utf-8 that did not take account of the possibility that
windows-1252 chars may be present.
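
Fixes 1 and 3, and the suspected de: conversion problem, can be sketched as below. Both function names are hypothetical and this is not how MediaWiki implements anything; the per-character fallback to ISO 8859-1 is an assumption added here to cover the few byte values that Python's cp1252 codec leaves undefined.

```python
def incoming_title_to_bytes(title: str) -> bytes:
    """Fix 1: convert an incoming interwiki title to windows-1252 bytes."""
    out = bytearray()
    for ch in title:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Falls back to ISO 8859-1; raises if the char is in neither set.
            out += ch.encode("iso-8859-1")
    return bytes(out)


def stored_title_to_unicode(raw: bytes) -> str:
    """Fix 3: convert stored bytes to Unicode, honoring windows-1252 chars."""
    out = []
    for b in raw:
        byte = bytes([b])
        try:
            out.append(byte.decode("cp1252"))
        except UnicodeDecodeError:
            out.append(byte.decode("iso-8859-1"))
    return "".join(out)


stored = b"Bene\x9a_decrees"

# Both link forms seen in this report resolve to the same stored bytes.
from_s_caron = incoming_title_to_bytes("Bene\u0161_decrees")
from_control = incoming_title_to_bytes("Bene\u009a_decrees")

# A conversion that ignores windows-1252 produces the literal control code.
naive = stored.decode("iso-8859-1")
aware = stored_title_to_unicode(stored)

print(from_s_caron == stored, from_control == stored)  # True True
print(repr(naive))  # 'Bene\x9a_decrees'
print(aware == "Bene\u0161_decrees")  # True
```

The `naive` result matches the literal U+009A found in the de: link, which is consistent with the suspicion above about how it got there.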

zigger wrote:

*** Bug 2472 has been marked as a duplicate of this bug. ***

Non-ISO-8859-1 character in title, of course it doesn't work.

Non-issue with 1.5 and utf-8 conversion.

plugwash wrote:

1.5 is not in use yet. The real issue is that people were allowed to create
articles on iso-8859-1 wikis with titles using chars that iso-8859-1 allocates
to reserved control codes in the first place, and that browsers interpret
iso-8859-1 as windows-1252.

Also, you say moving to utf-8 is a solution, but see bug 1881 for why this is not
really the case!

epriestley added a commit: Unknown Object (Diffusion Commit). Mar 4 2015, 8:22 AM