Author: foenyx
Description:
<FoeNyx> the « » (U+2003 em space) should be an unvalid article name, no ?
<zwitter> all whitespace other than U+0020 should be
Version: unspecified
Severity: normal
Status | Subtype | Assigned | Task |
---|---|---|---|
Declined | | None | T5969 Unicode (UTF-8, utf8) compatibility (tracking) |
Resolved | | None | T3414 Unicode whitespaces allowed in article title |
comment from bug 1971:
Moving a page to a title like [[« Pour l'Ukraine unie ! »]] creates a page with
non-breaking spaces in the title. The page move was done here:
http://fr.wikipedia.org/w/index.php?title=Pour_une_Ukraine_unie_%21&action=history
The resulting page is:
http://fr.wikipedia.org/wiki/%C2%AB%C2%A0Pour_l%27Ukraine_unie%C2%A0%21%C2%A0%C2%BB
Regarding the non-breaking space (U+00A0) specifically, it's generally transformed silently into U+0020 spaces when it goes
through the <textarea>->submit edit cycle and is not preserved, making it extra annoying.
I've just done a little research on Unicode whitespace handling; the Zs, Zl, and Zp character classes seem to be relevant, and the
set of them or some variant is what's counted by e.g. Java's Character.isSpaceChar() and .NET's Char.IsSeparator().
It might make sense to explicitly disallow the Zl and Zp chars (line separator and paragraph separator), and normalize all the Zs
chars to spaces (well, underscores) in title processing.
A quick grep of the current UnicodeData.txt database lists:
0020;SPACE;Zs;0;WS;;;;;N;;;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
202F;NARROW NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
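As a rough illustration of the proposal above (reject Zl/Zp, fold every Zs character to an underscore), here is a minimal Python sketch. The function name `normalize_title_whitespace` is hypothetical and this is not MediaWiki's actual title code; collapsing runs of underscores and trimming them at the ends is an extra assumption on my part. The categories come from whatever Unicode version the Python build ships, so individual characters may be classified differently from the listing above.

```python
import re
import unicodedata


def normalize_title_whitespace(title: str) -> str:
    """Fold Unicode space separators (Zs) to '_' and reject Zl/Zp.

    Hypothetical helper for illustration only.
    """
    out = []
    for ch in title:
        category = unicodedata.category(ch)
        if category in ("Zl", "Zp"):
            # Line and paragraph separators are treated as invalid in titles.
            raise ValueError(f"disallowed separator U+{ord(ch):04X} in title")
        # Every space separator, including U+0020 itself, becomes '_'.
        out.append("_" if category == "Zs" else ch)
    # Assumption: runs of underscores collapse to one, and leading or
    # trailing underscores are dropped.
    return re.sub(r"_+", "_", "".join(out)).strip("_")


if __name__ == "__main__":
    # The fr.wikipedia title from bug 1971, with U+00A0 inside the guillemets.
    print(normalize_title_whitespace("\u00ab\u00a0Pour l'Ukraine unie\u00a0!\u00a0\u00bb"))
```

With this, titles that differ only in which kind of space they use end up with the same stored form.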
There is another problem with UTF-8 titles. A character from a foreign
codepage can look just like a normal character in our codepage.
You can find examples in
http://de.wikipedia.org/w/index.php?title=Spezial:Log&type=delete&user=&page=&limit=500&offset=50
Look for the entries from 1 May 2005, 03:45 to 03:55 ("K.D.St.V. CarοIus Маgnus").
View the HTML source of that page to see the difference. Examples:
<a
href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_%D0%9Ca%C9%A1nu%D1%95&action=edit"
<a
href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_Magnus&action=edit"
<a
href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolus_%CE%9Cagnu%D1%95&action=edit"
tsor (administrator of German WP)
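To make the lookalike characters in those links visible, the percent-encoded title can be decoded and every non-ASCII character listed by name. The following is only a small Python sketch; the helper name `non_ascii_chars` is made up for this example.

```python
import unicodedata
from urllib.parse import unquote


def non_ascii_chars(encoded_title: str) -> list:
    """Return (code point, Unicode name) for each non-ASCII char in the title."""
    decoded = unquote(encoded_title)  # percent-decoding assumes UTF-8
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in decoded
        if ord(ch) > 0x7F
    ]


if __name__ == "__main__":
    # Title parameter copied from the second link above.
    title = "%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_Magnus"
    for code_point, name in non_ascii_chars(title):
        print(code_point, name)
```

For that title, the output shows that the leading "K" is a Greek capital kappa and that the "S", "C" and final "s" are Cyrillic letters.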
foenyx wrote:
(In reply to comment #5)
There is another problem with UTF-8 titles. A character from a foreign
codepage can look just like a normal character in our codepage.
I reopened bug 2042, as it's not exactly the same issue.
(This bug is a subset of bug 2042, covering only homograph pairs of whitespace characters.)
rickblock wrote:
Curly vs. straight quotes have been causing confusion at en lately as well.
foenyx wrote:
(In reply to comment #7)
Curly vs. straight quotes have been causing confusion at en lately as well.
This bug is about whitespace characters; the quote confusion is probably
better suited to bug 2042.
ayg wrote:
Before anything is done on this, obviously a check needs to be run on the various wikis to see if they use these. It seems probable that IDEOGRAPHIC SPACE, for instance, should not be blacklisted. In general, there are various reasons to use various types of spaces, and I think it would be best if these were normalized for storage but not blacklisted, so that you can't have two article names that differ only in the type or number of spaces used, but you can still have unusual space characters in titles. This should be part of the eventual move to case-insensitivity for titles (bug 453).