
feature request: provide a notification for irregular Unicode characters
Closed, ResolvedPublic

Description

Author: gangleri

Description:
Hello!

This request proposes a combined solution for several bugs:
a) Bug 1414: Unicode whitespaces allowed in article title
b) Bug 1524: usernames should use unicode whitelist
c) Bug 2593: Non-printing characters allowed in registration
d) Bug 3819: strip phantom general punctuation characters from page titles

Requests and solutions can be "restrictive", but these would make it impossible
to use these characters at all. Personally I do not like restrictive solutions.

The solution proposed here is to implement a notification for "action=submit"
(preview or save) indicating that saving would generate "irregular links", links
containing "irregular characters".

The notification should list *all* "irregular links" individually (what counts as
an irregular link should be defined in a .php include file) and offer a "save anyway"
button.
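
To illustrate the kind of check meant here, a minimal sketch in Python (not MediaWiki's PHP, and not existing MediaWiki code). The names find_irregular_links, is_irregular and IRREGULAR_CATEGORIES are hypothetical, and the choice of Unicode categories is only an assumption about what "irregular" could mean; the real definition would live in the include file mentioned above.

```python
import re
import unicodedata

# Assumption: "irregular" = control, format and separator characters,
# except the ordinary ASCII space. The real list would be configurable.
IRREGULAR_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}

# Matches [[target]] and [[target|label]]; only the target is captured.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def is_irregular(ch):
    """True for characters the hypothetical rule set treats as irregular."""
    return ch != " " and unicodedata.category(ch) in IRREGULAR_CATEGORIES


def find_irregular_links(wikitext):
    """Return (target, code points) for each link with irregular characters."""
    hits = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1)
        bad = [f"U+{ord(ch):04X}" for ch in target if is_irregular(ch)]
        if bad:
            hits.append((target, bad))
    return hits
```

A preview or save could show the returned list next to a "save anyway" button and skip the warning when the list is empty.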

*notifications* are not new in MediaWiki:

  • Special:Upload notifies if the size of a file to be uploaded is above a limit.
  • Special:Upload notifies if a file would be uploaded with a title that
    already exists.

Both notifications use [[MediaWiki:Uploadwarning]]; button:
[[MediaWiki:Savefile]]; text: [[MediaWiki:Ignorewarning]] etc.

The proposed solution would meet the main goals:

  • generating a warning if something could happen that causes trouble
  • if the link is intended, it is up to the user to generate it

Benefit: The warning should prevent users from generating unintended "irregular links".

The list of the "irregular links" should display the "irregular characters" as
HTML entities if such exist else in &#nnnn; notation and *not* as UTF-8 because
it would not be possible to see / distinguish many of them as UTF-8.
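
A small sketch of such a display routine, in Python for illustration only. The function name show_irregular and the category set are assumptions; html.entities.codepoint2name is the standard library table of HTML 4 entity names, so characters without a named entity fall back to &#nnnn; notation.

```python
import html.entities
import unicodedata

# Assumption: same hypothetical notion of "irregular" as a category set.
IRREGULAR_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}


def show_irregular(text):
    """Replace irregular characters by a visible entity or &#nnnn; form."""
    out = []
    for ch in text:
        if ch != " " and unicodedata.category(ch) in IRREGULAR_CATEGORIES:
            name = html.entities.codepoint2name.get(ord(ch))
            # Prefer the named HTML entity, else the decimal reference.
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)
```

The result would still have to be HTML-escaped once more before being placed in the warning page, so that the reader sees the entity text itself and not the invisible character.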

*main* "irregular characters" identified until now:

  • whitespace / non-printing characters
  • general punctuation characters

The notification should support all encodings of the "irregular
characters": UTF-8, HTML entities (&lrm; &rlm; ...), &#nnnn;, &#xnnnn;, %XX%YY%ZZ
in links or their parameters (also inside {{localurl}}, {{fullurl}} ...).
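
One possible way to cover all these encodings is to reduce every form to plain characters before checking. The following Python sketch only illustrates the idea; decode_link_text is a hypothetical name, and the order of decoding (percent-encoding first, then entities) is an assumption, not a statement about how MediaWiki parses links.

```python
import html
from urllib.parse import unquote


def decode_link_text(raw):
    """Reduce %XX sequences and HTML entities to the characters they stand for."""
    # unquote() decodes %XX%YY%ZZ as UTF-8; html.unescape() handles named
    # entities as well as &#nnnn; and &#xnnnn; references.
    return html.unescape(unquote(raw))


# Five spellings of the same link text containing LEFT-TO-RIGHT MARK (U+200E).
forms = [
    "Foo\u200eBar",
    "Foo&lrm;Bar",
    "Foo&#8206;Bar",
    "Foo&#x200E;Bar",
    "Foo%E2%80%8EBar",
]
decoded = [decode_link_text(form) for form in forms]
```

After this step all five spellings are the same string, so a single check for irregular characters is sufficient.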

The proposed solution would make it easy to identify such forms of vandalism or
mistakes caused by copy and paste or incorrect editing due to insertion /
deletion of such characters. Detecting and fixing them now is very time consuming.


*other* "irregular characters"
It should be evaluated whether this function can also be used for "Unicode character
normalisation". This concerns MediaWiki's conversion of Unicode
precomposed characters into a sequence of Unicode characters.

An optimal achievement would be to generate "proposals" "what to replace with
what" offering checkboxes beside the links.

Example:
A Unicode character HEBREW LETTER ALEF WITH PATAH - U+FB2E would be replaced
anyway by MediaWiki with the two characters HEBREW LETTER ALEF - U+05D0 and
HEBREW POINT PATAH - U+05B7. So if we change the characters in the built-in
title normalisation, why not also be able to change

  • the &#nnnn; representation &#64302; to &#1488;&#1463;
  • the &#xnnnn; representation &#xFB2E; to &#x05D0;&#x05B7;
  • the %EF%AC%AE to %D7%90%D6%B7

in the source of the page?
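
The replacement proposals for this example could be computed as in the following Python sketch. It assumes that the built-in title normalisation corresponds to Unicode NFC, which matches the replacement described above; numeric_entities and propose_replacements are hypothetical names.

```python
import unicodedata
from urllib.parse import quote


def numeric_entities(text, hexadecimal=False):
    """Spell out every character as a numeric character reference."""
    if hexadecimal:
        return "".join(f"&#x{ord(ch):04X};" for ch in text)
    return "".join(f"&#{ord(ch)};" for ch in text)


def propose_replacements(text):
    """Return (old, new) pairs for each encoding if NFC changes the text."""
    normalized = unicodedata.normalize("NFC", text)
    if normalized == text:
        return []
    return [
        (numeric_entities(text), numeric_entities(normalized)),
        (numeric_entities(text, True), numeric_entities(normalized, True)),
        (quote(text), quote(normalized)),
    ]


# HEBREW LETTER ALEF WITH PATAH, the example from this report.
proposals = propose_replacements("\uFB2E")
```

Each pair could be offered with its own checkbox, so that an editor can still keep the precomposed form where it is intended.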
Keeping these only causes trouble. See Bug 3860: links generated with
precombined characters show red despite the fact that the normalised links exist
testcase: [[wiktionary:yi:bugzilla/03860]]

Because changes would be controlled by checkboxes, it would still be possible to
maintain precombined characters for documentation, testing ... However, fixing /
"converting to the standard" would be achieved with a "built-in help" "knowledge
tool" and can save much time.

some bugs dealing with Unicode normalization:

  • Bug 1375: Unicode normalization leaves red links
  • Bug 1527: problem on URL with Devanagari characters
  • Bug 2399: Unicode normalization interferes with Hebrew and Arabic with vowels

Best regards reinhardt [[user:gangleri]]


Version: unspecified
Severity: enhancement
URL: http://test.wikipedia.org/wiki/Bugzilla_003696

Details

Reference
bz4185

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 8:58 PM
bzimport set Reference to bz4185.
bzimport added a subscriber: Unknown Object (MLST).

gangleri wrote:

(In reply to comment #0)

An optimal achievement would be to generate "proposals" "what to replace with
what" offering checkboxes beside the links.

This handles "character conversion".
adding blocks
Bug 3985: character conversion (tracking)

gangleri wrote:

*note*

This request handles only the occurrence of "irregular characters" in links. For
the handling in the rest of the page source see
Bug 4012: feature request: add a flexible magic character conversion to the
built-in editor

gangleri wrote:

*note*

Because this request is related to action=submit, it should also analyse
{{PAGENAME}}. This will prevent creating such pages and alert editors to
the problem.

However, this request does not specify an analysis of {{PAGENAME}} for other
actions such as view, watch, history, move, delete, validate etc.

Problem characters would simply be forbidden. "Notification" is unnecessary.

gangleri wrote:

REOPENing this bug and changing title to
feature request: provide a notification for irregular Unicode characters

Dear friends;

http://test.wikipedia.org/wiki/Bugzilla_003696 describes how persistent and irritating *invisible* Unicode characters (such as the General Unicode Punctuation characters) can be.

As documentation, text was copied and pasted from the page
http://aleph1.libnet.ac.il/F/?func=find-b&find_code=WSB&request=9657318130

General Unicode Punctuation characters *infected*

  1. http://test.wikipedia.org/w/index.php?diff=prev&oldid=43229
  2. http://test.wikipedia.org/w/index.php?diff=prev&oldid=43230
  3. http://test.wikipedia.org/w/index.php?diff=prev&oldid=43231

and whatever other pages, emails etc. which used these pages as a source.

[[user:Splarka]] made http://test.wikipedia.org/wiki/MediaWiki:Gadget-EvilUnicodeConverter
which is available for tests at http://test.wikipedia.org/wiki/Special:Gadgets

With this tool it is possible to identify a configurable set of "''Evil Unicode characters''".
The source of the page content is displayed as

Author  Title   Year    Library         Sysno

1

‫ לנסקי, אהרן,1955- ‬     ‫ נגד כיוון ההיסטוריה :הרפתקאותיו המופלאות של האיש שהצ ‬     2005    HAI Haifa U.    006639172

2

This is a very convenient way to eliminate all unwanted "''Evil Unicode characters''".

Please reconsider including this or similar code as a standard function in MediaWiki.

Thanks in advance for all your efforts.

Best regards
Reinhardt [[user:Gangleri]]

Sounds like a job for an extension or a gadget, which already seems to exist.