
strip phantom general punctuation characters from page titles
Closed, ResolvedPublic


Author: gangleri

Sorry for this!


a) I tested character normalisation, which seems to be part of title normalisation.
For precombined versus NON-precombined characters this works fine: both forms
point to the same page despite different coding.
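
The behaviour described in a) can be reproduced outside the wiki. The following is only a sketch: it assumes the title normalisation is equivalent to Unicode NFC, and `normalize_title` is a hypothetical helper, not a MediaWiki function.

```python
import unicodedata

def normalize_title(title: str) -> str:
    """Return the NFC form so precombined and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", title)

precombined = "\u00e9"    # e with acute as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# The raw strings differ, the normalised ones do not.
print(precombined == decomposed)
print(normalize_title(precombined) == normalize_title(decomposed))

# NFC leaves a trailing RIGHT-TO-LEFT EMBEDDING (U+202B) in place,
# so such titles still differ after normalisation.
print(normalize_title("abc\u202b") == normalize_title("abc"))
```

Under that assumption, normalisation alone does not merge titles that differ only by a BiDi control character, which matches point b).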

b) The bug's URL will list four different pages with an "identical optical title".
There are "phantom" trailing general punctuation characters generating different
URLs. Compare:
Unicode Character 'RIGHT-TO-LEFT EMBEDDING' (U+202B)

UTF-8 (hex) 0xE2 0x80 0xAB (e280ab)
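
As an illustration of how two titles that look the same end up at different URLs, here is a small sketch. The title "Example" and the helper `title_url_path` are made up for the demonstration; only the code point U+202B and its UTF-8 bytes come from this report.

```python
from urllib.parse import quote

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING

def title_url_path(title: str) -> str:
    """Percent-encode a title as UTF-8, as it would appear in a URL path."""
    return quote(title, safe="")

clean = "Example"
phantom = "Example" + RLE   # renders the same, carries one extra code point

print(title_url_path(clean))    # Example
print(title_url_path(phantom))  # Example%E2%80%AB
```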

The generated URLs are:

There are many aspects to this:
a) possible vandalism - suggestion: Please evaluate whether "phantom" = unnecessary
leading or trailing punctuation should be stripped from database titles
++ this looks like a normalisation
b) garbage in - garbage out

Regards Reinhardt [[user:gangleri]]

P.S. I ran into this because of textual ambiguities at Wikipedia in Yiddish
relating to the usage of "tsvey vovn" versus "vov + vov", "tsvey-yudn" versus
"yud + yud" etc.

example 1: There is an article [[yi:וויץ]] but not [[yi:װיץ]] .

example 2: contains "vey iz (tsu) mir"
which is written *there* both with "vov + vov" and "yud + yud". Nevertheless translates with
"tsvey vovn" and "tsvey-yudn": ‫װײ איז (צו) מיר!

It seems that automatic character substitution is not possible because of
ambiguities when three characters meet, as in
farvunderung - פֿאַרווונדערונג , "farvundert" - פֿאַרווונדערט
and the other way around at
oyspruvn - אויספּרווון
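
The ambiguity can be sketched in a few lines. This assumes "tsvey vovn" means the single code point U+05F0 (HEBREW LIGATURE YIDDISH DOUBLE VAV) and "vov" means U+05D5; `single_substitutions` is a hypothetical helper. It says nothing about which spelling is correct for a given word, only that a blind replacement has to pick one.

```python
VOV = "\u05d5"          # HEBREW LETTER VAV
TSVEY_VOVN = "\u05f0"   # HEBREW LIGATURE YIDDISH DOUBLE VAV

def single_substitutions(word: str) -> list:
    """Return every spelling obtained by replacing one vov+vov pair with the ligature."""
    pair = VOV + VOV
    results = []
    start = word.find(pair)
    while start != -1:
        results.append(word[:start] + TSVEY_VOVN + word[start + 2:])
        # Step by one so that overlapping pairs are found as well.
        start = word.find(pair, start + 1)
    return results

three_vovs = VOV * 3
candidates = single_substitutions(three_vovs)
print(len(candidates))  # 2: ligature + vov, or vov + ligature

# str.replace always takes the leftmost pair, which is only one of the two readings.
print(three_vovs.replace(VOV + VOV, TSVEY_VOVN) == candidates[0])
```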

Version: unspecified
Severity: trivial



Event Timeline

bzimport raised the priority of this task from to Normal. Nov 21 2014, 8:54 PM
bzimport added a project: Wikimedia-Rdbms.
bzimport set Reference to bz3819.
bzimport added a subscriber: Unknown Object (MLST).
bzimport created this task. Oct 28 2005, 5:01 PM

gangleri wrote:

You will find typical examples at the end of and at .

Summary is available at .

These pages were created because I have "compiled" the titles with "copy and
paste" (of Hebrew characters) between different Firefox browsers on Windows.

A workaround is to use a useful keyboard as described at
and avoid this silly "copy and paste".
See : Yiddish Pasekh and Keyman
keyboard for Windows

Regards Reinhardt [[user:gangleri]]

gangleri wrote:


This bug can cause some confusion in a wiki. I assume that many contributors
use "copy and paste" to insert a few Hebrew characters.

As you can see from
%E2%80%AB can be

  • at the beginning of a title
  • at the end of a title
  • (I assume also inside the title)

There would be different things to do:

  • avoid generation of such titles during editing, linking etc.
  • clear the database - this is a maintenance issue
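
Both tasks could share one filter. The sketch below is only an assumption about how such a filter might look: the set of code points is taken from the characters named in this report, and `strip_bidi_controls` and `find_affected` are hypothetical names, not existing MediaWiki code.

```python
import re

# Assumption: the BiDi control characters mentioned in this report.
BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"
_BIDI_RE = re.compile("[" + BIDI_CONTROLS + "]")

def strip_bidi_controls(title: str) -> str:
    """Remove the listed control characters wherever they occur in the title."""
    return _BIDI_RE.sub("", title)

def find_affected(titles):
    """Return the titles that would change, e.g. as input for a maintenance run."""
    return [t for t in titles if strip_bidi_controls(t) != t]

titles = ["\u202bAlpha", "Beta\u202c", "Ga\u200fmma", "Delta"]
print([strip_bidi_controls(t) for t in titles])  # ['Alpha', 'Beta', 'Gamma', 'Delta']
print(len(find_affected(titles)))                # 3
```

Applying the filter when a title is created would cover the first point; running `find_affected` over existing titles would produce the list for the cleanup.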

Regards Reinhardt [[user:gangleri]]

gangleri wrote:


I found more incorrect titles (only with a leading RIGHT-TO-LEFT EMBEDDING) in
other projects with

Besides the RTL wikis [[ar:]] [[fa:]] [[he:]] [[ur:]] [[yi:]], their Wiktionaries etc.,
all other projects can be affected.

These wrong titles at [[yi:]] have been created by 5 contributors. This shows
that it is a general problem. If contributors "copy" text from a web page and
paste it (as Hebrew characters) into the browser's URL field (I mainly use
Firefox myself), they may also copy and paste leading or trailing punctuation
characters, and the browser will *generate* these URLs.

Of course this is not the proper way to generate titles (one should use a
keyboard), and it may or may not be a Firefox issue (I do not know if it is reported at if not please do so), but it is common practice among a significant
number of contributors to RTL projects.

You will find the affected titles at:

Best regards Reinhardt [[user:gangleri]]

gangleri wrote:

more characters:

I found
which originally contained a trailing %E2%80%AC

Unicode Character 'RIGHT-TO-LEFT EMBEDDING' (U+202B)
UTF-8 (hex) 0xE2 0x80 0xAB (e280ab)

Compare also:
Unicode Character 'LEFT-TO-RIGHT EMBEDDING' (U+202A)
UTF-8 (hex) 0xE2 0x80 0xAA (e280aa)
Unicode Character 'POP DIRECTIONAL FORMATTING' (U+202C)
UTF-8 (hex) 0xE2 0x80 0xAC (e280ac)
Unicode Character 'LEFT-TO-RIGHT OVERRIDE' (U+202D)
UTF-8 (hex) 0xE2 0x80 0xAD (e280ad)
Unicode Character 'RIGHT-TO-LEFT OVERRIDE' (U+202E)
UTF-8 (hex) 0xE2 0x80 0xAE (e280ae)
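
A list like this is easy to get wrong by hand, so here is a sketch that derives each entry from the code point itself using only the Python standard library. The helper name `describe` is made up for this example.

```python
import unicodedata
from urllib.parse import quote

CODE_POINTS = [0x202A, 0x202B, 0x202C, 0x202D, 0x202E]

def describe(cp: int) -> str:
    """Return code point, Unicode name, UTF-8 hex and percent-encoded form."""
    ch = chr(cp)
    utf8_hex = ch.encode("utf-8").hex()
    return f"U+{cp:04X} {unicodedata.name(ch)} {utf8_hex} {quote(ch)}"

for cp in CODE_POINTS:
    print(describe(cp))
```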

Variations / modifications of
are of limited use only, because (theoretically) these characters can be included
anywhere in a title.

I will open another enhancement request for a special page allowing an in-string
search of titles by specifying %nn values.
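
What such a search might do can be sketched as follows. This is not the requested special page, just an assumption about its core logic: decode the %nn pattern as UTF-8 and look for it anywhere in each title. `find_titles` and the sample titles are hypothetical.

```python
from urllib.parse import unquote

def find_titles(titles, percent_pattern: str):
    """Return the titles containing the decoded %nn pattern at any position."""
    needle = unquote(percent_pattern)   # "%E2%80%AB" becomes U+202B
    return [t for t in titles if needle in t]

titles = ["Alpha", "\u202bBeta", "Gam\u202bma", "Delta\u202c"]
hits = find_titles(titles, "%E2%80%AB")
print(len(hits))  # 2: the pattern is found at the start and inside a title
```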

gangleri wrote:

(In reply to comment #4)

I will open another enhancement request for a special page allowing an in-string
search of titles by specifying %nn values.

bug 3887: create a special page for instring search of titles specifying %nn values

gangleri wrote:

sorry for this


You may say: "garbage in, garbage out".

But this seems to be a subsequent error. It "seems" to interfere with the setup
of case-sensitive / non-case-sensitive titles. The earlier this bug gets
fixed, the fewer subsequent errors we get.

gangleri wrote:

sorry for this
this title is invalid because it starts with %E2%80%AB = Unicode Character
'RIGHT-TO-LEFT EMBEDDING' (U+202B)

However it is a mess editing BiDi and generate pages like
and also taking care of all these !*%$$€@*# bugs.

These pages look fine, but the titles they link to should be invalid and the
links should not show red. It would be best to leave them with their [[ and ]]
brackets, the same as invalid links.

Best regards Reinhardt [[user:gangleri]]

gangleri wrote:

(In reply to comment #7)

sorry for this
and also taking care of all these !*%$$€@*# bugs.

I fixed the involved links, so the Whatlinkshere is no longer valid. Compare:
bug 3894 white space characters, BiDi control characters should show up in diff

gangleri wrote:

fixing this would later require a validation according to
bug 3904 disallow user pages and user_talk pages starting with lower case on
case sensitive wikis

adding blocks bug 3904

gangleri wrote:

Hi! The code on FiverAlpha is changing.
and bug 3888 comment 3

The category illustrates that the
punctuation characters can be used for fraud and vandalism.

If you are not used to the punctuation topics you may *not* notice that
the edit of this *false account* contains punctuation characters in

  • one way to see these characters is to verify the URL; this is simple if most
    of the contained characters are 7-bit ASCII;
  • another way to see these characters is to insert the cursor in the text and
    move it with the mouse through the text area
  • another way to see these characters is to mark the text with the mouse
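
A programmatic complement to these manual checks is to make the characters visible in the text itself. The sketch below assumes that everything in the Unicode general category "Cf" (format characters, which includes the BiDi controls named in this report) should be shown; `reveal` and the sample account name are hypothetical.

```python
import unicodedata

def reveal(text: str) -> str:
    """Replace every format character (category Cf) with a visible <U+XXXX> tag."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            parts.append(f"<U+{ord(ch):04X}>")
        else:
            parts.append(ch)
    return "".join(parts)

print(reveal("\u202bSomeUser"))   # <U+202B>SomeUser
print(reveal("SomeUser"))         # SomeUser
```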

Because these characters cause more trouble than benefit, I suggest
suppressing the punctuation characters in titles until a generally accepted
solution can be provided. As it is now, mimic accounts can be created.
This opens the door to fraud and vandalism.

Regards Reinhardt [[user:gangleri]]

gangleri wrote:

(In reply to comment #4)

more characters:

I found also
Unicode Character 'LEFT-TO-RIGHT MARK' (U+200E)
UTF-8 (hex) 0xE2 0x80 0x8E (e2808e)
Unicode Character 'RIGHT-TO-LEFT MARK' (U+200F)
UTF-8 (hex) 0xE2 0x80 0x8F (e2808f)


gangleri wrote:


I would like to CANCEL this request / withdraw it. (There is no such MediaZilla

The request is too restrictive for me, and other methods to avoid the problem / to
fix affected pages should be found.

Such tools are requested at

  • Bug 4012: feature request: add a flexible magic character conversion to the
    built-in editor
    which would allow identifying these characters in the editor
  • Bug 4185: feature request: provide a notification for irregular links
    which would warn users before submitting such links / such pages (either new or

Closing as requested

gangleri wrote:

As the status is now, this is more of a DUPLICATE of

bug 3696 Unicode Control Characters should be restricted in title text (RLM, LRM, RLO, LRO, . . .)

*** This bug has been marked as a duplicate of bug 3696 ***

Restricted Application added a project: I18n. Jun 2 2015, 2:21 PM