Block using U+200C for pagenames in Odia
Open, Low, Public, Feature

Description

Author: ansumang

Description:
Screenshot of exact titled page in Odia - One

Hello,

On Odia Wikipedia, we have encountered pages with exactly the same name (same title).

This makes it difficult to tell pages apart on Odia Wikipedia. Is there any solution?

Links to the Wikipedia pages are below:

http://or.wikipedia.org/s/5om and http://or.wikipedia.org/s/gj1

There are also two articles, http://or.wikipedia.org/wiki/ବାଲେଶ୍ୱର_ଜିଲ୍ଲା and http://or.wikipedia.org/w/index.php?title=ବାଲେଶ୍ଵର_ଜିଲ୍ଲା&redirect=no, whose titles look identical in the script; one redirects to the other!

Does this have anything to do with fonts? Do we need to redefine the Unicode font? :D

Thanks.


Version: 1.22.0
Severity: enhancement
URL: http://or.wikipedia.org

Attached:

similar_title_1.PNG (494×829 px, 56 KB)

Details

Reference
bz50936

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:57 AM
bzimport set Reference to bz50936.
bzimport added a subscriber: Unknown Object (MLST).

ansumang wrote:

Screenshot of exact titled page in Odia - Two

Attached:

similar_title_2.PNG (359×993 px, 49 KB)

http://or.wikipedia.org/s/5om is for ସମୟ and has unicode code points:
U+0B38 U+0B2E U+0B5F

http://or.wikipedia.org/s/gj1 is ସମ‌ୟ
U+0B38 U+0B2E U+200C U+0B5F

As you can see, both titles look the same but differ in the underlying data: the second has an extra U+200C.

U+200C is ZERO WIDTH NON-JOINER, an invisible character that behaves differently in different scripts.
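
To make the difference visible, the two titles can be dumped code point by code point. The following is only a minimal sketch: the function name `describe` is hypothetical, and the two strings are built from the code points listed above rather than fetched from the wiki.

```python
import unicodedata

def describe(title):
    """Return one 'U+XXXX NAME' entry per code point in the title."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in title]

# Titles rebuilt from the code points quoted in this comment.
title_5om = "\u0b38\u0b2e\u0b5f"        # U+0B38 U+0B2E U+0B5F
title_gj1 = "\u0b38\u0b2e\u200c\u0b5f"  # same, with U+200C inserted

for label, title in (("s/5om", title_5om), ("s/gj1", title_gj1)):
    print(label)
    for entry in describe(title):
        print("   ", entry)

# The strings render alike but are not equal as data.
print("identical:", title_5om == title_gj1)
```

Any tool that prints code points would serve equally well; the point is that string comparison sees four characters in one title and three in the other.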

I am not sure whether U+200C has any valid use in Odia (or). If it does not, its presence in a title should be treated as a spelling mistake.

This request would then probably turn into one for blocking U+200C from being used in page names.
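
As a rough illustration of what such a block could check, here is a sketch in Python. It is not how MediaWiki validates titles; `find_zwnj` and `strip_zwnj` are hypothetical helper names, and whether stripping is safe for Odia is exactly the open question in this task.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def find_zwnj(title):
    """Return the indexes at which U+200C occurs in the title."""
    return [i for i, ch in enumerate(title) if ch == ZWNJ]

def strip_zwnj(title):
    """Return the title with every U+200C removed (assumes it is never needed)."""
    return title.replace(ZWNJ, "")

proposed = "\u0b38\u0b2e\u200c\u0b5f"  # the title behind s/gj1
positions = find_zwnj(proposed)
if positions:
    print("U+200C found at index", positions, "- reject or warn")
    print("without it the title equals s/5om:", strip_zwnj(proposed) == "\u0b38\u0b2e\u0b5f")
```

A report-only check (using `find_zwnj` alone) would be the more cautious choice until it is known whether Odia spellings ever require the character.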

ansumang wrote:

(In reply to comment #2)

Thanks.

Then how can we know whether or not U+200C has a valid use? I couldn't find U+200C in the Odia Unicode chart. If this is rare we can ignore it; there are only 3 or 4 cases so far. We could wait and see whether more such cases turn up.

(In reply to comment #4)

I guess U+200C would be required. With the Lekhani typing tool, typing s+m+Y produces ସମ୍ୟ, whereas typing s+m+_ (Shift + "-") +Y produces ସମୟ. In the latter case, Shift + "-" ("_") inserts U+200C. Is there any way to avoid this problem other than blocking the character? I feel it would be needed for some spellings.
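
One non-blocking option, sketched here purely as an assumption and not as an existing MediaWiki feature, would be to leave U+200C allowed but flag titles that become identical once it is ignored. The helper `collision_groups` is a hypothetical name.

```python
from collections import defaultdict

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def collision_groups(titles):
    """Group titles that are equal once U+200C is ignored; return groups of 2+."""
    groups = defaultdict(list)
    for title in titles:
        groups[title.replace(ZWNJ, "")].append(title)
    return [group for group in groups.values() if len(group) > 1]

titles = [
    "\u0b38\u0b2e\u0b5f",        # s/5om
    "\u0b38\u0b2e\u200c\u0b5f",  # s/gj1, same letters plus U+200C
    "\u0b38",                    # unrelated single-letter title
]
for group in collision_groups(titles):
    print("look-alike titles:", [t.encode("unicode_escape").decode() for t in group])
```

Such a report would let editors merge or redirect the look-alike pages by hand while keeping U+200C available for spellings that need it.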

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:13 AM
Aklapper removed a subscriber: wikibugs-l-list.