Question: Bidi overrides and Unicode spaces removal from titles: why not zero-width space and horizontal tab?
Closed, Invalid
Public

Description

This is not a bug, just a question.

Looking at Title::secureAndSplit at
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Title.php?revision=104051&view=markup#l2722
(related to old bug 3696), I wonder why

  1. the zero-width space (U+200B, or UTF-8 E2 80 8B) is not stripped?
  2. the horizontal tab \t is not included in the whitespace regexp to be replaced by an underscore?

Oversights, or is there some reason? Are these stripped somewhere else already?
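
For readers without the source at hand, the two steps in question can be sketched roughly as follows. This is a minimal Python illustration, not the PHP in Title.php: the function name `normalize_title` and both character classes are assumptions chosen for the example, so check the linked revision for the real lists.

```python
import re

# Assumed character sets, for illustration only; the real lists are in Title.php.
# Bidi marks and overrides: U+200E, U+200F, U+202A..U+202E.
BIDI_CHARS = re.compile("[\u200e\u200f\u202a-\u202e]")
# Space-like characters that get collapsed into a single underscore.
# Neither U+200B (zero-width space) nor "\t" is part of this class.
SPACE_RUN = re.compile(
    "[ _\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]+"
)


def normalize_title(text: str) -> str:
    """Strip bidi controls, then turn runs of space-like chars into '_'."""
    text = BIDI_CHARS.sub("", text)
    text = SPACE_RUN.sub("_", text)
    return text.strip("_")


if __name__ == "__main__":
    for raw in (" Foo\u202e \u00a0Bar ", "Foo\u200bBar", "Foo\tBar"):
        print(repr(raw), "->", repr(normalize_title(raw)))
```

Under these assumed classes, a zero-width space or a tab passes through both steps unchanged, which is the behaviour being asked about.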


Version: 1.20.x
Severity: trivial
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=27446

Details

Reference
bz32717

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 12:05 AM)
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz32717.
bzimport added a subscriber: Unknown Object (MLST).

Adding Brion, since he probably knows the answer and whether this is a bug or not.

Zero-width space is required for some scripts, to insert a break between letters that would otherwise form ligatures.

Tab ... SHOULD be stripped, lemme check. :)

\t is just an outright forbidden char in titles.

(In reply to comment #2)
> Zero-width space is required for some scripts, to insert a break between
> letters that would otherwise form ligatures.

Maybe I misunderstand the purpose of these Unicode characters; I'm not a Unicode specialist. I thought that was the purpose of the zero-width non-joiner (U+200C)? Granted, the zero-width space (U+200B) probably has the same effect as the ZWNJ, since it marks an (invisible) word boundary, but I'd say that's just a side effect. Also, according to [[en:zero-width space]], this normally invisible word boundary may be expanded into visible whitespace by text justification. So I agree that stripping it would be wrong, but maybe it should be treated as an underscore.
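
One way to see how the Unicode database itself classifies these characters is Python's standard `unicodedata` module. A small sketch follows; the helper name `describe` is made up for this example, and it only reports properties, so it says nothing about what MediaWiki should do with them.

```python
import unicodedata


def describe(ch: str):
    """Return (name, general category, str.isspace()) for one character."""
    # Control characters such as "\t" have no Unicode name, hence the default.
    name = unicodedata.name(ch, "<control>")
    return name, unicodedata.category(ch), ch.isspace()


if __name__ == "__main__":
    for ch in ("\u200b", "\u200c", "\t", " "):
        print(f"U+{ord(ch):04X}", *describe(ch))
```

Both U+200B and U+200C are format characters (category Cf) rather than space separators, so a check based on "is this whitespace?" would not pick up either of them.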

Anyway, thanks for the answer, I see the rationale now. Whether it's 100% correct is less important to me. And perhaps people are using U+200B where they should actually use U+200C, and it's thus more user-friendly to treat it that way. I was just trying to understand what the thoughts behind this were.

> Tab ... SHOULD be stripped, lemme check. :)

"Outright forbidden": am I right that the tab is rejected at line 2834
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Title.php?revision=104051&view=markup#l2834
and that this depends on the configuration of $wgLegalTitleChars?

So, is an installation allowed to define \t as a legal title character, and if so, what happens then? (Or what would make the most sense then?) Replace it with an underscore?
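
To make the question concrete, here is a minimal Python sketch of a configurable legal-character check. Everything in it is an assumption for illustration: `DEFAULT_LEGAL_CHARS` is a hypothetical stand-in for a $wgLegalTitleChars-style setting (the body of a regex character class) and is not the real default, and `first_illegal_char` is an invented helper, not a MediaWiki function.

```python
import re

# Hypothetical stand-in for a $wgLegalTitleChars-style setting: the body of a
# regex character class. Not the real MediaWiki default.
DEFAULT_LEGAL_CHARS = r" %!\"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~+\u0080-\U0010ffff"


def first_illegal_char(title, legal_chars=DEFAULT_LEGAL_CHARS):
    """Return the first character outside the legal class, or None."""
    match = re.search("[^" + legal_chars + "]", title)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(repr(first_illegal_char("Foo\tBar")))
    # A site that explicitly adds the tab to its legal class:
    print(repr(first_illegal_char("Foo\tBar", DEFAULT_LEGAL_CHARS + r"\t")))
```

With the assumed default class the tab is reported as illegal. If a site appended `\t` to the class, the tab would pass this check, and since the whitespace regexp does not cover it, it would presumably end up in the title unchanged.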

I see a question and a discussion of behavior here, but I'm not sure there is a valid bug in this report...

Aklapper subscribed.

Closing per last comment