Punctuation like ".", "?" and "!" at the end of page title in links not interpreted as part of the URL by various applications
Open, Low, Public

Description

I have the page https://th.wikipedia.org/wiki/%E0%B8%8B.%E0%B8%95.%E0%B8%9E. on my watchlist and have set my preferences to email me about changes to watchlisted pages. When someone changed that page, I got the usual email about the change. However, the link to the page is wrong: the dot at the end of the link is missing. Here is part of the HTML code.

<a target="_blank" href="http://th.wikipedia.org/wiki/%E0%B8%8B.%E0%B8%95.%E0%B8%9E">
http://th.wikipedia.org/wiki/%
<wbr></wbr>
E0%B8%8B.%E0%B8%95.%E0%B8%9E
</a>

Details

Reference
bz48940

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 1:40 AM
bzimport added a project: MediaWiki-Email.
bzimport set Reference to bz48940.
bzimport added a subscriber: Unknown Object (MLST).

This is working for me locally (MediaWiki's part is, anyway: Thunderbird isn't parsing the '.' as part of the URL, but it is there). Are you sure your email client isn't simply treating the '.' as not part of the link? (The email is sent as plain text; any conversion to HTML is done by your email provider or client.)

T49160 and T40265 are similar - which email client is this about?
Might be a bug in the parsing of URLs in your email client.

I use Gmail on Firefox (Ubuntu).

Even though this is not a MediaWiki bug, it seems that a number of email clients parse URLs this way. If we just encode that dot, everything works properly, doesn't it?

Using str_replace to change '.' to '%2E' on $pageTitleUrl in UserMailer::composeCommonMailtext seems to make it detect and link the URL properly, but I don't like this solution at all.

Nemo_bis renamed this task from Links in emails: Full stop/dot at end of URL title not interpreted as part of the URL by email clients to Links in emails: Dot (.), "?" and "!" at end of username or title not interpreted as part of the URL by email clients. Dec 18 2014, 8:31 PM
Nemo_bis set Security to None.
Nemo_bis added subscribers: Billinghurst, jeremyb, tstarling and 4 others.

Using str_replace to change '.' to '%2E' on $pageTitleUrl in UserMailer::composeCommonMailtext seems to make it detect and link the URL properly, but I don't like this solution at all.

Percent-encoding everything wouldn't be absurd; we already do so for some parts of emails. However, the result can be very ugly, especially for locales using non-Latin scripts (Cyrillic, Devanagari, etc.).

matmarex renamed this task from Links in emails: Dot (.), "?" and "!" at end of username or title not interpreted as part of the URL by email clients to Punctuation like ".", "?" and "!" at the end of page title in links not interpreted as part of the URL by various applications. Apr 5 2024, 5:15 PM

We have a similar problem when a plain URL is mentioned in wikitext: trailing punctuation characters are interpreted as part of the surrounding text rather than part of the URL. The same goes for many messaging tools.

A common solution is to append a _ to the URL. When the page name is parsed, the _ is treated as a trailing space and trimmed by the wiki server, but tools that try to generate automatic links from http: within text will not take it for punctuation.

Basically, a tool cannot decide whether a ? marks a question in a sentence that ends with a URL or is part of the URL itself. A sentence may end with a period, so it makes sense to separate the . from the URL when it is followed by whitespace or the end of the text.

In wikitext, any of (,.;? followed by whitespace or the end of the text terminates the URL just before that character.
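The trimming behaviour described in this comment can be illustrated with a rough sketch. This is not MediaWiki's parser or any real client's autolinker; the function name `autolink_span` and the `TRAILING` character set are assumptions made for illustration, loosely based on the characters named in this thread.

```python
import re

# Assumed set of characters an autolinker treats as sentence punctuation
# when they appear at the very end of a URL (hypothetical, for illustration).
TRAILING = ").,;?!"


def autolink_span(text):
    """Return the URL a punctuation-trimming autolinker would pick, or None."""
    match = re.search(r"https?://\S+", text)
    if match is None:
        return None
    # Drop any run of trailing punctuation: this is where the last
    # character of the page title gets lost.
    return match.group(0).rstrip(TRAILING)


print(autolink_span("See https://en.wikipedia.org/wiki/Mark_Kelley_(bassist) now"))
```

With a rule like this, a title that really ends in one of those characters can never survive autolinking unchanged.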

Unlike URL encoding, the _ keeps the URL readable, especially if all other characters of the page name are ASCII.
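A minimal sketch of the trailing-underscore workaround, assuming (as stated in this comment) that the wiki server treats a trailing _ like a trailing space and trims it. The function name `safe_share_url` and the `PUNCTUATION` set are hypothetical.

```python
# Assumed set of trailing characters that autolinkers tend to cut off.
PUNCTUATION = ").,;?!"


def safe_share_url(url):
    """Append "_" when the URL ends in a character likely to be dropped."""
    if url and url[-1] in PUNCTUATION:
        return url + "_"
    return url


print(safe_share_url("https://en.wikipedia.org/wiki/Mark_Kelley_(bassist)"))
```

URLs that do not end in one of the listed characters are returned unchanged, so the helper could be applied to every link without affecting ordinary titles.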

This is not just an email issue; it also sometimes breaks links when texting (SMS).
I've been testing texting and emailing wiki article links for articles that end in ")". There are a lot of them that end in a parenthesis, by the way, e.g. https://en.wikipedia.org/wiki/Mark_Kelley_(bassist)
My Android phone often cuts off the last ")" from the link, so when you click on the link, it doesn't go to the intended page. Seems like a potentially pretty big issue!

This task and T40265 should be merged since they seem to describe the same underlying problem, that if a character is treated as punctuation, it is not treated as part of the URL.

Technically speaking, these characters should be percent-encoded, but browsers no longer strictly require this, and we have moved away from encoding non-ASCII characters because it makes URLs in non-Latin languages unreadable.

I think we can strike a middle ground between encoding everything and encoding nothing: we encode specific characters, such as those that can be confused with punctuation, and continue to leave the rest unencoded. This would also fix T163314 and maybe T326365 as well.

If people agree to this I can submit a patch set to this effect.
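As a rough illustration of that middle ground (not the actual patch, and not MediaWiki code), the sketch below percent-encodes only the run of punctuation-like characters at the end of a URL and leaves everything else as it is. The function name and the `CONFUSABLE` set are assumptions.

```python
# Assumed set of characters that link detectors may mistake for
# sentence punctuation at the end of a URL.
CONFUSABLE = ".,;:?!)"


def encode_trailing_punctuation(url):
    """Percent-encode only the confusable characters at the end of url."""
    stripped = url.rstrip(CONFUSABLE)
    tail = url[len(stripped):]
    # All characters in CONFUSABLE are ASCII, so one %XX escape each suffices.
    encoded_tail = "".join("%{:02X}".format(ord(ch)) for ch in tail)
    return stripped + encoded_tail


print(encode_trailing_punctuation(
    "http://th.wikipedia.org/wiki/%E0%B8%8B.%E0%B8%95.%E0%B8%9E."))
```

Dots inside the title stay untouched; only the final one becomes %2E, so the URL stays mostly readable.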

I do not think any wiki server can solve the problem; at least, no trivial patch would be meaningful.

The problem lies in messenger software and in office document editors.

  • They treat a ),.;? at the end of a URL as punctuation of the surrounding text, not as part of the URL.
  • Even wiki syntax treats them this way.
  • The URL transferred to the wiki server is simply missing the last character of the page name. This cannot be remedied in a sustainable way.

For any application it is easier to append a _ to the URL, which solves all of these problems and is much easier than knowing the percent-encoding of the missing character.

  • When viewing a wiki page whose name ends in ),.;? the URL declared as the official URL and offered by the browser for copy and paste (and every such URL occurring in wikitext links) could have a _ appended, so that a safe URL is retrieved from the browser. That is the only valid solution I can imagine.

On German Wikipedia, when a request hits a missing page (a redlink), we check whether a page with an additional trailing ).? exists, and if so we suggest viewing that completed page.

  • However, this is only a suggestion and must not be autocorrected anywhere, to avoid duplicated page names with inconsistent spelling and missing characters. Automatic correction would effectively be a kind of redirect page that is not tracked, which is very dangerous. Moreover, there might be two different target pages, one ending in . and another ending in ? etc.
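The suggestion check described above can be sketched as follows. This is not the German Wikipedia implementation; `suggest_completed_titles` is a hypothetical name, and `existing_titles` is a stand-in for whatever page-existence lookup the wiki provides. The function returns every match instead of picking one, since several differently terminated pages may exist.

```python
def suggest_completed_titles(missing_title, existing_titles, suffixes=(")", ".", "?")):
    """List existing titles equal to missing_title plus one trailing character."""
    candidates = (missing_title + suffix for suffix in suffixes)
    # Only suggest; never redirect automatically, because more than one
    # candidate may exist.
    return [title for title in candidates if title in existing_titles]


# Hypothetical set of page titles used only for this example.
pages = {"Foo.", "Foo?", "Bar"}
print(suggest_completed_titles("Foo", pages))
```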