Page MenuHomePhabricator

Dashes are parsed as part of autodetected URLs
Closed, DeclinedPublic

Description

Author: ayg

Description:
The provided URL demonstrates that the parser interprets any (or almost any)
character other than linebreak, tab, space, or nbsp as being part of a preceding
URL. Only alphanumerics and -_.~!*'();:@&=+$,/?%#[] can appear in a URI of any
kind.


Version: unspecified
Severity: minor
URL: http://en.wikipedia.org/wiki/User:Simetrical/Bug_6458

Details

Reference
bz6458

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 9:19 PM
bzimport set Reference to bz6458.
bzimport added a subscriber: Unknown Object (MLST).

Agree, but unicode URLs showed in 'nice form' (without % escaping) could be
affected. This shouldn't be allowed, but as several people asked here to not
escape urls on [[ ]], it's probably that some urls will break. Should we care?

ayg wrote:

Hrm, good point. MediaWiki does, of course, escape those automatically if input
as an actual URL (so em dash would be included but automatically translated to
%E2%80%95 for the href of the link). So this might actually be a feature, not a
bug. Note, however, that it only affects autodetected URLs, not single- or
double-bracketed links; it would be correct for those two to assume that invalid
characters are part of the link, because you know that the link only ends on a
space/] or pipe/]], respectively.

Regardless, it would be still better to just selectively exclude a few more
common punctuation marks (like em and en dash, the former of which is where I
noticed this at
http://en.wikipedia.org/w/index.php?title=Wikipedia:Image_copyright_tags&diff=prev&oldid=60640000#Other_public_domain_images)
and continue to parse, e.g., foreign-language characters as part of autodetected
URLs.

ayg wrote:

Patch

This patch treats trailing *or* internal 0xA0, 0x2000-0x200B, 0x200D-0x2015
(various Unicode spaces/dashes) as not being part of free links. Bracketed
links are unaffected.

attachment 6458b.patch ignored as obsolete

ayg wrote:

Patch

Fix a couple of trivial mistakes I just noticed in previous patch.

Attached:

This seems a weird exception. Aren't there thousands of other punctuation and control characters?

ayg wrote:

Yes, but many will either a) routinely occur in URLs (in encoded form) in any
language where they'd be likely to occur in text, or b) never legitimately occur
immediately after something intended to be a URL. If there are any other
characters to which neither of those points apply, those should probably be
added as well. U+2018 through U+2026 would probably be good candidates, and »
(U+BB).

(On the other hand, U+200D shouldn't be included, or at least I don't know what
precisely it does but have some vague idea it's common in some languages. I
must have included it by mistake.)

EN.WP.ST47 wrote:

This clearly has not been applied. The patch is rather restrictive and does not address many of the characters listed in the example, however many of those characters (hyphen, certainly, and exclamation point, and parenthesis) are found in URLs. Further, I can't think of any negative effect of allowing a rather liberal character set in urls, especially since the single bracket syntax allows the user to explicitly delimit the link and since there is no visual difference in that workaround. Taking the initiative to close this wontfix.

ayg wrote:

I explicitly said in comment 2 what the reason is for the patch. People will write stuff like "I like http://example.com/path/―do you?" with an unspaced em dash, and the autolinking will go too far. We already stop autolinking at spaces and brackets, there's no reason not to add a few more characters to the list.

Of course, you can always use brackets instead of a free link. But that goes both ways: if you have a URL that contains a dash and is parsed incorrectly with the change, you can always use square brackets to make the link work. The question is, will the change correctly guess more or fewer URLs than at present?

audreyt wrote:

Hi Aryeh, thank you for the patch!

As you may already know, MediaWiki is currently revamping its PHP-based parser into a "Parsoid" prototype component, to support the rich-text Visual Editor project:

https://www.mediawiki.org/wiki/Parsoid
https://www.mediawiki.org/wiki/Visual_editor

Folks interested in enhancing the parser's capabilities are very much welcome to join the Parsoid project, and contribute patches as Git branches:

https://www.mediawiki.org/wiki/Git/Tutorial#How_to_submit_a_patch

Compared to .diff attachments in Bugzilla tickets, Git branches are much easier for us to review, refine and merge features together.

Each change set has a distinct URL generated by the "git review" tool, which can be referenced in Bugzilla by pasting its gerrit.wikimedia.org URL as a comment.

If you run into any issues with the patch process, please feel free to ask on irc.freenode.net #wikimedia-dev and the wikitext-l mailing list. Thank you!

ayg wrote:

The patch is trivial, and doesn't really need review for correctness, but I never committed it because I was unsure if it would fix more pages than it broke. If someone else wants to try, they should feel free.

It's a good point to stop URL autodetection on dashes, but I think it'd be preferible to leave the rule simply as "urls run until whitespace" (current behavior), closing this as WONTFIX.

I agree with Platonides. How would a user make a link to the actual page with an em-dash in the title (not counting "per cent encoding" as a viable alternative since "nobody" knows about that).

For example, the Wikipedia article about en-dash:

https://en.wikipedia.org/wiki/–

Requiring whitespace to be between the url and another part of the sentence seems like a sane requirement to me. And if in the weird exception somebody prefers to have them next to each other without a space (even if that is grammatically correct, I don't know), then one can always use the full syntax:

"I like [http://example.com/path/ http://example.com/path]―do you?"

wicke wrote:

I agree with Platonides- better to have simple and relatively predictable rules, and a simple workaround for the complex cases. Adding exceptions for some characters while leaving out others would make the behavior harder to predict.