Be more strict when validating URLs
Closed, Resolved (Public)

Description

Validating URL values currently allows invalid URLs to pass, e.g.:

We should check at least that:

  • there is no white space in the URL
  • for http/https, that the ":" is followed by "//" and a non-empty host name.
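The two checks above could be sketched roughly as follows. This is only an illustration in Python, not the actual implementation; the function name `passes_basic_url_checks` is made up for this example, and the host extraction is deliberately simplistic (it does not handle bracketed IPv6 literals properly).

```python
import re


def passes_basic_url_checks(url: str) -> bool:
    """Hypothetical helper applying the two checks proposed above."""
    # Check 1: no white space anywhere in the URL.
    if any(ch.isspace() for ch in url):
        return False

    scheme, colon, rest = url.partition(":")
    if not colon or not scheme:
        return False  # no scheme at all, e.g. "example.com"

    if scheme.lower() in ("http", "https"):
        # Check 2: the ":" must be followed by "//" ...
        if not rest.startswith("//"):
            return False
        # ... and a non-empty host name. The authority ends at the first
        # "/", "?" or "#"; drop optional userinfo and port to get the host.
        authority = re.split(r"[/?#]", rest[2:], maxsplit=1)[0]
        host = authority.rpartition("@")[2].partition(":")[0]
        if not host:
            return False

    # Other schemes are only subject to the white space check in this sketch.
    return True
```

With this sketch, `http://example.com/path` passes, while `http://exa mple.com`, `http:example.com` and `https:///path` are rejected.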

Version: unspecified
Severity: normal

Details

Reference
bz52325

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:07 AM
bzimport set Reference to bz52325.
bzimport added a subscriber: Unknown Object (MLST).

Quick summary of a discussion with Denny:

  • definitely more checks for http/https
  • maybe have a setting for allowed protocols, separate from the protocols supported in wikitext.

Implementation idea: create a UrlValidator class that dispatches to a validator for each known protocol (and optionally to a special default validator for unknown protocols).
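To make the dispatch idea concrete, here is a minimal sketch in Python. Only the name `UrlValidator` comes from the comment above; the constructor signature, the `validate_http` helper and the choice to reject unknown schemes when no default is configured are assumptions made for illustration, not the code from the eventual patch.

```python
import re
from typing import Callable, Dict, Optional

# A per-scheme validator takes the full URL and says whether it is acceptable.
SchemeValidator = Callable[[str], bool]


def validate_http(url: str) -> bool:
    """Hypothetical http/https validator: no white space, "//" and a host."""
    if any(ch.isspace() for ch in url):
        return False
    _, _, rest = url.partition(":")
    if not rest.startswith("//"):
        return False
    authority = re.split(r"[/?#]", rest[2:], maxsplit=1)[0]
    return authority != ""


class UrlValidator:
    """Dispatches to one validator per known scheme (sketch only)."""

    def __init__(
        self,
        validators: Dict[str, SchemeValidator],
        default: Optional[SchemeValidator] = None,
    ) -> None:
        # Scheme names are case-insensitive, so normalize the keys once.
        self._validators = {s.lower(): v for s, v in validators.items()}
        self._default = default

    def validate(self, url: str) -> bool:
        scheme, colon, _ = url.partition(":")
        if not colon or not scheme:
            return False
        validator = self._validators.get(scheme.lower(), self._default)
        if validator is None:
            # Assumption: unknown scheme and no default validator means reject.
            return False
        return validator(url)


# The mapping doubles as the list of allowed protocols.
strict = UrlValidator({"http": validate_http, "https": validate_http})
```

In this shape, the mapping passed to the constructor is effectively the "allowed protocols" setting, and passing a `default` validator opts in to accepting schemes that have no dedicated check.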

Probably having a separate setting sounds better. If we have both, we would take the intersection of both, I guess, which would be extremely confusing in some cases. So we should have them as two different settings.

And I suggest starting with a small list, just http(s) for now, but that's a deployment question :)

"there is no white space in the URL"

You mean, you want to force people to paste encoded URLs into the field (note that I believe browsers these days feed an unencoded URL to the copy-paste board for readability of URLs)?

Change 77183 had a related patch set uploaded by Daniel Kinzler:
(bug 52325) validators for url schemes.

https://gerrit.wikimedia.org/r/77183

(In reply to comment #3)

"there is no white space in the URL"

Note that in wikitext, this is also true. It's actually the assumption that led to the syntax for external links as we use it now.

You mean, you want to force people to paste encoded URLs into the field

I would like them to post a *valid* URL (or perhaps IRI), yes.

(note that I believe browsers these days feed an unencoded URL to the copy-paste board for readability of URLs).

I can't even get Firefox to show a URL with a space in it; it always gets converted to '+' right away. And Firefox will *show* https://ru.wikipedia.org/wiki/Вашингтон,_Джордж, but if you copy and paste it, you get https://ru.wikipedia.org/wiki/%D0%92%D0%B0%D1%88%D0%B8%D0%BD%D0%B3%D1%82%D0%BE%D0%BD,_%D0%94%D0%B6%D0%BE%D1%80%D0%B4%D0%B6. So your assumption is wrong, at least for Firefox.

Now, I'd accept full Unicode IRIs, but no spaces. Not sure yet if we should convert to true URL syntax internally, with encoded non-ASCII characters. For now, we'll just save the URL as it comes in if it's valid.
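For reference, converting an IRI to "true URL syntax" would look roughly like the Python sketch below, should that ever be done. The function name `iri_to_url` is hypothetical. It percent-encodes each non-ASCII character as its UTF-8 bytes and leaves ASCII untouched, so it assumes the host name is already ASCII (an internationalized host would need separate handling).

```python
from urllib.parse import quote


def iri_to_url(iri: str) -> str:
    """Percent-encode non-ASCII characters; leave all ASCII as it is (sketch)."""
    # quote() encodes a non-ASCII character as its UTF-8 bytes in %XX form.
    return "".join(ch if ord(ch) < 128 else quote(ch) for ch in iri)


encoded = iri_to_url("https://ru.wikipedia.org/wiki/Вашингтон,_Джордж")
```

For the Wikipedia example above, this yields the same encoded form that Firefox puts on the clipboard.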

Change 77183 merged by jenkins-bot:
(bug 52325) validators for url schemes.

https://gerrit.wikimedia.org/r/77183