Page MenuHomePhabricator

urlRegex frontend validation is fragile
Closed, ResolvedPublic

Description

The simplified regex that is used in the ui to validate URL inputs is fragile. I amended it in https://gerrit.wikimedia.org/r/c/wikimedia/toolhub/+/670413/12..13/vue/src/plugins/url-regex.js to add : as a valid character in the path segment of the URL, but I can see that the regex is going to consider many other valid urls invalid still.

Replacing this with a validation function like the one presented at https://stackoverflow.com/a/43467144/8171 may be more robust. This would require some additional refactoring as well.

Event Timeline

There seems to be some concerns regarding the stackoverflow answer linked above. that solution seems to accept certain false positives. E.g. we are only really interested in https or http but that solution will allow other protocols, and will accept a host of other weird things provided they are adhere to the rfc 3886 standard

There seems to be some concerns regarding the stackoverflow answer linked above. that solution seems to accept certain false positives. E.g. we are only really interested in https or http but that solution will allow other protocols, and will accept a host of other weird things provided they are adhere to the rfc 3886 standard

I'm not sure that the version I'm looking at has the problems you are concerned with:

function isValidHttpUrl(string) {
  let url;
  
  try {
    url = new URL(string);
  } catch (_) {
    return false;  
  }

  return url.protocol === "http:" || url.protocol === "https:";
}

Note the check of the protocol as part of the full validation.

bd808 changed the task status from Open to In Progress.Jan 6 2022, 8:56 PM
bd808 claimed this task.
bd808 moved this task from Backlog to In Progress on the Toolhub board.

T298725: Toolhub URL field doesn't accept Cyrillic characters makes this more immediately an issue and much less abstract.

I found a couple of tiny issues with the proposed implementation from T277979#7586574:

  • It allows whitespace in the URL's path and query string which the Django regex rejects.
    • This seems simple enough to add an additional check for in the js validation.
  • It allows a variety of values in the URL's host that are rejected by the Django regex.
    • I think our prior regex did as well actually.

Change 752042 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] ui: Replace URL validation regex with native parser

https://gerrit.wikimedia.org/r/752042

Change 752042 merged by jenkins-bot:

[wikimedia/toolhub@main] ui: Replace URL validation regex with native parser

https://gerrit.wikimedia.org/r/752042

Change 770638 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638

Change 770638 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638