Equivalent URLs are not canonicalized and deduplicated
Closed, Resolved (Public)

Description

If I submit 'https://en.wikipedia.org/wiki/Main_Page' multiple times, I get the same shortened URL each time. Good. But when I submitted 'https://en.wikipedia.org/wiki/Main_page' (lowercase 'p'), a novel short URL was generated. Since the URL shortener is implemented as a MediaWiki extension, it seems like it would be trivial to canonicalize URLs prior to deduplication.

Event Timeline

ori raised the priority of this task to Medium.
ori updated the task description.
ori added subscribers: Ricordisamoa, tstarling, Prtksxna and 24 others.

"Main page" is a redirect to "Main Page" (https://en.wikipedia.org/wiki/Main_page?redirect=no) and I don't think we should be following redirects since they can change.

Change 231748 had a related patch set uploaded (by Legoktm):
[WIP] Attempt to canonicalize URLs before storing them

https://gerrit.wikimedia.org/r/231748

kaldari removed a subscriber: kaldari.

I'm removing this as a deployment blocker because MediaWiki itself now redirects https://en.wikipedia.org/w/index.php?title=Main_Page to the canonical https://en.wikipedia.org/wiki/Main_Page form, meaning that a user would have to construct that URL manually in order to shorten it.
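
To illustrate the index.php-to-/wiki/ mapping described above, here is a minimal Python sketch. It is not the extension's code: the function name `to_article_path` is hypothetical, and the sketch assumes that only a URL whose sole query parameter is `title` is safe to rewrite.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, quote


def to_article_path(url: str) -> str:
    """Rewrite /w/index.php?title=X to /wiki/X (hypothetical helper).

    Assumption: the rewrite only applies when `title` is the single,
    non-empty query parameter; anything else is returned unchanged.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    is_plain_view = (
        parts.path == "/w/index.php"
        and len(params) == 1
        and params[0][0] == "title"
        and params[0][1] != ""
    )
    if not is_plain_view:
        return url
    # parse_qsl has already percent-decoded the title; spaces become
    # underscores, as in MediaWiki page names.
    title = params[0][1].replace(" ", "_")
    path = "/wiki/" + quote(title, safe=":/()_,")
    return urlunsplit((parts.scheme, parts.netloc, path, "", parts.fragment))
```

URLs that carry extra parameters (for example `action=history`) are deliberately left alone, since they do not point at the plain article view.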

Change 231748 merged by jenkins-bot:
Attempt to canonicalize URLs before storing them

https://gerrit.wikimedia.org/r/231748

For the record, the information in T108602#2112942 no longer appears to be accurate, although the extension does now normalize such URLs as a result of the commit.

However, deduplicating in these cases might still be beneficial:

  • URL not followed by trailing slash (e.g. https://en.wikipedia.org/ and https://en.wikipedia.org; see T108565)
  • URL can be upgraded to HTTPS from HTTP
  • URL contains %20 (see T172976)
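
The three cases above could be folded into a single deduplication key, roughly as in the Python sketch below. This is an illustration only, not what the extension does: the name `dedup_key` is hypothetical, and it assumes the target host serves HTTPS and that `%20` and `_` are interchangeable in the path (as they are in MediaWiki page titles).

```python
from urllib.parse import urlsplit, urlunsplit


def dedup_key(url: str) -> str:
    """Return a normalized form of `url` for duplicate lookup (hypothetical)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Assumption: every allowed host is reachable over HTTPS.
    if scheme == "http":
        scheme = "https"
    # Host names are case-insensitive.
    host = parts.netloc.lower()
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    # Assumption: an encoded space is equivalent to an underscore in the path.
    path = path.replace("%20", "_")
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))
```

Two submitted URLs would then count as duplicates whenever their keys match, e.g. `dedup_key("https://en.wikipedia.org") == dedup_key("https://en.wikipedia.org/")`.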

Deduplicating the following cases could also be somewhat useful, although the benefit may not justify giving them priority:

  • URL is a special page which always redirects to another page on the same wiki, like Special:Diff/*
  • URL is a secure.wikimedia.org redirect

We should already be deduplicating HTTPS/HTTP.