
Equivalent URLs are not canonicalized and deduplicated
Open, Normal, Public

Description

If I submit 'https://en.wikipedia.org/wiki/Main_Page' multiple times, I get the same shortened URL each time. Good. But when I submitted 'https://en.wikipedia.org/wiki/Main_page' (lowercase 'p'), a novel short URL was generated. Since the URL shortener is implemented as a MediaWiki extension, it seems like it would be trivial to canonicalize URLs prior to deduplication.

Event Timeline

ori created this task. Aug 10 2015, 5:39 PM
ori raised the priority of this task from to Normal.
ori updated the task description.
ori added subscribers: Ricordisamoa, tstarling, Prtksxna and 24 others.

"Main page" is a redirect to "Main Page" (https://en.wikipedia.org/wiki/Main_page?redirect=no) and I don't think we should be following redirects since they can change.

Change 231748 had a related patch set uploaded (by Legoktm):
[WIP] Attempt to canonicalize URLs before storing them

https://gerrit.wikimedia.org/r/231748

kaldari set Security to None. Aug 18 2015, 7:57 PM
kaldari removed a subscriber: kaldari.
Krinkle removed a subscriber: Krinkle. Dec 1 2015, 5:14 PM
Jorm removed a subscriber: Jorm. Dec 26 2015, 7:30 PM

I'm removing this as a deployment blocker because MediaWiki itself now redirects https://en.wikipedia.org/w/index.php?title=Main_Page to the canonical https://en.wikipedia.org/wiki/Main_Page form, meaning that a user would manually have to construct that URL to shorten it.
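
As a rough illustration of the rewrite described above, here is a minimal Python sketch that maps the index.php?title= form to the /wiki/ form. The function name `to_article_path` and its `script_path` / `article_prefix` parameters are hypothetical, and the percent-encoding is a simplification; this is not the extension's or MediaWiki's actual code.

```python
from urllib.parse import parse_qsl, quote, urlsplit, urlunsplit


def to_article_path(url, script_path="/w/index.php", article_prefix="/wiki/"):
    """Rewrite .../w/index.php?title=X to .../wiki/X (hypothetical helper)."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Only rewrite when "title" is the sole parameter: anything else
    # (action, oldid, ...) changes what the page shows.
    if parts.path != script_path or len(query) != 1:
        return url
    name, title = query[0]
    if name != "title" or not title:
        return url
    # Spaces and underscores are interchangeable in page titles.
    path = article_prefix + quote(title.replace(" ", "_"), safe="/:")
    return urlunsplit((parts.scheme, parts.netloc, path, "", parts.fragment))
```

URLs that carry any additional query parameter are returned unchanged, so a history or diff link would still get its own short URL.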

Restricted Application added a subscriber: TerraCodes. Apr 19 2016, 9:19 PM

Change 231748 merged by jenkins-bot:
Attempt to canonicalize URLs before storing them

https://gerrit.wikimedia.org/r/231748

Jc86035 added a subscriber: Jc86035. Edited Apr 13 2019, 6:03 PM

For the record, the information in T108602#2112942 no longer appears to be accurate, although the extension does now normalize such URLs as a result of the commit.

However, deduplicating in these cases might still be beneficial:

  • URL not followed by trailing slash (e.g. https://en.wikipedia.org/ and https://en.wikipedia.org; see T108565)
  • URL can be upgraded to HTTPS from HTTP
  • URL contains %20 (see T172976)
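
The three cases listed above could be folded into a single deduplication key along these lines. This is only a sketch: the function name `normalize_for_dedup` is hypothetical, it assumes every shortenable host serves HTTPS, and it assumes that %20 in the path stands for a space in a page title (which MediaWiki treats the same as an underscore).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_for_dedup(url):
    """Return a key under which equivalent URLs compare equal (illustrative)."""
    parts = urlsplit(url)
    # Assumption: the host is reachable over HTTPS, so http:// can be upgraded.
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    # Host names are case-insensitive.
    netloc = parts.netloc.lower()
    # An empty path and "/" address the same resource.
    path = parts.path or "/"
    # Assumption: an encoded space in the path names the same title as "_".
    path = path.replace("%20", "_")
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```

With this key, https://en.wikipedia.org and https://en.wikipedia.org/ map to the same string, as do the HTTP and HTTPS forms of a URL.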

Deduplicating some of these could also be somewhat useful, although giving them priority may be harder to justify:

  • URL is a special page which always redirects to another page on the same wiki, like Special:Diff/*
  • URL is a secure.wikimedia.org redirect

We should already be deduplicating HTTPS/HTTP.