If I submit 'https://en.wikipedia.org/wiki/Main_Page' multiple times, I get the same shortened URL each time. Good. But when I submitted 'https://en.wikipedia.org/wiki/Main_page' (lowercase 'p'), a new short URL was generated. Since the URL shortener is implemented as a MediaWiki extension, it seems like it would be trivial to canonicalize URLs prior to deduplication.
|mediawiki/extensions/UrlShortener : master||Attempt to canonicalize URLs before storing them|
|Open||None||T108602 Equivalent URLs are not canonicalized and deduplicated|
|Resolved||Legoktm||T220718 URLs with no slashes after domain name are "invalid" but are still shortened|
|Open||None||T172976 UrlShorterner should normalize underscore and %20 for MediaWiki links|
I'm removing this as a deployment blocker because MediaWiki itself now redirects https://en.wikipedia.org/w/index.php?title=Main_Page to the canonical https://en.wikipedia.org/wiki/Main_Page form, meaning that a user would manually have to construct that URL to shorten it.
For the record, the information in T108602#2112942 no longer appears to be accurate, although the extension does now normalize such URLs as a result of the commit.
However, deduplicating in these cases might still be beneficial:
- URL not followed by trailing slash (e.g. https://en.wikipedia.org/ and https://en.wikipedia.org; see T108565)
- URL can be upgraded to HTTPS from HTTP
- URL contains %20 (see T172976)
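To make the three cases above concrete, here is a rough sketch of what a pre-deduplication normalization step could look like. It is written in Python purely for illustration and is not the extension's actual code. The function name `canonicalize` is hypothetical, and the sketch assumes that the target host serves HTTPS and that `%20` and `_` are interchangeable only in paths under `/wiki/`.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Hypothetical normalization applied before looking up an existing short URL."""
    parts = urlsplit(url)

    # Case 1: upgrade HTTP to HTTPS (assumes the host is known to serve HTTPS).
    scheme = "https" if parts.scheme == "http" else parts.scheme

    # Case 2: treat a bare domain and a domain with a trailing slash as the same URL.
    path = parts.path or "/"

    # Case 3: in MediaWiki article paths, an encoded space and an underscore
    # refer to the same title (assumption: only applied under /wiki/).
    if path.startswith("/wiki/"):
        path = path.replace("%20", "_")

    return urlunsplit((scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    for candidate in (
        "https://en.wikipedia.org",
        "http://en.wikipedia.org/",
        "https://en.wikipedia.org/wiki/Main%20Page",
    ):
        print(candidate, "->", canonicalize(candidate))
```

With something like this in place, the deduplication lookup would be keyed on the canonicalized string rather than on the URL as submitted, so each of the variants above would map to a single short URL.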
Deduplicating the following cases could also be somewhat useful, although the benefit may not justify giving them priority:
- URL is a special page which always redirects to another page on the same wiki, like Special:Diff/*
- URL is a secure.wikimedia.org redirect