If I submit 'https://en.wikipedia.org/wiki/Main_Page' multiple times, I get the same shortened URL each time. Good. But when I submitted 'https://en.wikipedia.org/wiki/Main_page' (lowercase 'p'), a new short URL was generated. Since the URL shortener is implemented as a MediaWiki extension, it seems like it would be trivial to canonicalize URLs prior to deduplication.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Attempt to canonicalize URLs before storing them | mediawiki/extensions/UrlShortener | master | +59 -4 |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T108602 Equivalent URLs are not canonicalized and deduplicated |
| Resolved | | Legoktm | T220718 URLs with no slashes after domain name are "invalid" but are still shortened |
| Open | | None | T172976 UrlShorterner should normalize underscore and %20 for MediaWiki links |
Event Timeline
"Main page" is a redirect to "Main Page" (https://en.wikipedia.org/wiki/Main_page?redirect=no) and I don't think we should be following redirects since they can change.
Ah, right. But https://en.wikipedia.org/w/index.php?title=Main_Page and https://en.wikipedia.org/wiki/Main_Page could be de-duplicated, no?
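To illustrate the de-duplication suggested here, a minimal Python sketch follows. It is not the extension's code (the extension is PHP); the function name `to_article_path` is hypothetical, and the `/w/index.php` script path and `/wiki/` article path are assumed from the two example URLs above. It only rewrites URLs whose sole query parameter is `title`, and it does not follow redirects, so 'Main_page' and 'Main_Page' stay distinct.

```python
from urllib.parse import parse_qsl, quote, urlsplit, urlunsplit


def to_article_path(url: str) -> str:
    """Rewrite /w/index.php?title=X to /wiki/X (hypothetical helper)."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)

    # Only rewrite when "title" is the single query parameter; anything
    # else (action=history, oldid=..., etc.) points at a different view.
    if parts.path == "/w/index.php" and len(query) == 1 and query[0][0] == "title":
        title = query[0][1]
        # Assumed article path; keep "/" and ":" readable in the title.
        path = "/wiki/" + quote(title, safe="/:")
        return urlunsplit((parts.scheme, parts.netloc, path, "", parts.fragment))

    # Leave every other URL untouched.
    return url
```

With this, both forms of the Main Page URL map to the same string, which could then be used as the deduplication key.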
Change 231748 had a related patch set uploaded (by Legoktm):
[WIP] Attempt to canonicalize URLs before storing them
I'm removing this as a deployment blocker because MediaWiki itself now redirects https://en.wikipedia.org/w/index.php?title=Main_Page to the canonical https://en.wikipedia.org/wiki/Main_Page form, meaning that a user would have to manually construct that URL to shorten it.
Change 231748 merged by jenkins-bot:
Attempt to canonicalize URLs before storing them
For the record, the information in T108602#2112942 no longer appears to be accurate, although the extension does now normalize such URLs as a result of the commit.
However, deduplicating in these cases might still be beneficial:
- URL not followed by trailing slash (e.g. https://en.wikipedia.org/ and https://en.wikipedia.org; see T108565)
- URL can be upgraded to HTTPS from HTTP
- URL contains %20 (see T172976)
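
The three cases listed above could be handled roughly as in the following Python sketch. It is an illustration only, not what the extension does: `normalize_for_dedup` is a hypothetical name, the HTTPS upgrade assumes the target host is known to serve HTTPS, and the `%20` replacement assumes the `/wiki/` article path, where MediaWiki treats spaces and underscores in titles as equivalent.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_for_dedup(url: str) -> str:
    """Return a normalized form of url for deduplication (hypothetical helper)."""
    parts = urlsplit(url)

    # Upgrade HTTP to HTTPS (assumes the host supports HTTPS).
    scheme = "https" if parts.scheme == "http" else parts.scheme

    # Treat "https://host" and "https://host/" as the same URL.
    path = parts.path or "/"

    # In article paths, an encoded space is equivalent to an underscore.
    if path.startswith("/wiki/"):
        path = path.replace("%20", "_")

    return urlunsplit((scheme, parts.netloc, path, parts.query, parts.fragment))
```

For example, `http://en.wikipedia.org` and `https://en.wikipedia.org/` would both normalize to `https://en.wikipedia.org/`.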
Deduplicating the following cases could also be somewhat useful, although the benefit may not justify giving them priority:
- URL is a special page which always redirects to another page on the same wiki, like Special:Diff/*
- URL is a secure.wikimedia.org redirect