Special Characters (like $ at the end) used in short URL not parsed e.g. in Twitter (should be avoided)
Closed, ResolvedPublic

Description

The shortener creates URLs that cannot be detected correctly in one of its main use cases, which is probably Twitter (or Tweetdeck):

Just created this one:

w.wiki/32$

which produces a clickable part like this:

w.wiki/32

which is not only wrong but leads to a completely different target.

[Screenshot: Bildschirmfoto 2019-04-15 um 18.40.00.png (110×424 px, 19 KB)]

I suggest identifying the critical characters and avoiding them in the short URL. Thanks.

Event Timeline

Aklapper renamed this task from Special Chars used in short URL not parsed e.g. in Twitter (should be avoided) to Special Characters (like $ at the end) used in short URL not parsed e.g. in Twitter (should be avoided). Apr 16 2019, 5:23 AM

Our current architecture doesn't allow us to change the character set being used. It's unfortunate that Twitter doesn't include $ in the link trail.

Do we know if other sites have similar problems? Can we identify the problematic characters to see how large a problem this is, or is it just bad luck that this URL ends with $?
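For a rough sense of scale: if shortcodes are a positional encoding of a numeric ID, the last character depends only on the ID modulo the alphabet size, so about one in every len(alphabet) shortcodes ends in any given symbol. The sketch below assumes a 58-symbol alphabet with $ as the last symbol; the alphabet and the helper names are assumptions for illustration, not taken from this task or from the extension's code.

```python
# ASSUMPTION: a 58-symbol alphabet with "$" as the final symbol.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz$"


def encode_id(n: int, alphabet: str) -> str:
    """Encode a non-negative integer as a base-len(alphabet) string."""
    if n < 0:
        raise ValueError("id must be non-negative")
    base = len(alphabet)
    if n == 0:
        return alphabet[0]
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def share_affected(max_id: int, symbol: str = "$") -> tuple[float, float]:
    """Return (share ending in symbol, share containing symbol) for IDs 1..max_id."""
    codes = [encode_id(n, ALPHABET) for n in range(1, max_id + 1)]
    ending = sum(code.endswith(symbol) for code in codes)
    containing = sum(symbol in code for code in codes)
    return ending / len(codes), containing / len(codes)


if __name__ == "__main__":
    ends, contains = share_affected(10_000)
    print(f"ending in $: {ends:.2%}, containing $: {contains:.2%}")
```

Whether a $ in the middle of a shortcode is also mishandled by link detectors is not established here; the second number only bounds how many entries would count as "exotic" if it were.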

Selecting the above-mentioned w.wiki/32$ and right-clicking it also doesn’t work in Firefox (it doesn’t detect it as a valid URL and doesn’t give the option “Go To URL” in its context menu).

Possible roads to fix:

  1. Create a new database table for "exotic" short URL entries (entries with $).
  2. Disable short URL creation.
  3. Populate the table.
  4. Renumber the entries in the old table.
  5. Switch to the new config to read two tables and use new character set.
  6. Reenable short URL creation.

Note that between steps 4 and 5, a short URL may occasionally redirect to the wrong URL. The following sequence may prevent that completely:

  1. Create a new database table for "exotic" short URL entries (entries with $).
  2. Disable short URL creation.
  3. Populate the table, initially add all short URL entries to the table.
  4. Switch to the new config to read new table only.
  5. Renumber the entries in the old table.
  6. Switch to the new config to read two tables and use new character set.
  7. Reenable short URL creation.
  8. Remove non-exotic entries from the new table.
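One possible reading of the steps above, as an in-memory sketch: entries whose shortcode contains $ move to an "exotic" table keyed by the literal shortcode, and every other entry is renumbered to the ID that its existing shortcode decodes to under the reduced character set, so that existing URLs keep working. The alphabet, the dict-based tables, and all function names are assumptions for illustration; this is not the extension's schema or code.

```python
# ASSUMPTION: old alphabet has 58 symbols with "$" last; new alphabet drops "$".
OLD_ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz$"
NEW_ALPHABET = OLD_ALPHABET.replace("$", "")


def encode_id(n: int, alphabet: str) -> str:
    """Encode a non-negative integer as a base-len(alphabet) string."""
    base = len(alphabet)
    if n == 0:
        return alphabet[0]
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def decode_id(code: str, alphabet: str) -> int:
    """Inverse of encode_id; raises ValueError on a symbol outside the alphabet."""
    n = 0
    for ch in code:
        n = n * len(alphabet) + alphabet.index(ch)
    return n


def migrate(old_rows: dict[int, str]) -> tuple[dict[str, str], dict[int, str]]:
    """Split old rows into (exotic: shortcode -> url, renumbered: new_id -> url)."""
    exotic: dict[str, str] = {}
    renumbered: dict[int, str] = {}
    for old_id, url in old_rows.items():
        code = encode_id(old_id, OLD_ALPHABET)
        if "$" in code:
            exotic[code] = url  # keep the literal shortcode
        else:
            # Same shortcode, reinterpreted with the new character set.
            renumbered[decode_id(code, NEW_ALPHABET)] = url
    return exotic, renumbered


def resolve(code: str, exotic: dict[str, str], renumbered: dict[int, str]):
    """Read both tables: exotic entries first, then the renumbered ones."""
    if code in exotic:
        return exotic[code]
    try:
        return renumbered.get(decode_id(code, NEW_ALPHABET))
    except ValueError:
        return None  # symbol not in the new character set
```

With this reading, every shortcode issued before the migration still resolves to its original target, because the lookup either hits the exotic table or decodes to the renumbered ID.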

One option is to add a usc_shortcode field to the urlshortcodes table, and to index on it. Remove UrlShortenerUtils::decodeId() and select by usc_shortcode instead in all the places where it is called. Then the character set can be reduced without conflict, since the new shortcodes, if interpreted with the old character set, would come from a higher ID range. If the character set is increased in size, or if substitutions are made, conflicts can be avoided by increasing the autoincrement value of usc_id in the database. Making the index on usc_shortcode unique would mean that any errors in the configuration of the character set would only result in query errors, not the wrong URL being delivered.
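A minimal sketch of the stored-shortcode idea, using a dict as a stand-in for a table with a unique index on usc_shortcode. The alphabet, the ShortcodeTable class, and the helper functions are assumptions for illustration. It also checks the claim about the ID range: when the reduced alphabet is the old one with its last symbol removed, a new shortcode read with the old character set decodes to a value at least as large as its real ID, so it cannot match a shortcode already issued for a lower ID.

```python
# ASSUMPTION: old alphabet has 58 symbols with "$" last; new alphabet drops "$".
OLD_ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz$"
NEW_ALPHABET = OLD_ALPHABET[:-1]


def encode_id(n: int, alphabet: str) -> str:
    """Encode a non-negative integer as a base-len(alphabet) string."""
    base = len(alphabet)
    if n == 0:
        return alphabet[0]
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def decode_id(code: str, alphabet: str) -> int:
    """Inverse of encode_id for codes made only of alphabet symbols."""
    n = 0
    for ch in code:
        n = n * len(alphabet) + alphabet.index(ch)
    return n


class ShortcodeTable:
    """Hypothetical stand-in for urlshortcodes with a unique usc_shortcode index."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def insert(self, shortcode: str, url: str) -> None:
        # A unique index turns a misconfigured alphabet into an error
        # instead of silently serving the wrong URL.
        if shortcode in self._rows:
            raise KeyError(f"duplicate shortcode: {shortcode}")
        self._rows[shortcode] = url

    def lookup(self, shortcode: str):
        # Select by the stored shortcode; no decoding step is needed.
        return self._rows.get(shortcode)


def build_table(max_old_id: int, max_new_id: int) -> ShortcodeTable:
    """Old IDs use the old alphabet; IDs created afterwards use the new one."""
    table = ShortcodeTable()
    for n in range(1, max_old_id + 1):
        table.insert(encode_id(n, OLD_ALPHABET), f"https://example.org/{n}")
    for n in range(max_old_id + 1, max_new_id + 1):
        table.insert(encode_id(n, NEW_ALPHABET), f"https://example.org/{n}")
    return table
```

If the new alphabet were larger or used substituted symbols, this ordering argument would no longer hold, which is where raising the autoincrement value of usc_id comes in.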

Change 530645 had a related patch set uploaded (by Simon04; owner: Simon04):
[mediawiki/extensions/UrlShortener@master] Support changing a symbol in the alphabet

https://gerrit.wikimedia.org/r/530645

Change 530645 merged by jenkins-bot:
[mediawiki/extensions/UrlShortener@master] Support symbol aliases in the alphabet

https://gerrit.wikimedia.org/r/530645
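The merged change is titled "Support symbol aliases in the alphabet". The sketch below shows one way such aliases could work and is not the extension's actual code: the alphabet, the choice of _ as the replacement symbol, and the function names are all assumptions. The idea is that new shortcodes are emitted with the replacement symbol, while the old symbol is still accepted when decoding, so existing links keep their targets.

```python
# ASSUMPTION: the alphabet now ends in "_" where it used to end in "$".
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz_"
# ASSUMPTION: old symbol -> symbol that now occupies its position.
ALIASES = {"$": "_"}


def encode_id(n: int) -> str:
    """Encode a non-negative integer using the current alphabet only."""
    base = len(ALPHABET)
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_id(code: str) -> int:
    """Decode a shortcode, mapping aliased symbols to their current form first."""
    canonical = "".join(ALIASES.get(ch, ch) for ch in code)
    n = 0
    for ch in canonical:
        n = n * len(ALPHABET) + ALPHABET.index(ch)  # ValueError on unknown symbol
    return n


if __name__ == "__main__":
    # Both spellings decode to the same ID; only the new one is ever emitted.
    print(decode_id("32$") == decode_id("32_"))
    print(encode_id(decode_id("32$")))
```

Because the alias occupies the same position as the symbol it replaces, no IDs need to be renumbered under this approach.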

Ammarpad assigned this task to simon04.