
Spaces in URL are replaced by "+" (e.g. in Archive Of Our Own tag (P8419))
Closed, Resolved · Public · 5 Estimated Story Points

Description

As a Wikidata editor, I want to be able to enter a value that contains a space character for an external ID property like Archive Of Our Own tag (P8419) and have the generated URL lead to the entry for that value on the external website.

Problem:
One example: When I follow the link provided by Archive Of Our Own tag (P8419), spaces in the value are replaced by "+" in the resulting URL and the link does not work. (For this particular property this is no longer the case, because on Wikidata the formatter URL has been replaced by a tool that handles this better.)

Example:
https://test.wikidata.org/wiki/Q140422 - the external ID "Harry Potter" contains a space. The space is converted to a + in the URL instead of being left as a space or encoded as %20.
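
To make the difference concrete, here is a minimal Python sketch of the two encodings for the "Harry Potter" value. The formatter URL below is a made-up placeholder (not the real P8419 formatter), and `build_url` is a hypothetical helper; only the `$1` substitution mirrors how formatter URLs are filled in.

```python
from urllib.parse import quote, quote_plus

# Hypothetical formatter URL for illustration; "$1" stands for the external ID.
FORMATTER_URL = "https://example.org/tags/$1"


def build_url(formatter_url: str, encoded_id: str) -> str:
    """Insert an already-encoded external ID into a formatter URL."""
    return formatter_url.replace("$1", encoded_id)


external_id = "Harry Potter"

# Form-style encoding: the space becomes "+" (the behavior reported here).
plus_url = build_url(FORMATTER_URL, quote_plus(external_id))

# Percent-encoding: the space becomes "%20".
percent_url = build_url(FORMATTER_URL, quote(external_id, safe=""))

print(plus_url)     # https://example.org/tags/Harry+Potter
print(percent_url)  # https://example.org/tags/Harry%20Potter
```

Whether the "+" form resolves depends on the target website, which is why the link can break for some properties.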

Acceptance criteria:
The editor does not have to manually replace the space character with %20 in the entered value.

Previous discussion:
See report on Property talk:P8419.

Open question:

  • Do we always want to convert spaces to %20 in external IDs?
    • Yes, unless "other website" needs us to code it differently

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 4 2021, 5:30 PM

I created a reproducible example on test.wikidata.org here: https://test.wikidata.org/wiki/Q140422 (doing this there because the formatter URL on Wikidata is using the external ID URL tool now).

Lydia_Pintscher renamed this task from Spaces in URL are replaced by "+" in Archive Of Our Own tag (P8419) to Spaces in URL are replaced by "+" (e.g. in Archive Of Our Own tag (P8419)). · Jan 19 2021, 11:25 AM
Lydia_Pintscher updated the task description.
darthmon_wmde updated the task description.
darthmon_wmde set the point value for this task to 5.
Maintenance_bot moved this task from incoming to in progress on the Wikidata board. · Wed, Feb 3, 2:15 PM
  • Do we always want to convert spaces to %20 in external IDs?
    • Yes, unless "other website" needs us to code it differently

I wrote a Python script to query a random sample of each external identifier property and count how many of the identifiers had spaces (P14273). It turns out the number of properties with spaces is somewhat larger than I expected (full output at P14274), so I only manually checked those properties with a 100% ratio of identifiers containing spaces:

P9094: no formatter URL
P8832: + ✔, %20 ✔, _ ✘
P8514: weird website behavior
P7549: + ✔, %20 ✔, _ ✘
P6700: no formatter URL
P6164: no formatter URL
P5738: + ✘, %20 ✔, _ ✘
P5667: + ✔, %20 ✔, _ ✘
P5609: + ✔, %20 ✔, _ ✘
P5049: no formatter URL
P4814: + ✔, %20 ✔, _ ✔
P4483: + ✔, %20 ✔, _ ✘
P4245: no formatter URL
P3248: formatter URL deprecated
P2878: no formatter URL
P2590: no formatter URL
P2589: no formatter URL
P1261: no formatter URL
P1161: no formatter URL
P213: + ✔, %20 ✔, _ ✔

In conclusion: %20 works everywhere, + works almost everywhere but not universally, _ is uncommon. So I think it’s probably safe to go ahead with just %20 and not make this configurable, unless anyone finds a concrete example of a website that requires +.
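
The sampling step described at the start of this comment could look roughly like the sketch below. This is not the script from P14273: the function name, the property labels, and the sample values are all invented for illustration, and the actual querying of Wikidata is left out.

```python
def space_ratio(identifiers):
    """Return the fraction of identifiers that contain at least one space."""
    if not identifiers:
        return 0.0
    with_spaces = sum(1 for value in identifiers if " " in value)
    return with_spaces / len(identifiers)


# Made-up sample data keyed by made-up property labels (illustration only).
samples = {
    "property A": ["Harry Potter", "Sherlock Holmes"],
    "property B": ["12345", "678 90", "abc", "def"],
    "property C": ["Q42", "Q64"],
}

ratios = {prop: space_ratio(values) for prop, values in samples.items()}

# Properties where every sampled identifier contains a space; these are
# the ones worth checking by hand against the external website.
always_spaces = [prop for prop, ratio in ratios.items() if ratio == 1.0]

for prop, ratio in ratios.items():
    print(f"{prop}: {ratio:.0%}")
```

Filtering on a ratio of exactly 1.0 keeps the manual check small, at the cost of skipping properties where only some identifiers contain spaces.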

Percent-encoding is a basic part of URLs as specified in RFC 1738, section 2.2 (URL Character Encoding Issues), so it should work everywhere.

Jakob_WMDE added a comment. · Edited · Wed, Feb 17, 10:56 AM

Hello @Lydia_Pintscher!
It looks like we want to use standard URL encoding but leave slashes unencoded. When the slashes task was tackled, the URL encoding was switched to the MediaWiki function wfUrlencode, which is meant to prettify internal (!) title URLs and leaves slashes and some other characters (;:@$!*(),~) unencoded as well. I'm now curious whether we care about these other characters, or whether it's only the slashes that we want unencoded.

I see the following two options:

  1. Keep everything as it is and simply replace "+" with "%20".
  2. Use the built-in PHP URL-encoding function that encodes spaces as "%20", and then turn the encoded slashes back into plain slashes.

The latter would be preferable if we don't care about the other characters, I think. Both are equally easy to implement.

After a quick chat with Jakob we said we'd go with option 2.
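
As a rough illustration of option 2, here is a Python sketch of the same idea. It is an analogue, not the Wikibase code: `encode_external_id` is a hypothetical name, and `urllib.parse.quote` with `safe=""` is used as a stand-in for PHP's `rawurlencode`, on the assumption that the two treat these inputs the same way.

```python
from urllib.parse import quote


def encode_external_id(value: str) -> str:
    """Percent-encode an external ID, keeping slashes readable."""
    # With safe="", everything except letters, digits and "_.-~" is
    # percent-encoded, so a space becomes "%20" (never "+").
    encoded = quote(value, safe="")
    # Turn encoded slashes back into plain slashes, so identifiers that
    # contain path segments keep working.
    return encoded.replace("%2F", "/")


print(encode_external_id("Harry Potter"))  # Harry%20Potter
print(encode_external_id("works/A B"))     # works/A%20B
print(encode_external_id("x:y"))           # x%3Ay
```

In Python the same result can be had in one step with `quote(value, safe="/")`; the two-step form is shown only because it mirrors the wording of option 2. Note that, unlike the previous behavior, characters such as ":" are now encoded.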

Change 664820 had a related patch set uploaded (by Jakob; owner: Jakob):
[mediawiki/extensions/Wikibase@master] Use rawurlencode to encode external IDs in URLs

https://gerrit.wikimedia.org/r/664820

Change 664820 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Use rawurlencode to encode external IDs in URLs

https://gerrit.wikimedia.org/r/664820

amy_rc added a subscriber: amy_rc. · Wed, Feb 24, 11:58 AM

I think this has been deployed too? Is there a website to test this on?

amy_rc closed this task as Resolved. · Thu, Feb 25, 9:30 AM

🎇 \o/