
Commons filenames with a + get URL-encoded again as %2B in wikidata sparql query service
Open, Medium · Public · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:
For context: GeoJson files are stored in Wiki Commons with Data:...map in their filename. Wikidata items with geoshapes will refer to the filename in Wiki Commons.

When I use the query service to retrieve GeoJson geoshapes (P3896), the query service has an issue specifically with filenames with whitespace (it replaces them with "+"), and when I click on its link, I get an error message "Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes".

I also posted this issue here https://www.mediawiki.org/wiki/Talk:Wikidata_Query_Service#Problem_with_Commons_file_names_with_white_space

What should have happened instead?:
The query service should reproduce the correct filename in wiki Commons and then be able to redirect me to that GeoJson file. The query service should also be able to pull that GeoJson file and represent it on its map graphic. It does not have this problem for simple filenames (e.g. "Data:Antartica.map") but only for compound names (e.g. "Data:Puerto Rico.map")


Event Timeline


The problem is that the + gets encoded again. It becomes %2B

Aklapper renamed this task from "Problem with white space for wiki commons filenames in wikidata sparql query service" to "Commons filenames with a + get URL-encoded again as %2B in wikidata sparql query service". Aug 29 2021, 3:29 PM

Just to be clear, the filenames themselves do not contain a +. The query service replaces the whitespace in Commons filenames with +, which is then encoded again as %2B.

For example, the following title contains whitespace: https://commons.wikimedia.org/wiki/Data:India/Tamil Nadu/Theni.map

In the browser it is rendered as follows, and this page exists on Commons:

https://commons.wikimedia.org/wiki/Data:India/Tamil_Nadu/Theni.map

but in the query service, it will be rendered as:

https://commons.wikimedia.org/wiki/Data:India/Tamil+Nadu/Theni.map

which does not exist

If you click on the link directly it will then be reencoded as https://commons.wikimedia.org/wiki/Data:India/Tamil%2BNadu/Theni.map

which leads to an error message
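
The double encoding described above can be reproduced with Python's standard `urllib.parse`. This is only an illustration of the general encoding behaviour, under the assumption that the space is first form-encoded and the result is then percent-encoded as a URL path; it is not the code that Wikibase or the browser actually runs.

```python
from urllib.parse import quote, quote_plus

title = "Data:India/Tamil Nadu/Theni.map"

# Form-style encoding turns the space into "+",
# which matches what the query service link shows.
form_encoded = quote_plus(title, safe=":/")

# Percent-encoding that string again as a URL path
# turns the literal "+" into "%2B".
re_encoded = quote(form_encoded, safe=":/")

# MediaWiki page URLs use "_" for spaces instead.
mediawiki_style = title.replace(" ", "_")

print(form_encoded)
print(re_encoded)
print(mediawiki_style)
```

In a URL path, "+" is an ordinary character rather than a space, which is why the link with "+" points at a title that does not exist.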

It's not the query service GUI that adds the +.
When you use the bigdata interface, the + gets added as well.

When using the api you see spaces for P3896.

Gehel triaged this task as Medium priority. Aug 30 2021, 3:22 PM
Gehel moved this task from Incoming to Small Tasks on the Wikidata-Query-Service board.

I think there are two unrelated issues here, and neither of them is directly about the query service. The first is that the Wikibase RDF export encodes spaces as pluses in geoshape values (and probably also tabular data, if I had to guess):

$ curl -s 'https://www.wikidata.org/wiki/Special:EntityData/Q550.json' | jq . | grep -F .map
                "value": "Data:Avenue des Champs-Élysées.map",
$ curl -s 'https://www.wikidata.org/wiki/Special:EntityData/Q550.ttl' | grep -F .map
        wdt:P3896 <http://commons.wikimedia.org/data/main/Data:Avenue+des+Champs-%C3%89lys%C3%A9es.map> ;
        ps:P3896 <http://commons.wikimedia.org/data/main/Data:Avenue+des+Champs-%C3%89lys%C3%A9es.map> .

The query service just faithfully represents those pluses (as far as I can tell); that’s not Blazegraph’s fault. We should probably fix that in Wikibase.
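
Until the export is changed, a consumer of the query results could decode the value on its own side. The following is a minimal sketch, not an official workaround: the helper name `commons_page_url` and the two prefix constants are hypothetical, with the RDF prefix copied from the Turtle output above, and it assumes that a "+" in the RDF URI always stands for a space.

```python
from urllib.parse import quote, unquote_plus

# Prefix as it appears in the Turtle output above (assumed to be stable).
RDF_PREFIX = "http://commons.wikimedia.org/data/main/"
# Usual Commons page URL prefix.
WIKI_PREFIX = "https://commons.wikimedia.org/wiki/"


def commons_page_url(rdf_uri: str) -> str:
    """Hypothetical helper: map a geoshape RDF URI to a Commons page URL."""
    if not rdf_uri.startswith(RDF_PREFIX):
        raise ValueError(f"unexpected geoshape URI: {rdf_uri}")
    encoded_title = rdf_uri[len(RDF_PREFIX):]
    # Treat "+" as a space and undo the percent-encoding.
    title = unquote_plus(encoded_title)
    # MediaWiki page URLs use underscores for spaces.
    return WIKI_PREFIX + quote(title.replace(" ", "_"), safe=":/")


uri = ("http://commons.wikimedia.org/data/main/"
       "Data:Avenue+des+Champs-%C3%89lys%C3%A9es.map")
print(commons_page_url(uri))
```

Because `unquote_plus` replaces "+" before it decodes percent escapes, a title that really contains a plus sign would survive only if the export wrote it as `%2B`; whether it does is not confirmed here.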

And the second issue is that action=raw returns something that looks like an internal server error for any missing title. That doesn’t have to be those pluses: https://commons.wikimedia.org/w/index.php?title=missing&action=raw

Screenshot 2021-09-13 at 11-41-11 Wikimedia Error.png (521×640 px, 40 KB)

The “error” is apparently “404, Not Found”, so I’m guessing the layer above MediaWiki (Varnish?) just isn’t expecting a 404 status code here.