
Wikidata's SPARQL endpoint doesn't escape commas in IRIs in CSV output
Closed, Resolved (Public)


Wikidata's SPARQL endpoint doesn't escape commas in IRIs in CSV output, causing the produced CSV to be syntactically invalid. For example, the issue can be replicated using the following request:

curl \
  -H "Accept:text/csv" \
  --data-urlencode "query=SELECT (<,b> AS ?result) WHERE {}" \

The request produces the following CSV results:


This CSV fails to parse correctly, since the second row is interpreted as two columns. Correctly escaped, the results should look like this:


Commas in IRIs typically appear in those linking other Wikimedia sites, such as <,_Yvelines>.
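As a rough illustration of the parsing problem described above, the following Python sketch compares an unquoted row with one quoted according to the usual CSV convention (fields containing commas are wrapped in double quotes). The IRI `http://example.org/a,b` is a hypothetical placeholder, not the one from the original request, and the code uses Python's standard `csv` module rather than anything from the endpoint itself.

```python
import csv
import io

# Hypothetical IRI containing a comma (placeholder, not from the report).
iri = "http://example.org/a,b"
rows = [["result"], [iri]]

# Naive serialisation: join fields with commas and apply no quoting.
# This mimics the behaviour described in the report.
broken = "\r\n".join(",".join(row) for row in rows) + "\r\n"

# Serialisation with minimal quoting: only fields that contain a
# delimiter, quote character, or line break are wrapped in quotes.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n").writerows(rows)
escaped = buf.getvalue()

parsed_broken = list(csv.reader(io.StringIO(broken)))
parsed_escaped = list(csv.reader(io.StringIO(escaped)))

print(parsed_broken[1])   # the IRI is split into two columns
print(parsed_escaped[1])  # the IRI survives as a single column
```

With the quoted form, a CSV reader recovers the IRI as one field, whereas the unquoted form yields two fields on the data row.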

This might be an upstream issue in the Blazegraph RDF store backing Wikidata's SPARQL endpoint.

Event Timeline

This seems to be an issue in the Sesame library: it quotes commas when they are part of a literal value, but not when they are part of a URI. It was fixed here: - we may need to upgrade our Sesame version.

This is fixed in Sesame 2.8.0. Unfortunately, Blazegraph does not build cleanly with it, so we will have to look into how to upgrade.

Smalyshev triaged this task as Medium priority. Dec 13 2018, 12:23 AM
Smalyshev moved this task from Next to In review on the User-Smalyshev board.

The fix is already in newer Sesame releases, so upgrading Sesame to 2.8.11 should solve the issue.

Getting this in the CI run:

18:59:10 SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
18:59:10 SLF4J: Defaulting to no-operation (NOP) logger implementation
18:59:10 SLF4J: See for further details.
18:59:10 log4j:WARN No appenders could be found for logger (com.bigdata.rdf.ServiceProviderHook).
18:59:10 log4j:WARN Please initialize the log4j system properly.
18:59:10 log4j:WARN See for more info.
18:59:10 Failed tests: 
18:59:10   TestEncodeDecodeValue.test_encodeDecode_Literal:89->doTest:190 expected:<"abc"[]> but was:<"abc"[^^<>]>
18:59:10   TestEncodeDecodeValue.test_encodeDecode_Literal_escapeCodeSequence:170->doTest:190 expected:<"ab"c"[]> but was:<"ab"c"[^^<>]>
18:59:10   TestEncodeDecodeValue.test_encodeDecode_Literal_languageCode:128->doTest:190 expected:<"abc"@en[]> but was:<"abc"@en[^^<>]>
18:59:10   TestEncodeDecodeValue.test_encodeDecode_Literal_singleQuotes:123->doTest:190 expected:<"'ab'c'"[]> but was:<"'ab'c'"[^^<>]>

Not sure whether the addition of <> is legitimate; need to check.

This seems to be a change in Sesame: previously, LiteralImpl(String) produced null as the datatype; now it produces XMLSchema.STRING. It was changed in this commit:

Change 502311 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/blazegraph@master] Upgrade to Sesame 2.8.11

It turns out that Sesame 2.8 differs substantially from 2.7: in RDF 1.1/SPARQL 1.1 there are no "simple literals" anymore, i.e. the literal "abc" is the same as the literal "abc"^^xsd:string, and the only remaining kind of plain literal is the language-tagged one, e.g. "abc"@en. Updating for this may be a bit tricky.
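The difference between the two literal models can be sketched with a toy data class. This is only an illustration of the RDF 1.0 versus RDF 1.1 semantics described above, not Sesame's LiteralImpl; the `Literal` class and both factory functions are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


@dataclass(frozen=True)
class Literal:
    """Toy literal: equality compares label, datatype, and language."""
    label: str
    datatype: Optional[str] = None
    language: Optional[str] = None


def make_literal_rdf10(label: str, language: Optional[str] = None) -> Literal:
    # RDF 1.0 style: a plain literal carries no datatype at all.
    return Literal(label, None, language)


def make_literal_rdf11(label: str, language: Optional[str] = None) -> Literal:
    # RDF 1.1 style: every literal has a datatype. Language-tagged
    # strings use rdf:langString; everything else defaults to xsd:string.
    if language is not None:
        return Literal(label, RDF_LANGSTRING, language)
    return Literal(label, XSD_STRING, None)


typed = Literal("abc", XSD_STRING)

# Under the RDF 1.0 model the simple literal differs from the typed one.
print(make_literal_rdf10("abc") == typed)
# Under the RDF 1.1 model they are the same term.
print(make_literal_rdf11("abc") == typed)
```

Code that round-trips literals through a serialised form and compares the text, as the failing tests above do, sees the implicit datatype appear once the library adopts the RDF 1.1 model.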

Change 502311 merged by Smalyshev:
[wikidata/query/blazegraph@master] Upgrade to Sesame 2.8.11

@Smalyshev Thank you for this fix! Is there a way to know if/when this is deployed to Regards