
Wikidata's SPARQL endpoint doesn't escape commas in IRIs in CSV output
Closed, Resolved (Public)

Description

Wikidata's SPARQL endpoint doesn't escape commas in IRIs in CSV output, causing the produced CSV to be syntactically invalid. For example, the issue can be replicated using the following request:

curl \
  -H "Accept:text/csv" \
  --data-urlencode "query=SELECT (<https://example.com/a,b> AS ?result) WHERE {}" \
  https://query.wikidata.org/sparql

The request produces the following CSV results:

result
https://example.com/a,b

This CSV fails to parse correctly, since the second row is interpreted as two columns. Correctly escaped, the results should look like this:

result
"https://example.com/a,b"

Commas in IRIs typically appear in those linking other Wikimedia sites, such as <https://en.wikipedia.org/wiki/Versailles,_Yvelines>.
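
To make the parsing problem concrete, here is a minimal sketch using Python's standard csv module as a stand-in for any RFC 4180-style CSV consumer. The variable names are illustrative only, and the endpoint itself is not contacted; the "broken" text simply mimics the unquoted output shown above.

```python
import csv
import io

# An IRI of the kind mentioned above, containing a comma.
iri = "https://en.wikipedia.org/wiki/Versailles,_Yvelines"

# What the endpoint emitted at the time: the IRI is written without quoting.
broken_csv = "result\n" + iri + "\n"
broken_rows = list(csv.reader(io.StringIO(broken_csv)))

# What a CSV writer with minimal quoting produces for the same value.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(["result"])
writer.writerow([iri])
fixed_csv = buf.getvalue()
fixed_rows = list(csv.reader(io.StringIO(fixed_csv)))

print(broken_rows[1])  # the data row is split into two fields at the comma
print(fixed_rows[1])   # the data row round-trips as a single field
```

With the unquoted form, the data row no longer matches the single-column header; with the quoted form, it parses back to the original IRI.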

This might be an upstream issue in the Blazegraph RDF store backing Wikidata's SPARQL endpoint.

Event Timeline

This seems to be an issue in the Sesame library: it quotes commas when they are part of a literal value, but not when they are part of a URI. It was fixed here: https://bitbucket.org/openrdf/sesame/commits/66b503ece10194b3955af6e4cba75142c0733951#chg-core/queryresultio/text/src/main/java/org/openrdf/query/resultio/text/csv/SPARQLResultsCSVWriter.java - we may need to upgrade our Sesame version.
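
The behaviour described in this comment can be sketched as follows. This is an illustration in Python, not Sesame's actual code: the function names (needs_quoting, quote, format_before_fix, format_after_fix) are hypothetical, and the quoting rule is the usual CSV one (quote a field that contains a comma, double quote, CR or LF, and double any embedded quotes).

```python
def needs_quoting(text):
    """True if the field contains a character that is special in CSV."""
    return any(ch in text for ch in ',"\r\n')


def quote(text):
    """Wrap the field in double quotes when needed, doubling embedded quotes."""
    if needs_quoting(text):
        return '"' + text.replace('"', '""') + '"'
    return text


def format_before_fix(value, is_iri):
    """Hypothetical model of the reported behaviour: only literals are quoted."""
    return value if is_iri else quote(value)


def format_after_fix(value, is_iri):
    """Hypothetical model of the fixed behaviour: IRIs are quoted as well."""
    return quote(value)


print(format_before_fix("a,b", is_iri=False))
print(format_before_fix("https://example.com/a,b", is_iri=True))
print(format_after_fix("https://example.com/a,b", is_iri=True))
```

The first and third calls yield quoted fields; the second shows the unquoted IRI that breaks the row.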

This is fixed in Sesame 2.8.0; unfortunately, Blazegraph does not build cleanly with it... Will have to look into how to upgrade.

Smalyshev triaged this task as Medium priority. Dec 13 2018, 12:23 AM
Smalyshev moved this task from Next to In review on the User-Smalyshev board.

The fix is already in newer Sesame, so https://github.com/blazegraph/database/pull/112 should solve the issue by upgrading Sesame to version 2.8.11.

Getting this in the CI run:

18:59:10 SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
18:59:10 SLF4J: Defaulting to no-operation (NOP) logger implementation
18:59:10 SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
18:59:10 log4j:WARN No appenders could be found for logger (com.bigdata.rdf.ServiceProviderHook).
18:59:10 log4j:WARN Please initialize the log4j system properly.
18:59:10 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
18:59:10 Failed tests: 
18:59:10   TestEncodeDecodeValue.test_encodeDecode_Literal:89->doTest:190 expected:<"abc"[]> but was:<"abc"[^^<http://www.w3.org/2001/XMLSchema#string>]>
18:59:10   TestEncodeDecodeValue.test_encodeDecode_Literal_escapeCodeSequence:170->doTest:190 expected:<"ab"c"[]> but was:<"ab"c"[^^<http://www.w3.org/2001/XMLSchema#string>]>
18:59:10   TestEncodeDecodeValue.test_encodeDecode_Literal_languageCode:128->doTest:190 expected:<"abc"@en[]> but was:<"abc"@en[^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString>]>
18:59:10   TestEncodeDecodeValue.test_encodeDecode_Literal_singleQuotes:123->doTest:190 expected:<"'ab'c'"[]> but was:<"'ab'c'"[^^<http://www.w3.org/2001/XMLSchema#string>]>
18:59:10

Not sure whether the addition of <http://www.w3.org/2001/XMLSchema#string> is legit or not, need to check.

This seems to be a change in Sesame - before, LiteralImpl(String) produced null as datatype, now it produces XMLSchema.STRING. It was changed in this commit: https://bitbucket.org/openrdf/sesame/commits/675015e6b996cc8609fa735730baa49edf27d2e7

Change 502311 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/blazegraph@master] Upgrade to Sesame 2.8.11

https://gerrit.wikimedia.org/r/502311

Turns out Sesame 2.8 differs quite a bit from 2.7: in RDF 1.1/SPARQL 1.1 there are no "simple literals" anymore, i.e. the literal "abc" is the same as the literal "abc"^^xsd:string, and the only kind of plain literal left is the language-tagged one, "abc"@en. Updating for this may be a bit tricky.
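
A small sketch of that RDF 1.1 rule, and of why the tests above started failing, might look like this. It is a Python model, not Sesame's or Blazegraph's API: the Literal class, to_rdf11 and verbose_str are hypothetical names, and verbose_str merely mirrors the shape of the strings in the CI log (it does no escaping).

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


@dataclass(frozen=True)
class Literal:
    """Hypothetical literal model: label plus optional datatype and language."""
    label: str
    datatype: Optional[str] = None
    language: Optional[str] = None


def to_rdf11(lit):
    """Apply the RDF 1.1 rule that every literal carries a datatype."""
    if lit.language is not None:
        # Language-tagged literals always have datatype rdf:langString.
        return Literal(lit.label, RDF_LANGSTRING, lit.language)
    if lit.datatype is None:
        # A former "simple literal" is now just an xsd:string literal.
        return Literal(lit.label, XSD_STRING)
    return lit


def verbose_str(lit):
    """Render label, language tag and datatype in the style seen in the log."""
    text = '"' + lit.label + '"'
    if lit.language is not None:
        text += "@" + lit.language
    if lit.datatype is not None:
        text += "^^<" + lit.datatype + ">"
    return text


print(verbose_str(Literal("abc")))            # 2.7-style: no datatype
print(verbose_str(to_rdf11(Literal("abc"))))  # 2.8-style: xsd:string attached
print(verbose_str(to_rdf11(Literal("abc", language="en"))))
```

Under this model, "abc" and "abc"^^xsd:string become equal after normalization, which is the kind of change the encode/decode tests would need to account for.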

Change 502311 merged by Smalyshev:
[wikidata/query/blazegraph@master] Upgrade to Sesame 2.8.11

https://gerrit.wikimedia.org/r/502311

@Smalyshev Thank you for this fix! Is there a way to know if/when this is deployed to query.wikidata.org? Regards