
JSON results serializer in Wikidata Query Service generates an extra "datatype" field
Closed, Resolved · Public

Description

The JSON results serializer in the Wikidata Query Service generates an extra "datatype" field.

Steps to Reproduce:

curl -H 'Accept: application/sparql-results+json' -d query='SELECT ("" AS ?string) (""@en AS ?langString) {}' https://query.wikidata.org/sparql
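
For scripted reproduction, roughly the same request can be built in Python with the standard library. This is a minimal sketch: the helper name `build_request` is illustrative and not part of any WDQS tooling, and the request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = 'SELECT ("" AS ?string) (""@en AS ?langString) {}'


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a POST request equivalent to the curl call."""
    # curl -d posts the body as application/x-www-form-urlencoded and leaves
    # escaping to the caller; urlencode percent-encodes the query for us.
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Accept": "application/sparql-results+json"},
    )


request = build_request(QUERY)
# To actually send it: urllib.request.urlopen(request).read()
```

Supplying `data` makes `urllib` issue a POST, which matches what `curl -d` does.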

Actual Results:

{
  "head" : {
    "vars" : [ "string", "langString" ]
  },
  "results" : {
    "bindings" : [ {
      "string" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#string",
        "type" : "literal",
        "value" : ""
      },
      "langString" : {
        "xml:lang" : "en",
        "datatype" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString",
        "type" : "literal",
        "value" : ""
      }
    } ]
  }
}

Expected Results:

No datatype field as per https://www.w3.org/TR/2013/REC-sparql11-results-json-20130321/#select-encode-terms

{
  "head" : {
    "vars" : [ "string", "langString" ]
  },
  "results" : {
    "bindings" : [ {
      "string" : {
        "type" : "literal",
        "value" : ""
      },
      "langString" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : ""
      }
    } ]
  }
}
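
A client that needs the expected shape regardless of what the endpoint returns could normalize the parsed response itself. The following is a hedged sketch of such a client-side workaround, not the fix applied in WDQS; the function name `strip_implicit_datatypes` is hypothetical.

```python
import copy

# Datatypes that RDF 1.1 assigns implicitly to plain and language-tagged
# literals; the expected output above does not write them out.
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
IMPLICIT_DATATYPES = {XSD_STRING, RDF_LANGSTRING}


def strip_implicit_datatypes(results):
    """Return a copy of a parsed SPARQL JSON result without implicit datatypes."""
    cleaned = copy.deepcopy(results)  # leave the caller's data untouched
    for binding in cleaned.get("results", {}).get("bindings", []):
        for term in binding.values():
            is_literal = term.get("type") == "literal"
            if is_literal and term.get("datatype") in IMPLICIT_DATATYPES:
                del term["datatype"]
    return cleaned
```

Applied to the "Actual Results" document above (after `json.loads`), this yields the "Expected Results" document; literals with any other datatype, such as `xsd:integer`, keep their `datatype` field.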

Event Timeline

Fnielsen created this task. Jun 18 2019, 9:29 AM
Restricted Application added a subscriber: Aklapper. Jun 18 2019, 9:29 AM
Fnielsen updated the task description. Jun 18 2019, 9:32 AM

BTW: This affects the ShEx validator.

BTW: This affects the ShEx validator.

This was worked around now.

Lucas_Werkmeister_WMDE renamed this task from JSON results serializer in Wikidata generates an extra "datatype" field to JSON results serializer in Wikidata Query Service generates an extra "datatype" field. Jun 18 2019, 3:16 PM
Smalyshev added a subscriber: Smalyshev. (Edited) Jun 18 2019, 5:38 PM

In the new Sesame and RDF 1.1, everything has a datatype, even string literals. Is it causing any problems? An additional field in JSON should not be problematic.

Is it causing any problems?

Well, not right now, but it used to.

BTW: This affects the ShEx validator.

This was worked around now.

It's a bit messy since different RDF/SPARQL/etc. standards disagree on what literals look like: newer RDF 1.1 says everything has a datatype but allows the datatype to be omitted in some syntaxes, and some older standards still omit the datatype...

It sounds like this also broke our own code (judging from the fix for T226017), so perhaps we should announce it to wikitech-l? I assume this is due to the recent Sesame upgrade in Ic0092b2787?

Smalyshev triaged this task as Normal priority. Jun 20 2019, 10:13 PM

I wrote a note to the Wikidata list. The standards seem to be conflicting here, so I'll try to research what the more correct, or at least most accepted, practice is. For now, it stays as it is, but if it turns out that common practice is to omit the types, we'd have to patch.

Smalyshev changed the subtype of this task from "Bug Report" to "Task". Jun 21 2019, 4:54 AM
ericP added a subscriber: ericP. (Edited) Jun 21 2019, 6:07 AM

The RDF model asserts that a literal with a langtag is considered to have a datatype of rdf:langString. The various formats state exactly how literals with langtags are written. The former specifies how APIs behave, such as a SPARQL FILTER DATATYPE("ab"@es) = rdf:langString; the latter specifies that it's written {"type":"literal", "value":"ab", "xml:lang":"es"}. The rules for the XML and JSON results formats are that the implicit datatype of langtagged literals is omitted.

I can see how this is confusing but I believe the community doesn't see them as conflicting. Also, sadly, it seems that these are tested for the XML results format (look for xml:lang in the results for strlang03), but not for JSON.
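
The split described above (every literal has a datatype in the model, but the results format omits the implicit ones) can be sketched as a small serializer. This is an illustration under stated assumptions, not the Blazegraph code: the function name `encode_literal` and its signature are hypothetical.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def encode_literal(value, lang=None, datatype=None):
    """Encode one RDF literal as a SPARQL JSON results term.

    A language tag implies rdf:langString and a plain literal implies
    xsd:string, so neither datatype is written to the output.
    """
    term = {"type": "literal", "value": value}
    if lang is not None:
        # Language-tagged literal: the tag alone identifies it.
        term["xml:lang"] = lang
    elif datatype is not None and datatype != XSD_STRING:
        # Only explicitly typed, non-string literals carry a datatype.
        term["datatype"] = datatype
    return term
```

For example, `encode_literal("ab", lang="es")` returns `{"type": "literal", "value": "ab", "xml:lang": "es"}`, matching the form quoted above.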

The Wikidata app in Cytoscape is also affected by this.

The rules for the XML and JSON results formats are that the implicit datatype of langtagged literals is omitted.

The problem is I haven't seen this codified in any standard. The JSON results standard predates RDF 1.1, so I can't be sure whether it's just out of date with RDF 1.1 or is actually meant to override the RDF 1.1 changes. And the test suite you're quoting is from 2009, while RDF 1.1 is from 2014, so again, how can I be sure?

This: https://issues.apache.org/jira/browse/JENA-1077 suggests JENA also thinks types should be omitted on output for plain/language literals.

OK, I did a survey of existing SPARQL endpoints and it looks like they all omit datatypes on plain literals in JSON. So I'll fix WDQS to do the same.

Change 518310 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/blazegraph@master] Fix writing JSON result literals

https://gerrit.wikimedia.org/r/518310

Addshore moved this task from incoming to in progress on the Wikidata board. Jun 21 2019, 11:25 PM

Change 518310 merged by jenkins-bot:
[wikidata/query/blazegraph@master] Fix writing JSON result literals

https://gerrit.wikimedia.org/r/518310

Change 518401 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Build with wmf.6

https://gerrit.wikimedia.org/r/518401

Change 518401 merged by jenkins-bot:
[wikidata/query/rdf@master] Build with wmf.6

https://gerrit.wikimedia.org/r/518401

Smalyshev closed this task as Resolved. Jun 24 2019, 6:51 PM