Page MenuHomePhabricator

Query service TSV/CSV returns different number formatting in different contexts
Open, MediumPublic

Description

The query service is giving different notation formats for numbers in TSVs depending on the method of access (direct download / web interface)

For example, this query - https://w.wiki/JoF - to find someone's age on a certain date. The web query service returns the answer as "19493.0".

Downloading the standard TSV gives:

president presidentLabel president_age
http://www.wikidata.org/entity/Q1124 Bill Clinton 19493.0

If we use the "TSV verbose" download option, we get basically the same thing:

?president ?presidentLabel ?president_age
<http://www.wikidata.org/entity/Q1124> "Bill Clinton"@en 19493.0

The URL for the entitity is now wrapped in <>, the label is quoted and has a @xx language tag. The numeric value is the same, however.

But if we get the TSV through curl, we get something subtly different to either of these:

curl --header "Accept: text/tab-separated-values" "https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fpresident%20%3FpresidentLabel%20%3Fpresident_age%0Awhere%20%7B%0Abind%28%28%222000-01-01T00%3A00%3A00Z%22%5E%5Exsd%3AdateTime%29%20as%20%3Fdate%29%20.%0A%3Fpresident%20p%3AP39%20%3Fpps%20.%20%3Fpps%20ps%3AP39%20wd%3AQ11696%20.%20%3Fpresident%20wdt%3AP31%20wd%3AQ5%20.%0A%3Fpps%20pq%3AP580%20%3Fpres_start%20.%20OPTIONAL%20%7B%20%3Fpps%20pq%3AP582%20%3Fpres_end%20.%20%7D%0Afilter%20%28%3Fpres_start%20%3C%3D%20%3Fdate%20%29%20.%20filter%20%28COALESCE%28%3Fpres_end%2C%20NOW%28%29%29%20%3E%3D%20%3Fdate%20%29%20.%0A%23%20use%20qualifier%20dates%3B%20%C2%A0if%20no%20end%20date%20use%20now%0A%3Fpresident%20wdt%3AP569%20%3Fpres_born%20.%20bind%28%28%3Fdate%20-%20%3Fpres_born%29%20as%20%3Fpresident_age%29%0ASERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%27en%27%20%7D%0A%7D"

?president ?presidentLabel ?president_age
<http://www.wikidata.org/entity/Q1124> "Bill Clinton"@en 1.9493E4

This is generally the same as the "TSV verbose" formatting, which is as expected, but the number is now in scientific notation and not decimals. It applies to both CSV and TSV forms in curl. (JSON and XML keep the decimal form).

This isn't wrong, as such; they do represent the same number and there is no loss of information by rounding or through added false precision. However, it is a bit confusing for the end user - I can't see an obvious reason these two should be differently formatted depending on how you access the TSV. Decimal would probably be preferred, if standardisation is possible.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

For convenience, this query hard-codes the same results.

$ curl -H 'Accept: text/csv' 'https://query.wikidata.org/sparql?query=SELECT%0A%28wd%3AQ1124%20AS%20%3Fpresident%29%0A%28%22Bill%20Clinton%22%40en%20AS%20%3FpresidentLabel%29%0A%28%2219493.0%22%5E%5Exsd%3Adouble%20AS%20%3Fpresident_age%29%0A%7B%7D'
president,presidentLabel,president_age
http://www.wikidata.org/entity/Q1124,Bill Clinton,1.9493E4

Notably, this only happens for xsd:double values. xsd:decimal is unaffected:

$ curl -H 'Accept: text/csv' -d query='SELECT ("19493.0"^^xsd:double AS ?xsdDouble) ("19493.0"^^xsd:decimal AS ?xsdDecimal) (19493.0 AS ?literal) {}' https://query.wikidata.org/sparql
xsdDouble,xsdDecimal,literal
1.9493E4,19493.0,19493.0

This might actually be an intended feature of Blazegraph. In SPARQL, numeric literals without an exponent (but with a decimal part) are tagged as xsd:decimal, whereas numeric literals with exponent are tagged as xsd:double (§4.1.2 Syntax for Literals); reusing this convention in the output allows preserving a little bit of type information in the otherwise-untyped CSV/TSV output formats.

That said, I doubt this is actually useful to many people.

Gehel triaged this task as Medium priority.Sep 15 2020, 7:50 AM