
Coordinates are exported into RDF with excessive precision
Open, LowPublic

Description

When coordinates are exported into RDF, they are represented with many more digits than their precision allows. For example, the coordinate for https://www.wikidata.org/wiki/Q116746, with its precision specified as one arcsecond (about 31 m), is exported as Point(13.366666666667 41.766666666667) – 12 decimal places, or sub-millimeter precision. It should be exported as Point(13.3667 41.7667) instead.
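For illustration, a minimal sketch (in Python, not the actual Wikibase serializer) of the expected behavior: derive the number of decimal places from the stored precision and format the WKT literal with exactly that many digits. The function name is invented for this sketch.

```python
import math

# Sketch only, not Wikibase code: format a coordinate with just as many
# decimal places as its stored precision justifies.
def decimals_for(precision: float) -> int:
    """Decimal places needed for a value measured with `precision` degrees."""
    return max(0, math.ceil(-math.log10(precision)))

precision = 1 / 3600                      # "to an arcsecond", ~31 m
lon, lat = 13.366666666667, 41.766666666667
d = decimals_for(precision)               # 4
print(f"Point({lon:.{d}f} {lat:.{d}f})")  # Point(13.3667 41.7667)
```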

Event Timeline

Restricted Application added a subscriber: Aklapper.
thiemowmde subscribed.

I can see that this situation tends to be confusing and could use some improvement, especially UX-wise. But this is not an issue specific to the RDF export or the Wikidata-Query-Service. These numbers are simply how the coordinates are stored internally, and I don't think we can or even should change anything about this. Most of the coordinates are submitted via the API. If the submitted coordinate was just 13.366666666667, why should we truncate it?

Mostly because most of these digits do not represent any real data; they are just junk produced by a decimal representation with an overly large precision, and by various conversions and calculations. We are dragging around meaningless characters that have no use and represent no data. Nobody really measured that coordinate with micron precision and got 13.366666666667; what most probably happened is that it was measured in another system, a calculation involved something like 40.1/3 (probably when converting degrees and minutes to decimal), and the result came out as 13.366666666667. And if we then convert back, we get 40.100000000001 – again, junk data in the last 11 decimal places.
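A quick illustration of that round trip (the degree/minute numbers are hypothetical, but the mechanism is the same):

```python
# Converting degrees and minutes to decimal produces a repeating fraction
# that gets truncated at some arbitrary digit; converting back yields noise
# instead of the original round number.
degrees, minutes = 13, 22               # 13° 22', the datum actually measured
decimal = round(degrees + minutes / 60, 12)
print(decimal)                          # 13.366666666667 - the stored value
print((decimal - degrees) * 60)         # ~22.00000000002 - junk trailing digits
```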

I can follow all your arguments. It's just that I think the effect of this (actually well defined) behavior on users is really, really negligible. Most users are never going to see coordinates as numbers anyway, but as dots or shapes on maps.

And even if they do, which user will think of sub-millimeters when they see a representation like 13.366666666667? Especially when the object is a city, or any larger shape. Most users don't even know what 1 degree is in meters or miles.

That said, I agree this could be improved, and even have an actual suggestion I want to implement some day, either in the RDF export or somewhere deeper in the Wikibase code base: Basically, cut off decimal places that do not have any effect on any of the output formats we support. This algorithm should consider all output formats, because when such an algorithm is applied we don't know which output format will be used.

But this idea requires coordinates to be stored as strings, which they are not. Basically, this requires a new datatype.
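To make that concrete, here is a hypothetical sketch of such a trimming step, operating on a string value as just described. The output-format table and its numbers are invented for illustration; the real algorithm would have to enumerate the formats Wikibase actually supports.

```python
from decimal import Decimal

# Hypothetical output formats and the decimal places each one can render.
OUTPUT_FORMAT_DECIMALS = {
    "wkt": 9,
    "decimal-degrees": 9,
    "degrees-minutes-seconds": 9,
}

def trim_coordinate(value: str) -> str:
    """Drop decimal places that no supported output format can display."""
    max_decimals = max(OUTPUT_FORMAT_DECIMALS.values())
    quantum = Decimal(1).scaleb(-max_decimals)   # 10 ** -max_decimals
    return str(Decimal(value).quantize(quantum).normalize())

print(trim_coordinate("13.366666666667"))  # 13.366666667
```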

Change 521984 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Format coordinates with limited precision

https://gerrit.wikimedia.org/r/521984

Change 521984 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Format coordinates with limited precision

https://gerrit.wikimedia.org/r/521984

I have filed T232984.

Can we please revert this change, or at least replace it with a solution that actually takes into account the precision instead of blindly rounding everything down to 4 decimal places?

It is relatively easy for data consumers to round the coordinate values themselves if they need to (including within SPARQL code), but practically impossible to recover the missing data (higher precision) without referring outside the WDQS once the values have been arbitrarily rounded.
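For comparison, this is roughly what consumer-side rounding looks like – an illustrative Python sketch assuming a well-formed literal; a SPARQL consumer could achieve the same with arithmetic such as ROUND(?lon * 10000) / 10000. The reverse direction has no such remedy: once digits are gone from the literal, no client-side computation can bring them back.

```python
import re

# Illustrative sketch of consumer-side rounding: trim a WKT point literal
# received from the query service to 4 decimal places.
def round_wkt_point(wkt: str, decimals: int = 4) -> str:
    """Round both coordinates of a 'Point(lon lat)' literal."""
    # Assumes the input matches the simple 'Point(lon lat)' shape.
    lon, lat = map(float, re.match(r"Point\(([-\d.]+) ([-\d.]+)\)", wkt).groups())
    return f"Point({lon:.{decimals}f} {lat:.{decimals}f})"

print(round_wkt_point("Point(13.366666666667 41.766666666667)"))
# Point(13.3667 41.7667)
```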

thiemowmde moved this task from monitoring to needs discussion or investigation on the Wikidata board.
thiemowmde added a project: Regression.

As far as I understand the patch https://gerrit.wikimedia.org/r/521984, it truncates all coordinates to at most 4 decimal places, without even looking at the precision of the coordinate value. I'm very much concerned about this, as well as confused about why this was done.

The frontend offers a few precisions to the user. The smallest is "1/1000 of an arcsecond", or about 0.000000278 degrees. This requires at least 7 decimal places, better 8 to be sure.
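The arithmetic behind those numbers, for reference:

```python
import math

# Checking the numbers above: 1/1000 of an arcsecond expressed in degrees,
# and the decimal places needed so that rounding cannot exceed that precision.
precision = 1 / 3600 / 1000               # ~2.78e-07 degrees
print(f"{precision:.9f}")                 # 0.000000278
print(math.ceil(-math.log10(precision)))  # 7
```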

The backend does not limit the precision of a coordinate value to anything. It can be as arbitrary as a user of the API wants. The frontend respects arbitrary precisions, displays them, and makes sure they don't get lost when such a value is edited.

@WMDE-leszek, may I ask why the patch was merged? Was this in a sprint, or discussed anywhere else within the Wikidata team? Also pinging @Lydia_Pintscher because I believe this has now been causing actual data loss (within the query service) for 8 weeks.

Note that changing the exported precision in RDF does not invalidate the hash of the data values, so even if we revert the code change and all affected items are edited again, reduced-precision coordinates will remain in the query service until the value nodes are updated or a full reload is done.

Personally, I think we should revert this change. I also don’t know why it was suddenly merged. But if we decide to do something like it, then:

  - the implementation should take the specified precision of the coordinate value into account instead of hard-coding a certain number of digits,
  - the change should be announced in advance, and
  - we should take care that it applies to all coordinate values, not just those of whichever items are edited after the change is deployed.

(In practice, that last bullet point would probably require a full reload – I doubt that updating //all// coordinate value nodes is feasible.)

Thanks for the additional analysis. Another thing I realized later is that this might cause actual data loss in the Wikidata database. This can happen when a tool uses data it got from the query service to edit the original Wikidata entity. I believe this scenario should be rare, but wanted to mention it.

https://gerrit.wikimedia.org/r/521984 has been reverted, see https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/540402
Reverting does not mean dismissing the points brought up by @Smalyshev here and in T232984. The revert is only meant as a mitigation for the issues reported by @seav in T232984. The issues raised by @Smalyshev still hold; they will need a more appropriate solution than what we tried in https://gerrit.wikimedia.org/r/521984.

Mentioned in SAL (#wikimedia-operations) [2019-10-07T19:09:33Z] <lucaswerkmeister-wmde@deploy1001> Synchronized php-1.34.0-wmf.25/extensions/Wikibase: SWAT: [[gerrit:540419|Revert "Format coordinates with limited precision" (T174504)]] (duration: 00m 57s)

Revert merged and backported, hopefully in time for this week’s RDF dumps (if I read the cron config correctly, they start at 23:00, presumably UTC).

Hello. I stumbled upon this ticket out of curiosity. Here is my contribution:

  1. GeoSPARQL [1], which defines the geo:wktLiteral datatype, does not mention the concept of precision. Therefore, the number of digits does not imply any precision (at least not when following the standard).
  2. This is similar to the datatype xsd:decimal [2], which does not carry any information about precision. The standard states that explicitly and gives an example where 2.0 equals 2.00 (in the value space of xsd:decimal).
  3. Wikibase already handles precision [3] with a "precision" field for coordinates and time, and "upperBound" and "lowerBound" for quantities/numbers.

Therefore I'd conclude that the precision argument alone is not sufficient to justify truncating digits. (There might be other, practical considerations, though.)
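For reference, the shape described in point 3 above – a Wikibase globe-coordinate data value with its precision field (see [3]) – sketched as a Python dict with the numbers from the Q116746 example; the exact serialization of the precision value may carry more or fewer digits.

```python
# Shape of a Wikibase globe-coordinate data value (per [3]); the precision
# field carries the "one arcsecond" precision as a fraction of a degree.
coordinate_value = {
    "type": "globecoordinate",
    "value": {
        "latitude": 41.766666666667,
        "longitude": 13.366666666667,
        "altitude": None,
        "precision": 0.00027777777777778,  # 1/3600 degree = one arcsecond
        "globe": "http://www.wikidata.org/entity/Q2",  # Earth
    },
}
```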

[1] https://www.ogc.org/standards/geosparql/
[2] https://www.w3.org/TR/xmlschema-2/#decimal
[3] https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON