
Wikidata accepts invalid geo coordinates, causing indexing failures
Open, High, Public

Description

Saw some failures for updates to Elasticsearch coming from testwikidatawiki. It turns out this is happening because the coordinates are invalid:

120°0'N, 111°0'E
https://test.wikidata.org/wiki/Q206

Should probably add some validation in GeoData (or do these get provided by wikidata directly?)

Event Timeline

Restricted Application added projects: Discovery, Discovery-Search. Mar 10 2017, 12:14 AM
Restricted Application added a subscriber: Aklapper.

GeoData does validate coordinates provided to it the standard way - via the parser function. WD adds its data later, when GD assumes it doesn't need any validation. I guess I can add optional validation to CoordinatesOutput but WD would need to handle exceptions I'll be throwing. Any objections?

thiemowmde triaged this task as Lowest priority.

Note that these coordinates are not "invalid" per se. The LatLongValue class in Wikibase does have a range check that limits latitude and longitude to ]-360, +360[. This is done on purpose and is not a mistake. Whether a coordinate like "120° N, 111° E" can be considered valid depends on the globe.

Here are some ideas of what can be done:

  • Make sure external validation tools for wikidata.org know about this and suggest fixing coordinates on Earth that have unusual values like this.
  • We could add an actual validation step to the Wikibase code base that limits the values, in case the globe is known to be Earth. We already have code that hard-codes http://www.wikidata.org/entity/Q2 in multiple places, so this would not even introduce a new dependency.
  • Make the code that consumes such values aware of it, and either skip or transform them. Transformation is possible, because these coordinates still describe a unique point on a globe, even if the format is "invalid". There is even existing code for said normalization in https://github.com/DataValues/Geo/blob/master/src/GlobeMath.php.
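To illustrate the transformation mentioned in the last idea, here is a minimal Python sketch of how an out-of-range coordinate can be folded back onto the same point of a sphere. It is not a port of GlobeMath.php; the function name `normalize_lat_lon` and the chosen target ranges (latitude in [-90, 90], longitude in [-180, 180)) are assumptions made for this example only.

```python
def normalize_lat_lon(lat, lon):
    """Return the same point on a sphere with lat in [-90, 90] and lon in [-180, 180).

    Hypothetical helper for illustration; angles are in degrees.
    """
    # Reduce latitude to [-180, 180) first, so at most one pole crossing remains.
    lat = (lat + 180.0) % 360.0 - 180.0
    if lat > 90.0:
        # Crossed the north pole: mirror the latitude, move to the opposite meridian.
        lat = 180.0 - lat
        lon += 180.0
    elif lat < -90.0:
        # Crossed the south pole: same idea on the other side.
        lat = -180.0 - lat
        lon += 180.0
    # Wrap longitude into [-180, 180).
    lon = (lon + 180.0) % 360.0 - 180.0
    return lat, lon


# The coordinate from the task description, 120° N, 111° E:
print(normalize_lat_lon(120, 111))  # (60.0, -69.0)
```

Under these assumptions, "120° N, 111° E" describes the same point as 60° N, 69° W: going 30° past the north pole lands on the opposite meridian.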

This problem means that these pages with invalid coordinates are not indexed and will never show up in search. I am confused why this is a lowest priority issue for Wikidata. Can someone explain? :-)

There is way too much essential information missing here: Steps to reproduce, what "search" the ticket is referring to, what information is provided to said search (e.g. via an example URL), what "failures" appear, and where they appear. Currently the ticket jumps straight to the conclusion that Wikibase must miss validation, which is incorrect, as I just explained. I did not close this ticket as invalid because I understand there is indeed something wrong, I just can't figure out where based on the given information. All I know is that the error is not in Wikibase, as explained above.

So please describe the problem and not a proposed solution.

There is way too much essential information missing here: Steps to reproduce, what "search" the ticket is referring to, what information is provided to said search (e.g. via an example URL), what "failures" appear, and where they appear.

I don't know the details either. So, let's figure them out together, shall we? :-)

If you do a fulltext search for Q206 on test.wikidata.org, you don't get any results; see API query and search on-wiki. It still displays the "There is a page named..." message, but that's because there's an exact title match; that doesn't have anything to do with the search index, e.g. searching for just the label doesn't display that message and also displays no results as above.

Note that this problem is restricted to full-text search and doesn't seem to affect the search box on pages; see screenshot below.

Now, I know next to nothing about the technical details, so @EBernhardson and @MaxSem will have to help you there. But, hopefully that should describe the problem at least. :-)

thiemowmde raised the priority of this task from Lowest to High. Mar 24 2017, 12:37 PM

@Deskana, this is very helpful, thank you very much. This gave me an idea. Items with extreme coordinates like the one shown above are obviously not indexed at all. I guess there is some code somewhere that fails with an error, which makes CirrusSearch skip the Item entirely. I'm pretty sure said code is not in Wikibase, which most probably makes this a Discovery-Search ticket.

As far as I know, Cirrus does nothing with statement values at all at the moment. Also, fulltext search is essentially broken for Wikidata; we are only now starting to add code to make it useful.
If extreme globe coordinates cause items to go missing from the Cirrus index, the problem pretty much has to be in Wikibase, perhaps in our integration with the GeoData extension. Maybe @MaxSem has an idea.

FYI, I tried to add an extreme value to the GeoDataDataUpdaterTest in Wikibase, but all code this test triggers (which includes parts of GeoData, including its Coord constructor) succeeds just fine.

@thiemowmde thanks for checking this out. Strange issue. Not sure how to find the problem.

EBernhardson added a comment. Edited Mar 24 2017, 3:21 PM

The indexing failure comes from Elasticsearch itself. Elasticsearch will only accept geo coordinates that are valid on an Earth globe; it rejects any create/update request for a document that includes a coordinate outside the -180 to 180 range. CirrusSearch doesn't do anything with the coordinates; they are provided by GeoData as data to be indexed.

Ok. And where are we going to fix this? I would not like to add special case handling to Wikibase. I believe it should be GeoData, or whatever code actually puts this into Cirrus. It should check the Coords it gets (they include the globe, and GeoData supports more than just earth), and either fix or skip coordinates that are known to fail on Cirrus.

daniel added a comment. Edited Mar 24 2017, 5:53 PM

Could GeoData "wrap" the coordinates into the range of -180..+180 before exposing them to elastic? Or at least skip unsupported ranges, without causing the indexing to be aborted completely.

IMO the appropriate fix is probably in GeoData: it needs to not provide coordinates for Elasticsearch to index if they don't validate.
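The "skip what doesn't validate" approach can be sketched roughly as follows. This is a Python illustration only, not GeoData's actual code: the names `is_indexable` and `coordinates_for_index`, the tuple layout, and the exact ranges (latitude -90 to 90, longitude -180 to 180, Earth only) are assumptions; the comments above only state that values outside -180 to 180 are rejected.

```python
# Assumption: Earth is identified by this globe URI, as mentioned earlier in the thread.
EARTH = "http://www.wikidata.org/entity/Q2"


def is_indexable(lat, lon, globe=EARTH):
    """Hypothetical check: True if the search backend is assumed to accept the point."""
    return (
        globe == EARTH
        and -90.0 <= lat <= 90.0
        and -180.0 <= lon <= 180.0
    )


def coordinates_for_index(coords):
    """Keep only coordinates that pass the check, so one bad value
    no longer causes the whole document update to be rejected."""
    return [
        {"lat": lat, "lon": lon}
        for (lat, lon, globe) in coords
        if is_indexable(lat, lon, globe)
    ]


coords = [
    (120.0, 111.0, EARTH),  # the value from the task description, skipped
    (52.5, 13.4, EARTH),    # an ordinary Earth coordinate, kept
]
print(coordinates_for_index(coords))  # [{'lat': 52.5, 'lon': 13.4}]
```

With a filter like this the page is still indexed, just without the offending coordinate, which addresses the "pages never show up in search" symptom without changing what Wikibase accepts.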

Change 345419 had a related patch set uploaded (by MaxSem):
[mediawiki/extensions/GeoData@master] Don't attempt to index invalid coordinates

https://gerrit.wikimedia.org/r/345419

Change 345419 merged by jenkins-bot:
[mediawiki/extensions/GeoData@master] Don't attempt to index invalid coordinates

https://gerrit.wikimedia.org/r/345419