Make GeoData search over a larger area
Closed, Declined (Public)

Description

Currently, GeoData is restricted to the narrow use case of searching within a radius of 10 km (20 km on Wikivoyage). That's mostly sufficient for apps or mobile web Nearby, but not nearly enough for general use, such as displaying POIs on a map with widely varying zoom levels. We need to raise the limit substantially, while making sure that we don't permit queries that are too slow (e.g. sorting all the pages on enwiki by distance).

CC CirrusSearch team for performance considerations.

MaxSem created this task. Sep 10 2015, 12:11 AM

@dcausse I don't know nearly enough about this, although I've just started reading a bit about how these geo distance queries work. Any ideas here? The relevant queries are generated at [1]

[1] https://phabricator.wikimedia.org/diffusion/EGDA/browse/master/api/ApiQueryGeoSearchElastic.php;ba59105514637edf2467d388dd3eb8de2eccbae2$22

I'm not sure this query is well suited to displaying points on a map.
It will always return the closest pages, so when the user zooms out the results won't change, because they are capped by the "limit size" param. We will get the same set of nearby points whatever the zoom level is.
So I think we should change the sort method (line 76) to something else. I'm not sure what would make sense here (incoming_links, as we do for normal searches?), and we should certainly move it to a rescore function instead of a sort. It's also hard to predict how the points will be distributed, i.e. rescoring on incoming_links can lead to weird displays where most of the points are concentrated in the same tiny area. I'm not sure how to force a nice distribution on the map (divide the screen into subcells and run multiple queries?).
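To make the "subcells" idea concrete, here is a minimal Python sketch, not GeoData or CirrusSearch code. The helper names `split_bbox` and `pick_per_cell` are hypothetical, the points are in-memory tuples rather than search results, and the bounding box is assumed not to cross the antimeridian. In a real setup each cell would presumably become its own query with a small limit.

```python
def split_bbox(top, left, bottom, right, rows, cols):
    """Split a lat/lon bounding box into rows x cols cells.

    Returns a list of (top, left, bottom, right) tuples, row by row from
    the top-left corner. Assumes the box does not cross the antimeridian.
    """
    lat_step = (top - bottom) / rows
    lon_step = (right - left) / cols
    cells = []
    for r in range(rows):
        for c in range(cols):
            cell_top = top - r * lat_step
            cell_left = left + c * lon_step
            cells.append((cell_top, cell_left,
                          cell_top - lat_step, cell_left + lon_step))
    return cells


def pick_per_cell(points, cells, per_cell):
    """Keep the `per_cell` highest-scoring points in each cell.

    `points` is a list of (title, lat, lon, score) tuples. Cells are treated
    as half-open so that a point on a shared edge is counted only once;
    points exactly on the outer bottom or right edge are dropped.
    """
    picked = []
    for top, left, bottom, right in cells:
        inside = [p for p in points
                  if bottom < p[1] <= top and left <= p[2] < right]
        inside.sort(key=lambda p: p[3], reverse=True)
        picked.extend(inside[:per_cell])
    return picked
```

With one result per cell, a cluster of high-scoring points in one corner can no longer crowd out every other part of the map, at the cost of running one query per cell.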

Concerning performance: if we remove the sort based on geo distance, the remaining performance consideration is the geo distance filter (line 44). I think it makes more sense to use a bounding box filter and enable geohash prefix optimizations. We would then be able to use the geohash cell filter (it's like a prefix query for geo queries). Looking at the mapping, we do not index geohash prefixes, so we will have to change the mapping and reindex :(
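For readers unfamiliar with why a geohash cell filter behaves like a prefix query, the following is a small standalone Python sketch of standard geohash encoding. It is only an illustration of the encoding, not what Elasticsearch or GeoData run internally, and `geohash_encode` is a name chosen for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=8):
    """Encode a coordinate as a geohash string of `precision` characters.

    Bits alternate between longitude and latitude (longitude first); each
    bit halves the current interval, and every 5 bits become one character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    nbits = 0
    use_lon = True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:
            chars.append(BASE32[bits])
            bits = 0
            nbits = 0
    return "".join(chars)
```

Because each extra character only subdivides the cell described by the characters before it, every point inside a cell shares that cell's geohash as a prefix. Indexing the prefixes therefore turns "is this point in this cell?" into a term lookup, which is why the mapping change and reindex are needed.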

I'm not sure what's better here: reuse the same function or create a new one. IMHO this is a different use case.

@dcausse, dim (aka the object's rough size) is naturally suited for filtering objects on a map. I have no idea, though, how sane the values in the DB are.

Yurik moved this task from All map-related tasks to Tilerator on the Maps board. Sep 13 2015, 6:41 AM

If it's not possible to limit the radius to something small (a hard limit in the code), maybe we could change the sort operation into a rescore function. The rescore window (8192 in Cirrus) will control the max number of results to sort. This will prevent sorting too many pages, but will also return incomplete/wrong results if the filters match more than 8192 docs per shard.
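The trade-off of a rescore window can be illustrated with a simplified single-shard model in Python. This is an assumption-laden sketch, not how Elasticsearch implements rescoring: `rescore_by_distance` and `haversine_km` are hypothetical helpers, documents are plain tuples already ordered by the first-pass ranking, and only the first `window` of them are re-sorted by distance.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, using a mean Earth radius of 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def rescore_by_distance(docs, center, window):
    """Re-sort only the first `window` docs by distance from `center`.

    `docs` is a list of (title, lat, lon) tuples in first-pass order.
    Docs beyond the window keep their original order and stay below the
    rescored ones, so a nearer doc outside the window is never promoted.
    """
    head = sorted(
        docs[:window],
        key=lambda d: haversine_km(center[0], center[1], d[1], d[2]),
    )
    return head + docs[window:]
```

For example, with `docs = [("A", 0, 1), ("B", 0, 0.5), ("C", 0, 0.1)]` and a window of 2, the nearest document "C" stays last because it never enters the window. That is the "incomplete/wrong results" case described above, and it is the price paid for bounding the sort cost.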

Yurik moved this task from Tilerator to Related on the Maps board. Feb 7 2016, 10:06 PM
MaxSem closed this task as Declined. Jul 19 2016, 6:22 PM

With geosearch keywords available in CirrusSearch as of T139378, this is not really a blocker for anything. I would prefer GeoData to remain a fully deterministic, 100% geo search service over a small area, while Cirrus can use the various scoring mechanisms available to return a few of the most relevant points from the many available over a large area.