Implement nearby keyword for full text search
Closed, ResolvedPublic

Description

Mobile apps would like to start providing people localized search results, based on the GPS coordinates available to them. This is essentially an integration of the GeoData nearby api into full text search.

Keywords to implement:

  • nearcoord: 40.446,-79.982
  • nearcoord: 20km,40.446,-79.982
  • neartitle: "San Francisco"
  • neartitle: "20km,San Francisco"
  • boost-nearcoord: 5km,40.446,-79.982
  • boost-neartitle: "5km,Coit Tower"
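As an illustration of the coordinate syntax listed above, here is a minimal parsing sketch in Python. It is not the code from the patch: the name `parse_nearcoord`, the `NearCoord` type, and the choice to accept only a `km` unit are assumptions made for this example, and a missing radius is returned as `None` because the task does not state a default.

```python
import re
from typing import NamedTuple, Optional


class NearCoord(NamedTuple):
    radius_km: Optional[float]  # None when the keyword gives no radius
    lat: float
    lon: float


# Optional "<number>km," prefix followed by "<lat>,<lon>".
# Accepting only "km" is an assumption; the task only shows km examples.
_NEARCOORD_RE = re.compile(
    r"^\s*(?:(\d+(?:\.\d+)?)km\s*,)?\s*"
    r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*$"
)


def parse_nearcoord(value: str) -> NearCoord:
    """Parse the value of a nearcoord: or boost-nearcoord: keyword."""
    match = _NEARCOORD_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognised nearcoord value: {value!r}")
    radius, lat, lon = match.groups()
    lat_f, lon_f = float(lat), float(lon)
    if not (-90.0 <= lat_f <= 90.0 and -180.0 <= lon_f <= 180.0):
        raise ValueError(f"coordinates out of range: {value!r}")
    return NearCoord(float(radius) if radius is not None else None, lat_f, lon_f)
```

For example, `parse_nearcoord("20km,40.446,-79.982")` yields a 20 km radius around the given point, while `parse_nearcoord("40.446,-79.982")` leaves the radius unset so the caller can apply whatever default it chooses.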

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.

After talking to Max: while the idea of using | as the separator makes sense, it prevents wikilink searches such as [[Special:Search/nearby:"20km|San Francisco"]], because | isn't allowed in a title. The chance of a page title actually starting with "Nkm," is low enough that we will make that bold assumption and treat such a prefix as a radius.
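The "bold assumption" can be sketched as follows: any value that begins with a number, `km`, and a comma is treated as a radius followed by a title. This is a hypothetical helper (`split_neartitle` is a name invented here), not the merged implementation.

```python
import re
from typing import Optional, Tuple

# A leading "<number>km," is taken to be a radius, never part of the title.
_RADIUS_PREFIX_RE = re.compile(r"^(\d+(?:\.\d+)?)km,(.+)$")


def split_neartitle(value: str) -> Tuple[Optional[float], str]:
    """Split a neartitle: value into (radius_km or None, page title)."""
    match = _RADIUS_PREFIX_RE.match(value)
    if match is None:
        return None, value.strip()
    return float(match.group(1)), match.group(2).strip()
```

The trade-off is visible in the edge case: a page genuinely titled "10km,Something" could not be used as a bare neartitle target, because its prefix would be read as a radius.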

Change 297524 had a related patch set uploaded (by EBernhardson):
Implement near: and neartitle: search keywords

https://gerrit.wikimedia.org/r/297524

I've noticed while implementing this that in production we are limiting the radius to a maximum of 10km. @dcausse Is there any good way we could evaluate the consequences of increasing this to N km?

@JMinor Does your use case have any approximations for what radii you will be showing to users?

@EBernhardson we would like to increase the search radius a bit to get to a more "regional" size. That's a bit vague, but think of the use case of a user zoomed in on a metro-sized area, which is larger than a city but generally smaller than a state or small country.

Since we should be returning far fewer results than without the keyword, the thought was that widening the search area would hopefully not have much of a performance impact, but that needs to be verified.

The example I have used on several occasions is that if I am zoomed in on the SF Bay Area and search for "bridge", we should get back results for the Golden Gate and Bay Bridges.

I think doubling the search radius to 20km would be a good place to start, but for future proofing and for other use cases, going larger may be good. I'd be curious to hear others' thoughts.

Side note: sorting the results is probably just as important as the radius. Using my example above, there may be several pages in SF that have the word "bridge" in the title, and hopefully those won't push the actual bridges out of the top results. Increasing the radius may make this problem worse, though I'm not sure whether this is or will be an actual problem, or what the right solution would be.

Fjalapeno raised the priority of this task from Medium to Needs Triage.Jul 6 2016, 6:33 PM
Fjalapeno moved this task from Needs Triage to Tracking on the Wikipedia-iOS-App-Backlog board.

@dcausse gave some useful code review. Because this query is different enough from how the GeoData API queries things, we may be able to open up the allowable radius quite a bit. For queries that include any other search term we should be able to allow almost any size. For queries that include *only* a nearby keyword and no other search term we can still allow a larger radius, but the results may not be fully deterministic, as the rescore window has a size limit. When the query returns more results than fit in the per-shard rescore window, some documents will not be considered.
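To illustrate why a bounded rescore window can drop documents, here is a deliberately simplified single-shard model in Python. It is an assumption-laden sketch, not how the search backend combines scores: `rescore_top_window` and the popularity-based rescore function are invented for this example. When every first-pass score is equal (a nearby-only query), which documents land inside the window is arbitrary.

```python
from typing import Callable, List, Sequence, Tuple

Hit = Tuple[str, float]  # (title, score)


def rescore_top_window(
    hits: Sequence[Hit], rescore: Callable[[str], float], window: int
) -> List[Hit]:
    """Rescore only the first `window` hits; the rest keep their old order."""
    # First pass: order by the cheap score (stable sort, so ties keep input order).
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    head, tail = ranked[:window], ranked[window:]
    # Second pass: the expensive rescore is applied to the head only.
    rescored = sorted(
        ((title, rescore(title)) for title, _ in head),
        key=lambda hit: hit[1],
        reverse=True,
    )
    return rescored + list(tail)
```

With three tied hits and a window of two, the most "popular" page can end up last simply because it was never rescored; widening the window to three lets it rise to the top.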

I think @Fjalapeno did the use case justice, but see also T114066 which is a similar request for setting "zoom radius" on nearby queries filed by @Dbrant for Android.

It would be great to let the client request an optional radius, and then we'll be at the mercy of the ranking quality (which per our meeting is going to be a sticky bit from the user quality/acceptance).

If we don't specify a radius, I don't have strong reasoning for a specific size, though I agree a "metropolitan area" of 20km or so makes sense. I don't think the lack of repeatable results is problematic for the user (how often do they run the exact same query with the map view? will they really be upset if the results change slightly?), but I guess it depends on how bad the volatility of the rescore is.

To make this a bit more flexible I've added some boosting options as well. These boosting options can be used on their own, or in combination with the radius filters. By combining the keywords we can achieve results such as searching within 20km and providing a boost to articles within 2km. I haven't thought of a good way to adjust the syntax for user-specified boosts though, so this is currently hard-coded with a *2 score multiplier. In previous testing *2 seemed plenty sufficient (although this was not a rigorous test; I ran a few queries for bridge, museum, etc. and looked at the results). Be warned that multiple uses stack, so providing both a boost-nearcoord:5km,... and a boost-nearcoord:20km,... will give a total boost of *4 to pages within 5km.

This will have to be in a separate patch, but based on a suggestion in code review I put together a proof-of-concept query that takes the geo-filtered results and breaks up the result area into a grid. It then returns the top N documents per point in the grid, allowing a fairly even distribution across the search area.
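The stacking behaviour can be worked through with a small Python sketch. The *2 factor comes from the comment above; everything else (the function names, the haversine distance, the spherical Earth radius of 6371 km) is an assumption for illustration and not the scoring code used by the search backend.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation
BOOST_FACTOR = 2.0        # the hard-coded *2 multiplier described above


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stacked_boost(
    page: Tuple[float, float], boosts: Iterable[Tuple[float, float, float]]
) -> float:
    """Multiply in BOOST_FACTOR once per (radius_km, lat, lon) boost the page falls inside."""
    multiplier = 1.0
    for radius_km, lat, lon in boosts:
        if haversine_km(page[0], page[1], lat, lon) <= radius_km:
            multiplier *= BOOST_FACTOR
    return multiplier
```

With a 5 km and a 20 km boost around the same point, a page at the centre gets *4, a page roughly 10 km away gets *2, and a page outside both radii keeps its original score.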

query: P3359
results (bridges within 40km of sydney, bucketed to 25km^2): P3360
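The grid idea can be sketched client-side in Python. This is only an illustration of "top N per grid cell"; the proof of concept itself is a search query (P3359), and the `Doc` type, the `top_n_per_cell` name, and the fixed-size degree grid used here are assumptions made for this example.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Doc(NamedTuple):
    title: str
    lat: float
    lon: float
    score: float


def top_n_per_cell(docs: Iterable[Doc], cell_deg: float = 0.05, n: int = 1) -> List[Doc]:
    """Keep the n best-scoring docs in each cell of a cell_deg x cell_deg grid."""
    cells: Dict[Tuple[int, int], List[Doc]] = defaultdict(list)
    for doc in docs:
        # Integer cell index along each axis; floor keeps negative coordinates consistent.
        key = (math.floor(doc.lat / cell_deg), math.floor(doc.lon / cell_deg))
        cells[key].append(doc)
    picked: List[Doc] = []
    for members in cells.values():
        members.sort(key=lambda d: d.score, reverse=True)
        picked.extend(members[:n])
    # Present the survivors best-first, as a normal result list would be.
    picked.sort(key=lambda d: d.score, reverse=True)
    return picked
```

Two pages a few hundred metres apart compete for one slot, while a lower-scoring page in an otherwise empty cell still makes the list, which is what spreads results across the map.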

cc @RHo who is the designer on this feature, and also spent some time in Sydney. See the JSONs linked above for sample query results. It would be great to get a quick sanity check on whether these seem like reasonable results for bridges within 40km of Sydney.

Turns out the above has too many results, and many things that aren't bridges (the page content just happens to include "bridge"). It could have been better with the intitle:bridge syntax, so I tried that, but it turns out there are only ~40 pages of bridges near Sydney, so the difference between normal sorting and bucketing into a grid is small.

So I put together a better query example that searches strictly by area, not for anything in particular. There are ~1k pages with coordinates within 10km of the chosen point (in Sydney) to choose from.

query: nearcoord:10km,-33.865,151.209444
map of top 25 results: http://jsfiddle.net/xu7wL/120/

query: nearcoord:10km,-33.865,151.209444 geogrid:5,1
(note: I haven't implemented geogrid, it's a placeholder. The numbers mean a geohash precision of 5 and 1 result per bucket)
map of top result in each point of the grid (24 results): http://jsfiddle.net/xu7wL/122/
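For readers unfamiliar with geohash precision, the following Python sketch shows the standard geohash encoding: longitude and latitude bits are interleaved and every 5 bits become one base-32 character, so a precision of 5 means 25 bits and cells of very roughly 5km on a side. The function name `geohash_encode` is chosen for this example; the search backend has its own implementation, and geogrid itself remains a placeholder keyword.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # 5 bits per base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Pages whose precision-5 geohashes are equal fall into the same bucket, so "1 result per bucket" keeps only the best-scoring page among them.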

Unfortunately, while this demos how it separates the results out much better across the map, there is a bug in my query where the aggregation is happening before the rescore. I'm looking into whether that's possible to fix, or whether it is a downside of using aggregations. The primary downside here is that when not using any search term there are no scores (prior to the rescore, which takes popularity and various templates into consideration), so the items chosen on the grid are spaced out but relatively random.

Here is a map with the scoring fixed, but due to the way it works we couldn't offer search over large geographic areas with it (the expensive rescore is run on all results, instead of the top 8192 per shard): http://jsfiddle.net/xu7wL/123/

Change 297524 merged by jenkins-bot:
Implement nearcoord: and neartitle: search keywords

https://gerrit.wikimedia.org/r/297524