
Create Location Search APIs in RESTBase
Closed, Declined (Public)

Description

The iOS team is currently implementing some new location search features to populate results on a map.

Something like this which returns the most popular results in a region based on page views:
api.php?action=query&cirrusPageViewsW=1000&colimit=100&format=json&generator=search&gsrlimit=100&gsrsearch=nearcoord%3A5421km%2C37.133%2C-95.786&pilimit=100&piprop=thumbnail&pithumbsize=240&prop=coordinates%7Cpageimages%7Cpageterms
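
For reference, a query like the one above could be assembled programmatically. This is only a sketch: it builds the URL and does not send it, the host in `API_BASE` is an assumption (the description shows only the relative `api.php`), and the helper name `build_nearby_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed host; the description only shows the relative "api.php" path.
API_BASE = "https://en.wikipedia.org/w/api.php"


def build_nearby_query(lat, lon, radius_km, limit=100):
    """Build the search-generator query from the description (no request is sent)."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",
        # nearcoord search keyword: radius, then latitude, then longitude
        "gsrsearch": f"nearcoord:{radius_km}km,{lat},{lon}",
        "gsrlimit": limit,
        "cirrusPageViewsW": 1000,
        "prop": "coordinates|pageimages|pageterms",
        "colimit": limit,
        "piprop": "thumbnail",
        "pithumbsize": 240,
        "pilimit": limit,
    }
    return f"{API_BASE}?{urlencode(params)}"


url = build_nearby_query(37.133, -95.786, 5421)
```

`urlencode` takes care of escaping the `:`, `,` and `|` characters, so the result carries the same parameters as the hand-written query (in a different order).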

In using this, it seems that it would be nice to return results like the summaries returned in the feed endpoints.

Current known use cases:

  1. Provide a location and radius (or bounding box) and return articles sorted by page views
  2. Provide a search query, location and radius (or bounding box) and return articles sorted by relevance/page views
  3. Provide a search query and only return articles with locations sorted by relevance/page views
  4. Provide "an object type", location and radius (or bounding box) and return articles sorted by relevance/page views (e.g. buildings, cities, monuments, etc…)

Breaking this down into a few dimensions:

  1. Return results within a bounding box OR the entire world
  2. Return unfiltered results OR results filtered on a search term or object type
  3. Sorting results by page views
  4. Sorting results by relevance and page views

Some questions:

  1. Is this feasible for RESTBase? Obviously caching is nearly impossible with location queries.
  2. Is it best just to proxy to the MediaWiki API or are there better ways to perform location services in RESTBase?
  3. Can we use Wikidata categories to filter on object type?

Event Timeline

Is this feasible for RESTBase? Obviously caching is nearly impossible with location queries.

Yes, it is feasible. Caching - ye, not that much we could do here.

Is it best just to proxy to the MediaWiki API or are there better ways to perform location services in RESTBase?

We could go directly to ElasticSearch that's powering the query which would avoid some overhead, but that's not critical imho.

Is this feasible for RESTBase? Obviously caching is nearly impossible with location queries.

Yes, it is feasible. Caching - ye, not that much we could do here.

Caching would in theory be possible with enough calculations and restrictions, but it is not as critical for the first version. I do think that, from the user's perspective, getting results back quickly is imperative, so we should explore the possibilities here. I speak from experience: when you are interested in what is around you, it is quite possible that your connection is not the best. For example, you may be in another country or city you don't know well, and having to wait for results is not a good experience.

An idea here could be bounding coordinates to predefined (or pre-computed) boxes, as in if you are in SF, then instead of sending your exact location, the app could send a central location for SF and then sort the results based on your actual distance to the returned articles. This is not the best idea evah, but it may be a good starting point for discussions.
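
A minimal sketch of this idea, assuming a simple fixed latitude/longitude grid rather than per-city boxes. The 0.5-degree cell size, the function names, and the `lat`/`lon` keys on each article are all illustrative assumptions. The client sends the cell center (so nearby users issue the same request) and then re-sorts the returned articles by their true distance.

```python
import math

CELL_DEG = 0.5  # assumed grid size in degrees; would need tuning in practice


def snap_to_cell_center(lat, lon, cell=CELL_DEG):
    """Return the center of the grid cell containing (lat, lon)."""
    return (
        (math.floor(lat / cell) + 0.5) * cell,
        (math.floor(lon / cell) + 0.5) * cell,
    )


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def sort_by_actual_distance(articles, lat, lon):
    """Re-sort shared (cacheable) results by distance from the user's real position."""
    return sorted(articles, key=lambda a: haversine_km(lat, lon, a["lat"], a["lon"]))
```

The trade-off is that the query radius has to be large enough to cover the whole cell, so a user near a cell edge still gets everything around them.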

Is it best just to proxy to the MediaWiki API or are there better ways to perform location services in RESTBase?

We could go directly to ElasticSearch that's powering the query which would avoid some overhead, but that's not critical imho.

+1. We were discussing the idea of bypassing the MW API via RB with Discovery for the search component of MediaWiki.

An idea here could be bounding coordinates to predefined (or pre-computed) boxes, as in if you are in SF, then instead of sending your exact location, the app could send a central location for SF and then sort the results based on your actual distance to the returned articles. This is not the best idea evah, but it may be a good starting point for discussions.

Ye, but that's certainly not something we should consider in the first round. I guess I could just go and implement the proxy endpoint and see what the latency looks like.

Fixed area geometric queries are frequently done using quadtrees. For example, using a mapping like

a b
c d,

the string 'aa' would identify the top-left quadrant of the top-left quadrant. The string 'a' would just be the top-left quadrant (one zoom level up). This format is used very frequently for map tiles, for example in Google Maps. There are libraries that map a coordinate and zoom level to a specific tile coordinate, or a geometric shape to a set of quadtree tiles.

The downside for typical distance queries around a point is the need to make multiple requests (at least for some coordinates) and filtering. The upside is that those requests can be easily cached, and that client libraries are available.
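
To illustrate the a/b/c/d keying described above, here is a small sketch. It assumes a plain subdivision of the latitude/longitude rectangle, not the Web Mercator scheme that map tile servers typically use, and `quadtree_key` is a hypothetical name rather than a function from an existing library.

```python
# Quadrant letters as in the mapping above: a = top-left, b = top-right,
# c = bottom-left, d = bottom-right.
QUADRANTS = {
    (True, True): "a",
    (True, False): "b",
    (False, True): "c",
    (False, False): "d",
}


def quadtree_key(lat, lon, depth):
    """Return the quadtree key of the cell containing (lat, lon) at the given depth."""
    south, north = -90.0, 90.0
    west, east = -180.0, 180.0
    key = []
    for _ in range(depth):
        mid_lat = (south + north) / 2
        mid_lon = (west + east) / 2
        top = lat >= mid_lat
        left = lon < mid_lon
        # Shrink the bounding box to the chosen quadrant.
        if top:
            south = mid_lat
        else:
            north = mid_lat
        if left:
            east = mid_lon
        else:
            west = mid_lon
        key.append(QUADRANTS[(top, left)])
    return "".join(key)


san_francisco = quadtree_key(37.7749, -122.4194, 8)
```

A key at one depth is always a prefix of the key for the same point at a greater depth, which is what makes "one zoom level up" a simple string truncation.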

Digression: The iOS team is actually implementing quadtrees for clustering of the results and displaying them on the map. If any of you are on the internal beta, you can check it out. If you would like to be on internal betas, let me know.

For more detail, here are the current use cases for the maps UI:

  1. Show "popular" articles within X km of the specified location
  2. Show results for a given search term within X km of the specified location
  3. Show autocompletion results for search queries (which include search results within the bounding box AND results outside of the bounding box)

Some general comments:

  • It doesn't make a difference from the client side whether we need to send a location with a radius, or a bounding box… either is fine
  • Sorting is important here… most popular (page views?) and most relevant (closest to search term)
  • It would be nice to update all summaries with geo location data if it is associated with the page.
  • There is no API for autocompleting queries as specified above. See this mock:

search suggested.png (1×750 px, 254 KB)

An idea here could be bounding coordinates to predefined (or pre-computed) boxes, as in if you are in SF, then instead of sending your exact location, the app could send a central location for SF and then sort the results based on your actual distance to the returned articles. This is not the best idea evah, but it may be a good starting point for discussions.

Fixed area geometric queries are frequently done using quadtrees ……The upside is that those requests can be easily cached, and that client libraries are available.

@mobrovac @GWicke does this also work when filtering by a search term? Meaning, is this strategy still efficient when the user is querying for specific results and not just "any page in the tile"? Would you actually cache results for individual search terms?

@Fjalapeno: We can cache anything, but whether it is effective depends on the distribution of queries. If there are only a limited number of popular "search terms" like "object types", for example driven by a dropdown menu, then this could be very effective. Free-form text entry however is likely to be dominated by a long tail, making caching less effective.

@GWicke ok, free form text would be the primary use case here. Not sure if that changes your answer on how well this would fit into RESTBase?

Note: A dropdown is being designed but only for filtering by type (like monuments, or museums) - and I would guess that is mostly blocked by efficiently querying Wikidata (or ingesting it into ElasticSearch)

@Fjalapeno: Even completely uncacheable entry points can still benefit from being exposed in the REST API:

  • Clear documentation & ease of use
  • Performance: Straightforward URL rewriting based on paths allows for a variety of optimization options. Instead of proxying in the PHP API (~50ms minimum overhead), search entry points can be pointed straight to ElasticSearch (if secure enough), a dedicated service with low latency, or a light-weight RESTBase entry point. We can transparently switch between those options at any time.

That said, at large scale even fairly fragmented end points can benefit some from caching, as there tend to be a couple of frequent queries that make up a significant part of overall traffic. It definitely pays to avoid accidental fragmentation, so that whatever caching potential there is can be realized.
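
One way to picture "avoiding accidental fragmentation" is to normalize equivalent requests to a single key before they reach the cache. The sketch below is an illustration only: the two-decimal rounding, the key layout, and the name `canonical_cache_key` are assumptions, not anything RESTBase does.

```python
def canonical_cache_key(lat, lon, radius_km, query=""):
    """Map equivalent requests to one key so they can share a cache entry."""
    # Assumed precision: two decimal places (roughly 1 km of latitude).
    lat_r = round(lat, 2)
    lon_r = round(lon, 2)
    # Ignore case and whitespace differences in the search term.
    normalized_query = " ".join(query.lower().split())
    return f"{lat_r:.2f}/{lon_r:.2f}/{int(radius_km)}/{normalized_query}"


key_a = canonical_cache_key(37.7749, -122.4194, 10, "  Golden  Gate ")
key_b = canonical_cache_key(37.7712, -122.4187, 10.0, "golden gate")
```

Both calls produce the same key even though the raw coordinates and the query spelling differ slightly.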

@GWicke Cool - I had much the same feeling about the benefits of using RESTBase here even with it being less cacheable than most of our other endpoints.

Next questions:

  1. Would you prefer this endpoint(s) live in RESTBase or in the MCS?
  2. The MW API provides a means to specify both the number of results and to be able to paginate results - have you done this in RESTBase yet?

Would you prefer this endpoint(s) live in RESTBase or in the MCS?

I don't know much about how you plan to implement this. The documentation should be exposed as part of the normal REST API docs. Whether the requests themselves should pass through RESTBase or not depends on a lot of factors. For example, if your backend is ElasticSearch, but you don't want to allow raw ES access for security reasons, then a very thin RB wrapper that restricts the query could make sense. If you have some other backend in mind, or need to do heavy processing on the ES result, then another service (not necessarily MCS) might make more sense.

The MW API provides a means to specify both the number of results and to be able to paginate results - have you done this in RESTBase yet?

We haven't done the former (to avoid fragmentation of whatever caching is possible), but have done the latter. Again, how this actually works in practice depends a lot on your backend.

The backend would be ElasticSearch - since we have the geo data in there already.

No heavy processing… just merging the summaries in.

@GWicke I added this to the Sync doc for the next meeting as well

Current known use cases:
Provide a location and radius (or bounding box) and return articles sorted by page views
Provide a search query, location and radius (or bounding box) and return articles sorted by relevance/page views
Provide "an object type", location and radius (or bounding box) and return articles sorted by relevance/page views (e.g. buildings, cities, monuments, etc…)

So, seems like something like this could work: /search/location/{type}/{sort}/{lat}/{lon}/{radius}{/searchquery} ?

Or, since we're not really caching this, have default sort and type parameters and move them to the query? Like /search/geo/{lat}/{lon}/{radius}{/searchquery}?sort=pageviews&filter=monuments

Or, if we adopt quadtrees for the request, /search/location/{quadtree}{/searchquery}?sort=pageviews&filter=monuments
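
As a sketch of how a client might fill in the second template, `/search/geo/{lat}/{lon}/{radius}{/searchquery}`: this is a proposal from this thread, not a deployed endpoint, the helper name `geo_search_path` is hypothetical, and the radius unit is left unspecified here because the proposal does not state one.

```python
from urllib.parse import quote, urlencode


def geo_search_path(lat, lon, radius, search_query=None, sort="pageviews", filter_type=None):
    """Build a request path for the proposed /search/geo/... template."""
    path = f"/search/geo/{lat}/{lon}/{radius}"
    if search_query:
        # Optional trailing segment; percent-encode everything, including "/".
        path += "/" + quote(search_query, safe="")
    params = {"sort": sort}
    if filter_type:
        params["filter"] = filter_type
    return f"{path}?{urlencode(params)}"


example = geo_search_path(37.133, -95.786, 10000, "art museum", filter_type="monuments")
```

Keeping `sort` and `filter` in the query string leaves the path stable for the common case and makes the defaults easy to omit.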

Provide a search query and only return articles with locations sorted by relevance/page views

This is actually a little bit different from the rest: it's not necessarily a geo-search endpoint, it's more like a 'generic search' endpoint returning summaries. Since we've included coords in summaries, that could be achieved by simple hydration. Or should it filter out only the results with summaries?
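
A rough sketch of what "simple hydration" could look like, assuming search hits are dicts with a `title` and summaries are available as a title-to-summary mapping with an optional `coordinates` field. These data shapes and the `hydrate` helper are assumptions made for illustration.

```python
def hydrate(search_hits, summaries, require_coordinates=False):
    """Attach a summary to each search hit, preserving the search ranking order."""
    hydrated = []
    for hit in search_hits:
        summary = summaries.get(hit["title"])
        if summary is None:
            continue  # no summary available for this title
        if require_coordinates and "coordinates" not in summary:
            continue  # optionally keep only results that have a location
        hydrated.append({**hit, "summary": summary})
    return hydrated
```

With `require_coordinates=True` this covers the "only return articles with locations" use case without a separate geo query; with the default it behaves as a generic search that returns summaries.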

A use case we are working on is documenting a cemetery and connecting Wikipedia articles to the graves. The idea is to make a visit to a cemetery much more interesting and to make it easier to use Wikipedia articles based on location. My feeling is that today's concept of primary coordinates needs to be developed further to better support different types of coordinates.

We are documenting the Northern Cemetery in Stockholm (d:Q252312); see blog wiki-map-supports-all-types-of. More than 1000 graves have geocoordinates in Wikidata and Swedish Wikipedia articles, e.g. Alfred Nobel Q23810#P119/P625. We then use a template to "transform" the grave coordinates to the Swedish Wikipedia article.

--> we get 1000+ geocoded articles using secondary coordinates in a small area, with many people having the same coordinates (they are in the same grave)

The use cases we are thinking about:
A) as a user visiting the cemetery, I would like to find the most famous graves (by page views or number of articles about a person in different languages), and not get buildings in the cemetery that have an article
B) as a user visiting the cemetery, I would like to read more about the graves near me
C) as a user, I would like to find graves of people who are e.g. famous singers

Lesson learned

  1. Apps like V for Wiki fetch only primary coordinates
    1. This is a problem for an article about a person: what is the primary coordinate? The grave? I guess not.
    2. Wiki-maps had the same problem but changed the implementation to use gsprimary=all
  2. The concept of primary coordinates needs more thought. Today you can't have more than one primary coordinate in an article, and I think a template transforming coordinates from Wikidata to Wikipedia can't tell whether several coordinates are primary in the article.
    1. What is a primary coordinate for a person?
      1. The birth place?
      2. the grave? (his current location ;-) )
      3. Famous building he has built....?
      4. ......

Next step

We need articles that contain multiple coordinates of different types, and we need to be able to filter on the type ==>

  1. a person can have
    1. a grave coordinate
    2. a coordinate for a famous building he/she has created,...
    3. a coordinate for the house where he/she grew up
    4. a coordinate for the museum about the person...
  2. a painting can have
    1. one coordinate where the painting is found in a museum
    2. one coordinate is the location of what we see on the painting.....

--> filtering must be done on the coordinate type not on the article categories....
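
To make the proposal concrete, here is a sketch of a data model in which each coordinate carries its own type. The class, the field names, and the type labels ("grave", "birthplace", and so on) are hypothetical and do not correspond to an existing MediaWiki or Wikidata schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedCoordinate:
    article: str
    kind: str  # hypothetical label, e.g. "grave", "birthplace", "museum"
    lat: float
    lon: float


def filter_by_kind(coordinates, kind):
    """Keep only coordinates of the requested type, regardless of article category."""
    return [c for c in coordinates if c.kind == kind]


def articles_with_kind(coordinates, kind):
    """Return the distinct article titles that have a coordinate of this type."""
    return sorted({c.article for c in filter_by_kind(coordinates, kind)})
```

Because the filter runs on the coordinate rather than on the article, a person's article can appear in a "graves near me" result without also matching on a birthplace or museum coordinate elsewhere.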

GWicke triaged this task as Medium priority. Aug 8 2017, 9:11 PM

There was neither movement on implementing it nor further requests from audiences. I don't think it's needed anymore.