
Returning a list of summaries given a list of list of titles
Closed, Invalid · Public

Description

While returning one summary at a time is useful and RESTful, there are many situations where clients need to retrieve a list of summaries efficiently.

Usually this happens when pages are retrieved from a PHP API (like morelike, location search, and soon Reading Lists).

In order to make RESTBase summaries more useful, it would be a huge benefit if clients could pass the titles returned by these APIs and receive the summaries back as an array (potentially paging large result sets).

While not as RESTful, it is practical and helps us work within the complexity of the MediaWiki / RESTBase ecosystem and slowly rely on RESTBase more in situations where it shines (like returning normalized, consistent summaries of pages).

One additional constraint: while fetching a list from a single project would be an improvement, in order to support the Reading Lists use case we would need to be able to fetch a list of summaries from multiple projects.

While clients could split this work up and call the project APIs individually, it would be much easier if this could be done server side.
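Absent a bulk endpoint, a client has to fan out one summary request per title, per project. A minimal sketch of that fan-out, assuming the existing per-page `/page/summary/{title}` endpoint (the `summary_urls` helper and the example pages are hypothetical):

```python
from collections import defaultdict
from urllib.parse import quote

def summary_urls(items):
    """Build one per-page REST summary URL per (project, title) pair.

    `items` is a list of (project_domain, title) tuples, e.g. the pages
    a Reading List might span across several wikis.
    """
    by_project = defaultdict(list)
    for project, title in items:
        by_project[project].append(title)
    return {
        project: [
            f"https://{project}/api/rest_v1/page/summary/{quote(title, safe='')}"
            for title in titles
        ]
        for project, titles in by_project.items()
    }

pages = [
    ("en.wikipedia.org", "Coffee"),
    ("en.wikipedia.org", "Tea"),
    ("de.wikipedia.org", "Kaffee"),
]
print(summary_urls(pages))
```

With a bulk, multi-project endpoint, all of this fan-out (and the corresponding request count) would move server side.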

Event Timeline

There are multiple levels of reasoning why this will probably not happen:

  1. Fetching by page_id is hard. Summaries, like all the rest of the content in RESTBase, are indexed by title, and secondary-indexing them by page_id is quite some work (performance-wise, not implementation work). Also, fetching something by a secondary index effectively means querying 2 tables in Cassandra instead of one, thus almost doubling the latency. Last but not least, with the new storage design we're working on we might get rid of secondary indexes altogether.
  2. Supporting bulk access will effectively eliminate Varnish caching for those requests. In RESTBase each summary will still be queried individually, so the overall latency for the client will probably increase compared to making individual requests.
  3. With HTTP/2 the connections are shared, so making a lot of small requests shouldn't be an issue.
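The HTTP/2 point can be illustrated with a stubbed concurrent fetcher (the `fetch_summary` stub and the titles are hypothetical; no real network calls are made). Coroutines multiplexed on one event loop behave much like HTTP/2 streams multiplexed over one shared connection: the small requests overlap rather than queue:

```python
import asyncio

async def fetch_summary(title):
    # Stub standing in for an HTTP GET of /page/summary/{title};
    # the sleep simulates network latency.
    await asyncio.sleep(0.01)
    return {"title": title, "extract": f"Summary of {title}"}

async def fetch_all(titles):
    # One coroutine per title; gather() runs them concurrently and
    # preserves the input order in its results.
    return await asyncio.gather(*(fetch_summary(t) for t in titles))

summaries = asyncio.run(fetch_all(["Coffee", "Tea", "Cocoa"]))
print([s["title"] for s in summaries])
```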

We can try to prototype this and measure the performance, but I doubt the results would be appealing.

[...] it would be a huge benefit if clients could pass the page ids (or titles) returned by these APIs and then receive the summaries as an array (potentially paging large result sets)

I may be misinterpreting the ticket, but it seems to me that the only huge benefit here is that clients wouldn't need to loop through the result set obtained from MW. In general, I understand your argument. Ideally, it would be possible to have the REST API return results based on an arbitrary (but unique) field. Unfortunately, the reality of the situation is that it comes at a great cost (on all levels)...

We can try to prototype this and measure the performance, but I doubt the results would be appealing.

I don't see how this would fit in the new schema design plan either, so I wouldn't be keen on trying this, TBH.

Sorry, let me descope this a bit here:

Let's focus on the list of titles rather than page IDs (that is a secondary concern). Sorry to conflate the issues.

Fjalapeno renamed this task from Returning a list of summaries given a list of page ids. (or a list of titles) to Returning a list of summaries given a list of list of titles.Apr 28 2017, 5:47 PM
Fjalapeno updated the task description.

@mobrovac @Pchelolo so now this is just about returning the list…

I may be misinterpreting the ticket, but it seems to me that the only huge benefit here is that clients wouldn't need to loop through the result set obtained from MW.

Yes, that's the main idea here. It's much better for clients (especially mobile clients) to get 25 results from MW and then be able to get all 25 of those summaries from RESTBase in one call.

In this case the client would only need to make 2 calls, rather than the current n + 1 (here, 26) calls.

Supporting bulk access will effectively eliminate Varnish caching for those requests. In RESTBase each summary will still be queried individually, so the overall latency for the client will probably increase compared to making individual requests.

That's what I was afraid of. This is going to make fetching summaries for Reading Lists a bit of a bear for clients. It's possible, it's just a lot of requests to different endpoints. But if that's the way it has to be, that's the way it has to be.

With HTTP/2 the connections are shared, so making a lot of small requests shouldn't be an issue.

On the other hand, having to piece together the list from requests to a dozen different domains does have a cost. At a minimum, extra DNS lookups will have to be done, and connection sharing between multiple domains seems to be problematic as well.

Clearly we don't want Varnish to store a copy of the same article for every domain, so I wonder if it's possible to do some creative rewriting? Say, set up something like wikimedia.org/api/shared/en.wikipedia.org/rest_v1 and then have Varnish rewrite it to en.wikipedia.org/api/rest_v1 before doing the cache lookup.
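The proposed rewrite could look something like the sketch below, expressed in Python rather than VCL for clarity (the `/api/shared/` prefix is the invented path from the proposal above, not an existing route):

```python
def rewrite_shared_path(url_path):
    """Rewrite a hypothetical shared-domain path such as
    /api/shared/en.wikipedia.org/rest_v1/page/summary/Foo
    into (host, per-project path), mimicking what a Varnish rewrite
    rule might do before the cache lookup so all domains share
    one cached copy per article."""
    prefix = "/api/shared/"
    if not url_path.startswith(prefix):
        return None  # not a shared-API request; leave it alone
    rest = url_path[len(prefix):]
    host, _, tail = rest.partition("/")
    return host, "/api/" + tail

print(rewrite_shared_path(
    "/api/shared/en.wikipedia.org/rest_v1/page/summary/Coffee"))
```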

Reading lists will be user-centric, so perhaps the service can assemble those on the back-end?

Reading lists will be user-centric, so perhaps the service can assemble those on the back-end?

+1. This also avoids the DNS latency issues. Server side caching should not be a major concern for per-user reading lists.
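Server-side assembly could be sketched roughly like this (the `assemble_reading_list` helper and the stub fetcher are hypothetical; in a real service the fetcher would be an internal RESTBase call):

```python
def assemble_reading_list(entries, fetch_summary):
    """Server-side assembly: given (project, title) entries and a
    per-page fetcher, return one combined payload for the client."""
    return [
        {"project": project, "title": title,
         "summary": fetch_summary(project, title)}
        for project, title in entries
    ]

# Stub fetcher standing in for an internal per-page summary lookup.
def fake_fetch(project, title):
    return {"extract": f"Summary of {title} on {project}"}

result = assemble_reading_list(
    [("en.wikipedia.org", "Coffee"), ("de.wikipedia.org", "Kaffee")],
    fake_fetch,
)
print(len(result))
```

The client then makes a single request per list, and DNS lookups and per-domain connections stay inside the data center.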

Closing, as we are going to do this without a public API.