
RFC: Agree on feed endpoints
Closed, ResolvedPublic

Description

We would like to have one or more endpoints for feeds, to be used initially by the mobile apps. See T132340 for details.

Current thinking:

https://{domain}/api/rest_v1/

  • feed/featured/{year}/{month}/{day}: An aggregating endpoint composed of the ones below, which are the non-user-specific portions of the Explore feeds in the apps.
  • page/featured/{year}/{month}/{day}: The "Article of the day"
  • page/news/{year}/{month}/{day}: "In the news" entries.
  • page/random/title: One random article.
  • media/image/featured/{year}/{month}/{day}: The "Picture of the day"
  • media/video/featured/{year}/{month}/{day}: The "Video of the day"

All by-date entry points return "this date or earlier" content, as available in storage. Additionally, results will contain a link to the previous date's content, based on date arithmetic. In combination, this makes sure that

  • clients can request latest content by asking for the current UTC date, and
  • clients can efficiently page backwards, skipping over gaps in content.
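A minimal sketch of how a client might build these by-date URLs, assuming zero-padded month and day segments (as in the /featured/2016/01/01 example used elsewhere in this discussion). The helper names feed_url, current_utc_date and previous_day are hypothetical and not part of any existing client library.

```python
from datetime import date, datetime, timedelta, timezone


def feed_url(domain: str, day: date, endpoint: str = "feed/featured") -> str:
    """Build a by-date URL under /api/rest_v1/ (zero padding is an assumption)."""
    return (
        f"https://{domain}/api/rest_v1/{endpoint}/"
        f"{day.year:04d}/{day.month:02d}/{day.day:02d}"
    )


def current_utc_date() -> date:
    """The date a client would request to get the latest content."""
    return datetime.now(timezone.utc).date()


def previous_day(day: date) -> date:
    """Plain date arithmetic, as used for the link to the previous date."""
    return day - timedelta(days=1)
```

For example, feed_url("en.wikipedia.org", current_utc_date()) would request the latest aggregated feed, and previous_day gives the date to request when paging backwards.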

Event Timeline

bearND added a comment.EditedApr 29 2016, 6:09 PM

@mobrovac I've split out the random pages feed endpoint into T132811. Let's continue discussion about random article implementation there.

We discussed this today in the Services / Reading sync-up. For the top-level hierarchy, options considered were: 'feed', 'project', 'wiki', 'content', and 'read'.

The front-runner so far is project, mostly as it is general enough to accommodate all the project-global entry points under discussion here (including random), but also distinct enough from other existing top-level hierarchies.

This would mean that we'll have the following top-level hierarchies:

  • page
  • media
  • transform
  • metrics
  • project

Concretely, for enwiki the base URL would be https://en.wikipedia.org/api/rest_v1/project/, with sub-entrypoints (possibly) along these lines:

  • today/feed <- aggregated endpoint
  • today/article
  • today/picture
  • today/news
  • random

@mobrovac @Pchelolo @Eevans @Anomie @aaron @dr0ptp4kt @bearND @Mholloway @Fjalapeno @Mhurd @cscott @Arlolra @ssastry: Does the /project/ hierarchy make sense to you? We would like to make a decision some time next week, so please speak up if you have other proposals or concerns about this.

bearND triaged this task as High priority.May 13 2016, 6:11 AM
bearND updated the task description. (Show Details)
bearND moved this task from Backlog to Doing on the Mobile-Content-Service board.

Makes sense as long as we can aggregate cross-wiki things (e.g., fold in Commons picture of the day to the aggregated feed).

bearND updated the task description. (Show Details)May 13 2016, 8:48 PM

I think it doesn't hurt to be domain specific since we also want to be able to use the same domain for all requests, which means fewer TLS handshakes for the client.

While having the same URL to the picture, POTD would still come with different captions for different languages. So, I think having the endpoint for it to be domain specific is actually a good thing.

GWicke proposed random/article, which I like. So, I'm updating this task description accordingly.

bearND updated the task description. (Show Details)May 13 2016, 9:33 PM

I'll cast my vote for feed since that seems least surprising to me. I would also like to suggest some brainstorms for the aggregate structure (below).

Thank you for soliciting feedback!

Concretely, for enwiki the base URL would be https://en.wikipedia.org/api/rest_v1/project/, with sub-entrypoints (possibly) along these lines:

  • today/feed <- aggregated endpoint
  • today/article
  • today/picture
  • today/news
  • random

Would specific dates also be supported in the today field? For example, 2016-05-01?

Conceptually, project seems redundant to me in that by using a URI of the form //{domain}/api/rest_v1/ we are already in the realm of the project identified by its domain. However, we should aggregate the feeds under the same umbrella, as otherwise we run the risk of top-level hierarchy proliferation.

latest perhaps? (I'm not really convinced it fits, but seems somewhat better than project)

dr0ptp4kt added subscribers: JMinor, JKatzWMF.EditedMay 14 2016, 3:29 PM

Maybe explore?

@MZMcBride, good question. @JMinor, @Dbrant, @JKatzWMF: can you conceive of a likely case where the apps or, in some potential future state where we exposed the facilities via template/tag/magic word, web would need to express the date to one or more of the microservices or aggregate service to look back in time? I guess we could do it on a case by case basis (e.g., POTD) and otherwise defer on support for that if we don't need it yet, as presumably it would be appended to the URL RESTfully.

@coreyfloyd, has the iOS app needed to provide the api.php endpoints date info in the past? Would you like to avoid it doing so in the future? Do the new micro services and aggregated service need to provide date data in the response or is that a concern managed best at the client?

@mdholloway how were you guys thinking about that on Android? As it stands, we maintain persisted state about previous days' fetches at the client in iOS. I think we'd replicate that for Android for the foreseeable future. Auth'd key-value store mechanisms aren't even yet in RFC.

Coming back to the web, I assume similar mechanisms of previous feed reading in some sort of feed mechanism would most easily be done in LocalStorage (perhaps in a PWA?) if some day down the road we see this as viable there.

As far as naming I'm pretty ambivalent.

@Niedzielski I think we should avoid the word feed as these endpoints are more general than that. Random need not be used in the feed, etc

As for date-specific stuff: yes, we do provide date info to the featured article and POTD APIs, and would like to be able to pass dates for past info to the new endpoints. A range of days would seem natural (much like the page view API).

@Fjalapeno, hm, maybe random is a corner case? All our work this quarter is based around the "feeds" idea. We even call it "the feed" on iOS. As a client, it's very intuitive to me for services to use a similar name. "project" is just too generic. If random is a sticking point, maybe we could put that under https://en.wikipedia.org/api/rest_v1/page/random?

Mholloway added a comment.EditedMay 16 2016, 2:06 PM

@mdholloway how were you guys thinking about that on Android? As it stands, we maintain persisted state about previous days' fetches at the client in iOS. I think we'd replicate that for Android for the foreseeable future. Auth'd key-value store mechanisms aren't even yet in RFC.

Right. We haven't really gotten into implementation details yet but the idea is that we'd persist some reasonable amount of feed data on the client and update upon starting the feed activity. (I don't think this would be a use case for "cloud" syncing in the proposed key-value store.)

I agree that we'll want to be able to specify date params to the service, or, as an alternative, a number of days/items to go back in time—e.g., today/picture/last/5—though it would probably be more flexible to specify dates outright.

GWicke added a comment.EditedMay 16 2016, 4:25 PM

@Niedzielski's proposal to move random to page/ got me thinking about integrating more of these end points into existing hierarchies. Here is an alternative scheme exploring this:

  • page/of_the_day/today
  • page/in_the_news/today
  • page/random/article
  • media/of_the_day/today

By moving the time selector last, the *_of_the_day end points could quite easily support specific days, as proposed by @MZMcBride.

This would still leave the aggregated mobile 'feed', which doesn't seem to fit into any of the existing 'type of content / functionality' focused hierarchies. Looking ahead, it seems likely that we'll expose more global change feeds (like recentchanges & watchlists), but also more kinds of aggregations. Most aggregations will be focused on latest data though, so arguably can be described as a feed as well.

So, perhaps it would make sense to introduce a feed top-level hierarchy after all:

  • feed/mobile

Compared to more use case focused proposals like explore, this would keep the top level hierarchies focused on the type of data or functionality.

@Fjalapeno, hm, maybe random is a corner case? All our work this quarter is based around the "feeds" idea. We even call it "the feed" on iOS. As a client, it's very intuitive to me for services to use a similar name. "project" is just too generic.

IMHO, feeds is not a high-level enough concept to be promoted to the top level of the hierarchy (compare with, e.g., page or media). That is not to say that the feeds are not important enough, mind you.

If random is a sticking point, maybe we could put that under https://en.wikipedia.org/api/rest_v1/page/random?

Yup, I was thinking about the same thing, +1. It makes more sense to me to put it under /page/.

@mdholloway how were you guys thinking about that on Android? As it stands, we maintain persisted state about previous days' fetches at the client in iOS. I think we'd replicate that for Android for the foreseeable future. Auth'd key-value store mechanisms aren't even yet in RFC.

Right. We haven't really gotten into implementation details yet but the idea is that we'd persist some reasonable amount of feed data on the client and update upon starting the feed activity. (I don't think this would be a use case for "cloud" syncing in the proposed key-value store.)

I agree that we'll want to be able to specify date params to the service, or, as an alternative, a number of days/items to go back in time—e.g., today/picture/last/5—though it would probably be more flexible to specify dates outright.

Hmmm. Using start and end dates fragments the cache heavily. Sending multiple requests with exact dates would greatly help. But that would mean that when a user doesn't open the App for a while, the App would need to make as many requests as they've been absent (up to a limit, though, I hope). That would allow us to cache responses for exact dates more efficiently and it shouldn't increase the latency in practice because the App can make async requests and compose the page as the response arrives (no need to wait for all of the responses to come right away, nor to send all of them at all for that matter).
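As a rough illustration of the per-day approach described above (one cacheable request per exact date instead of a start/end range), here is a sketch of how a client could list the URLs to fetch after an absence. The function name catch_up_urls, the default limit of 7 and the use of the feed/featured path from the task description are assumptions made for the example.

```python
from datetime import date, timedelta
from typing import List


def catch_up_urls(domain: str, last_seen: date, today: date, limit: int = 7) -> List[str]:
    """One exact-date URL per missed day, newest first, capped at `limit`."""
    days_missed = (today - last_seen).days
    count = max(0, min(days_missed, limit))
    urls = []
    for offset in range(count):
        day = today - timedelta(days=offset)
        # Each URL names a single date, so every response can be cached on its own.
        urls.append(f"https://{domain}/api/rest_v1/feed/featured/{day:%Y/%m/%d}")
    return urls
```

The client could issue these requests asynchronously and render each day as its response arrives, or only fetch older days once the user scrolls far enough.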

  • page/of_the_day/today
  • page/in_the_news/today
  • media/of_the_day/today

@GWicke, that looks good. To @MZMcBride's point, can today be substituted with an arbitrary date? For example, the date command line utility can understand date -dtoday, date -d2001-02-03, etc. Another example is Git which understands ranges like git log @{2016-05-01}..@{now} but that might not fit in with @mobrovac's point about caching.

feed/mobile

I'm sorry to be so nit picky but is the word mobile needed? It seems a small thing but I would like to make a humble request that we never use the word mobile. My limited understanding is that we should always try to build universal services that work for desktop or mobile. I'm not sure how practical that always is but, as a supporting example, I think all the work done for Wiktionary could be used anywhere. As a client, specifying that a service is "mobile" is confusing to me because I'm not sure if I should be using MW API for desktop or searching for other platforms like feed/android.

To @MZMcBride's point, can today be substituted with an arbitrary date?

Yes, we could accept an ISO 8601 date like 2016-01-03, or 2016-01, if we wanted to offer monthly summaries. We could even get rid of today completely, and ask clients to pass in the current date. However, most wikis actually use LOCALDAYNAME to switch *_of_the_day content, which means that for example the German Wikipedia is changing this content at midnight in Berlin. Clients would need to know the right wiki timezone in order to figure out the correct date, which is easy to get wrong. So, I think we should keep today functionality.

Although this is a technical discussion, my $0.02 on the product side is to second @GWicke proposal. Although these end points are being used to construct the future Explore feed on Android, the individual components (like featured article by day) could have many potential uses for all sorts of clients.

GWicke added a comment.EditedMay 16 2016, 7:04 PM

I'm sorry to be so nit picky but is the word mobile needed?

Agreed, /feed/mobile can likely be improved upon. Suggestions welcome ;)

Edit: Perhaps /feed/today or /feed/featured?

@GWicke, I fear I'm missing an obvious point but could we just call it /aggregate, /feed, or /feeds? In other words, just pick something close to what it does but drop the word mobile? I think @mobrovac expressed some concern about putting feeds at the top level but I'm not sure it really fits under anything. Maybe /pages/aggregate (pages being plural) would satisfy? @JMinor, is there a word more appropriate than feed or aggregate?

@Niedzielski, while this feed is the first one, it is unlikely to be the only one. This is why I think introducing a top-level feed hierarchy can make sense. We still need to name this feed in a way that avoids confusion with others, like recentchanges or watchlists.

Just to clarify, I think feed makes sense for the aggregation point; it's just a matter of keeping the individual components in the page or other existing spaces.

I like the way this is shaping up. +1 to individual item endpoints with a (single) date param that can understand either today or an ISO 8601 date.

Perhaps the aggregated endpoint could be /feed/explore, since I think 'Explore' is the feature branding we're going with for the feed?

  • page/of_the_day/today

s/of_the_day/daily/ ? of_the_day seems too verbose to me.

  • page/in_the_news/today

Idem. Go with just news ?

  • page/random/article

The word article shouldn't be used as it's the name of a page in the main namespace on enwiki. In principle we could just drop it.

  • media/of_the_day/today

s/of_the_day/daily/ :P

By moving the time selector last, the *_of_the_day end points could quite easily support specific days, as proposed by @MZMcBride.

We could offer today as an alias for the current date, but in principle we should deal with dates if we plan to support getting content for a day other than today.

This would still leave the aggregated mobile 'feed', which doesn't seem to fit into any of the existing 'type of content / functionality' focused hierarchies. Looking ahead, it seems likely that we'll expose more global change feeds (like recentchanges & watchlists), but also more kinds of aggregations. Most aggregations will be focused on latest data though, so arguably can be described as a feed as well.

Good point. We can indeed turn it into an aggregate feed endpoint.

Yes, we could accept an ISO 8601 date like 2016-01-03, or 2016-01, if we wanted to offer monthly summaries. We could even get rid of today completely, and ask clients to pass in the current date. However, most wikis actually use LOCALDAYNAME to switch *_of_the_day content, which means that for example the German Wikipedia is changing this content at midnight in Berlin. Clients would need to know the right wiki timezone in order to figure out the correct date, which is easy to get wrong. So, I think we should keep today functionality.

I think we should leave that up to the client to decide or resolve. IMHO, the Berliner in your story shouldn't see (for him) yesterday's feed because our server hasn't switched the date yet. On the other hand, if he moved to the US and kept consuming the dewiki feed, he should see it relative to his new TZ. The premise here is, of course, that we can produce feeds for future dates (as observed by the server).

Perhaps the aggregated endpoint could be /feed/explore, since I think 'Explore' is the feature branding we're going with for the feed?

While I find @GWicke's suggestion /feed/featured somewhat better, I'd go with /feed/explore too simply because featured seems too broad to me and could create confusion. In my mind /feed/collection would be best, but that is a no-go since we have the Collections extension.

Agreed that the paths could be a little less verbose.

  • page/of_the_day/today

s/of_the_day/daily/ ? of_the_day seems too verbose to me.

page/featured/[today|{date}]? (As the section is called today's featured article, at least on enwiki)

  • page/in_the_news/today

Idem. Go with just news ?

+1

  • page/random/article

The word article shouldn't be used as it's the name of a page in the main namespace on enwiki. In principle we could just drop it.

+1

  • media/of_the_day/today

s/of_the_day/daily/ :P

'featured' could also work here.

I really like the featured idea

  • /page/featured/{today|date}
  • /page/news/{today|date}
  • /page/random
  • /media/featured/{today|date}

So, +1 from me to these options.

GWicke added a comment.EditedMay 17 2016, 5:54 PM

The word article shouldn't be used as it's the name of a page in the main namespace on enwiki. In principle we could just drop it.

The idea behind /random/article was to indicate that only articles (main namespace) were desired, but I think we could also add query parameters to a plain /random entrypoint as well, especially if we don't make it cacheable anyway. Most use cases will want articles, and not random pages. So, +1 from me on /page/random.

Some more thoughts:

  • I'm also +1 on /{page,media}/featured. Shorter and clearer.
  • I fear that /page/news/ might be a bit cryptic to users not familiar with the end point; /page/in_news/ would perhaps be clearer, without being overly verbose either.
  • On /feed/featured vs. /feed/explore, I think I'm still leaning towards featured. For third party API users (who don't know / care about project names etc), this seems to provide better guidance for what this feed contains, at least right now. explore is general enough to describe most feeds & featured content, which makes it less useful to distinguish end points.

Overall, it looks like we all agree on having most entry points in page / media, and adding a single feed entry point. The remaining questions all seem to be about the detailed path names, but I'm sure we can agree on those in short order as well.

A lot of discussion!

I think any of /project/... or /{page,media}/featured work well.

Keep in mind that there is video and picture of the day, so we can potentially serve them separately with /media/picture/... or /media/video/....

Also, from my own experiences wrapping the potd templates, a good API I came up with that was flexible and usable was using /endpoint/YYYY[/MM[/DD]] and having aliases like /endpoint/latest.

This way if you queried /media/featured/2016 you can get the list of featured media of the year, and /media/featured/2016/03 the ones from March.

This highly curated high quality content I think deserves more flexible endpoints than just today, so a big +1 to add the dates. I'm thinking about all the cool interfaces you could do with those endpoints.

The namespacing by YYYY/MM/DD provides a very natural interface to the grouping of this media, which is hierarchical by the nature of its classification (by date).

Same applies for /page/{featured, news}/YYYY/MM/DD.

GWicke renamed this task from Agree on endpoints for feeds to RFC: Agree on endpoints for feeds.May 18 2016, 8:10 PM
GWicke added a project: TechCom-RFC.

The word article shouldn't be used as it's the name of a page in the main namespace on enwiki. In principle we could just drop it.

The idea behind /random/article was to indicate that only articles (main namespace) were desired, but I think we could also add query parameters to a plain /random entrypoint as well, especially if we don't make it cacheable anyway. Most use cases will want articles, and not random pages. So, +1 from me on /page/random.

The implementation we have in mind is not just a simple random but a more sophisticated one which would try to pick better results by favoring articles with lead images, descriptions, and longer text extract. It also removes disambiguation pages. @Mhurd has started a patch which ports the iOS app implementation used for the random entry in the iOS app feed and makes some enhancements to it. I'm wondering if we should add another level after /page/random/ but don't have a great name that doesn't sound subjective, like /page/random/interesting.

Some more thoughts:

  • I'm also +1 on /{page,media}/featured. Shorter and clearer.

+1 from me, too

  • I fear that /page/news/ might be a bit cryptic to users not familiar with the end point; /page/in_news/ would perhaps be clearer, without being overly verbose either.

I think /page/news/ is just fine, and not too cryptic.

  • On /feed/featured vs. /feed/explore, I think I'm still leaning towards featured. For third party API users (who don't know / care about project names etc), this seems to provide better guidance for what this feed contains, at least right now. explore is general enough to describe most feeds & featured content, which makes it less useful to distinguish end points.

Either one sounds good to me. I could also throw /feed/daily into the mix, and I'm open to a different top-level entry than /feed if we find a better one. I hope this doesn't get confused with Atom feeds.

Overall, it looks like we all agree on having most entry points in page / media, and adding a single feed entry point. The remaining questions all seem to be about the detailed path names, but I'm sure we can agree on those in short order as well.

By detailed path names, do you mean the last portion, which lets you specify the day for which you want to see the daily feed? That one is still a bit murky.

I'm thinking for the apps at least it would be nice not to have to calculate dates/times when building the requests, at least for the initial aggregating endpoint. Then it wouldn't have to know at what time the feed content changes. So, we definitely need a today or latest. Today's feed might get updated several times a day, esp. the "In the news" portion.

When the app fetches the explore feed it would first ask for latest daily feed content. Once the user scrolls down to a point near the end of the list the app would fetch the previous day's daily feed content, and so on. If the service includes a link to a URI portion which is used to get to the previous day's feed content, the client could easily go back in time one day at a time, which I think is sufficient for our use case. No date/time manipulations needed on the client. Just pass around strings. This is somewhat similar to the _links.next.href I see in existing RESTBase endpoints. (Hypermedia FTW!) We probably want to call it prev or previous instead of next since it's going backwards in time instead of forward. @GWicke, is this linking of results something that could be provided by RESTBase?
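The link-following idea above could look roughly like this on the client. This is only a sketch: the response shape (a _links.previous.href field) is the hypothetical one proposed in this comment, and fetch is a stand-in for whatever HTTP call the app uses.

```python
from typing import Callable, Iterator, Optional


def iter_feed_pages(
    fetch: Callable[[str], dict], start_url: str, max_pages: int = 7
) -> Iterator[dict]:
    """Yield feed responses, following _links.previous.href back in time."""
    url: Optional[str] = start_url
    for _ in range(max_pages):
        if url is None:
            break
        page = fetch(url)
        yield page
        # No date arithmetic on the client: the next URL is passed around as a string.
        url = page.get("_links", {}).get("previous", {}).get("href")
```

Because this is a generator, the app would only trigger the next request when the user scrolls near the end of the list and the next page is actually consumed.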

From a server/cache perspective, having the date as part of the URI has the advantage that we could cache responses of previous days for a very long time. The drawback is that we need to have a clear definition of when a day starts (IOW, which time zone?), so that clients know what to expect. UTC for every wiki or not? While UTC is very convenient, @GWicke's mention of most wikis actually use LOCALDAYNAME to switch *_of_the_day content makes me think that this is not a given.

Also, from my own experiences wrapping the potd templates, a good API I came up with that was flexible and usable was using /endpoint/YYYY[/MM[/DD]] and having aliases like /endpoint/latest.

This way if you queried /media/featured/2016 you can get the list of featured media of the year, and /media/featured/2016/03 the ones from March.

If we wanted to expose featured of the month or year we could alternatively make the date portion flexible enough to allow just YYYY and YYYY-MM in addition to YYYY-MM-DD.

/endpoint/YYYY[/MM[/DD]]

If we wanted to expose featured of the month or year we could alternatively make the date portion flexible enough to allow just YYYY and YYYY-MM in addition to YYYY-MM-DD.

Totally. I just think it looks better with the path namespacing, but it is personal preference and just aesthetics.

/media/featured/2016/03/22    /media/featured/2016-03-22
/media/featured/2016/03       /media/featured/2016-03
/media/featured/2016          /media/featured/2016

FWIW I like the path namespacing better too (also for aesthetic reasons).

bearND moved this task from Doing to Backlog on the Mobile-Content-Service board.May 20 2016, 8:29 PM

It looks like we have general agreement on the top-level hierarchies:

  • /api/rest_v1/page/featured/{some_day_spec}
  • /api/rest_v1/page/news/{some_day_spec}
  • /api/rest_v1/page/random
  • /api/rest_v1/media/featured/{some_day_spec}

Main things that still seem to be open:

Date spec as path components vs. ISO 8601 string

  • Path components: /featured/{today|year}{/month}{/day}, example /featured/today or /featured/2016/01/01.
  • ISO 8601 timestamp: /featured/{today|date}; example /featured/today or /featured/2016-01-01

Comments:

  • It might be tricky to encode / validate the requirement to specify either today *or* "all of the date components" until we support monthly or even annual summaries. In the docs, these are separate parameters, and the default semantics don't capture the if-this-then-that part. The transition from year to the literal today in the first path component is also a bit awkward.
  • Path segments *could* support listings in the future, but it's not clear that those would be useful.

If we can find a way to cleanly document & validate "if you set the year, then you also need to set the month and day", then I think path components would look very strong for their flexibility. If we can't, then ISO 8601 might be slightly less confusing.
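To make the "if you set the year, then you also need to set the month and day" rule concrete, here is one way it could be checked in handler code. This is a sketch of the validation logic only, not of how it would be expressed in the API docs; the function name parse_path_date and the choice to return None for the today alias are assumptions.

```python
from datetime import date
from typing import Optional, Sequence


def parse_path_date(segments: Sequence[str]) -> Optional[date]:
    """Accept ['today'] or [year, month, day]; reject anything else.

    Returns None for the 'today' alias and a date for a full year/month/day.
    Raises ValueError for partial dates, non-numeric parts or invalid dates.
    """
    if list(segments) == ["today"]:
        return None
    if len(segments) != 3:
        raise ValueError("expected 'today' or year/month/day")
    year, month, day = (int(s) for s in segments)  # int() raises ValueError on junk
    return date(year, month, day)  # date() raises ValueError for e.g. month 13
```

With the ISO 8601 variant, the equivalent check is a single path segment handed to date.fromisoformat, which is part of why that option is slightly easier to document.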

Time zones: Local vs. UTC

Generally, I agree with @mobrovac's point that using UTC consistently throughout the API is desirable. For featured articles and media using date-based transclusions, this looks pretty doable, as those are generally prepared ahead of time.

However, things look a bit more complicated for news. Large projects like enwiki and dewiki use editable pages for this content, which contains news items from several days, on several lines. Also, for obvious reasons those are not filled out ahead of time.

A possible approach would be to store content under UTC dates based on their revision timestamp. This would solve the timezone issue in a way that's compatible with UTC, but might introduce some artifacts where there is a large offset between local timezone and UTC. Parsing the actual dates in the content might be able to get around that, but is also a lot more complex.
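A small sketch of the "store under the UTC date of the revision timestamp" idea, including the kind of artifact mentioned above. The function name utc_storage_date is hypothetical, and the fixed +02:00 offset stands in for Berlin summer time purely for illustration.

```python
from datetime import date, datetime, timedelta, timezone


def utc_storage_date(revision_timestamp: datetime) -> date:
    """Map a timezone-aware revision timestamp to the UTC date used as the storage key."""
    if revision_timestamp.tzinfo is None:
        raise ValueError("revision timestamp must be timezone-aware")
    return revision_timestamp.astimezone(timezone.utc).date()


# An edit made at 00:30 local time on May 27 (UTC+2) is still May 26 in UTC,
# so it would be stored under the previous day from the local editors' point of view.
berlin_summer = timezone(timedelta(hours=2))
local_edit = datetime(2016, 5, 27, 0, 30, tzinfo=berlin_summer)
stored_under = utc_storage_date(local_edit)
```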

Random: Mode in path vs. query parameters

It seems likely that we will eventually provide different random modes, and possibly also support restrictions to namespaces. Since random will likely not be cacheable anyway, using query parameters should be fine here. For this reason, my vote goes to using just /page/random for this particular entry point.

Right now, in https://gerrit.wikimedia.org/r/#/c/290510, I have the most-read articles endpoint as /page/trending/yyyy/mm/dd. I know there's some discussion around date handling so that path segment may change (though I personally love the simplicity and flexibility of it), but /page/trending seems a little awkward considering it's not a page or info about a single page being returned, but rather a list of page titles with metadata. /project/trending would be a natural fit but I think we decided against /project. Is /page/trending good enough?

Also, on https://gerrit.wikimedia.org/r/#/c/290510, @Fjalapeno wrote:

One thing I didn't think of before some conversations about other APIs: we may not want to call this trending and just stick with the same terminology as the discovery team coined: pageviews
We probably want to reserve "trending" for more real time APIs like John Robson's pushipedia work.

Seems reasonable to me. What do you all think?

GWicke added a comment.EditedMay 26 2016, 10:41 PM

Some notes from today's sync-up meeting:

Date format

  • Avoid the doc issues around the today alias by always requiring a full date to be supplied.
  • Go with slashes as the separator: /{year}/{month}/{day}
  • Return the latest available content if the time is current & no exact match is available.

Time zones

  • Use UTC throughout.
  • Refine news timezone alignment over time.

Aggregated feed

  • Decision: Go with /feed/featured.

Trending end point

GWicke updated the task description. (Show Details)May 26 2016, 10:53 PM

I updated the task description in line with the results of the sync-up meeting. I believe we have now resolved all significant outstanding questions & can call this done soon. Please speak up now if you have concerns!

Looks great! Ship it!

@GWicke thank you for summarizing and updating the description. I agree with almost everything except the following:

  • Avoid the doc issues around the today alias by always requiring a full date to be supplied.

I still think we should have a today or latest, at least for the aggregating endpoint.

  • Return the latest available content if the time is current & no exact match is available.

Would you elaborate on that a bit more? What do you mean by "no exact match is available"?

GWicke added a comment.EditedMay 27 2016, 4:54 PM

I still think we should have a today or latest, at least for the aggregating endpoint.

Actually, do we have a need to retrieve historical feeds, or should we just make this /feed/featured, without any date? If we later wanted to add by-date retrieval, then that could still be added as /feed/featured/{year}/{month}/{day}, without any conflicts.

Return the latest available content if the time is current & no exact match is available.

Would you elaborate on that a bit more? What do you mean by "no exact match is available"?

This is to make sure that requests for the current date in UTC always return the latest content, even if there have been no edits to a news article on that date yet. It basically makes sure that the current UTC date works as 'latest'.

I still think we should have a today or latest, at least for the aggregating endpoint.

Actually, do we have a need to retrieve historical feeds, or should we just make this /feed/featured, without any date? If we later wanted to add by-date retrieval, then that could still be added as /feed/featured/{year}/{month}/{day}, without any conflicts.

Yes, the app will let you go to previous day's feed content when you scroll down.

Return the latest available content if the time is current & no exact match is available.

Would you elaborate on that a bit more? What do you mean by "no exact match is available"?

This is to make sure that requests for the current date in UTC always return the latest content, even if there have been no edits to a news article on that date yet. It basically makes sure that the current UTC date works as 'latest'.

I remember you mentioned this during the meeting. Now, thinking about this a bit more from the client perspective, it could duplicate data for different dates if we do it that way. I think an explicit latest endpoint would be better for that. Also, there should be links to previous days, whichever has actual content (that's non-empty).

@bearND, do you see any issue with just having the app request the current day's feed (and other date-dependent content) using the current UTC date?

Also, there should be links to previous days, which ever has actual content (that's non empty).

As I said in the meeting, such links are somewhat tricky to provide efficiently & consistently, especially for dates in the past. The only way to do this I can think of would be to pre-generate content for the entire article history / featured article dates. It is a lot simpler to follow a "this date or earlier" policy, and provide the actual date of the modification in the response (for example, in the last-modified header). The client can then page through available content by requesting a date that's one day earlier than the requested content. If that does not have an exact match (page does not exist, or no edit on that date for a news article), then the next earlier content would be returned.
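The "this date or earlier" policy and the resulting client-side paging can be sketched as follows. This is an in-memory model under stated assumptions, not the storage implementation: available stands for the sorted dates that actually have content, and resolve and page_backwards are hypothetical names.

```python
import bisect
from datetime import date, timedelta
from typing import Iterator, List, Optional


def resolve(available: List[date], requested: date) -> Optional[date]:
    """Server side: return the latest date with content that is <= requested."""
    i = bisect.bisect_right(available, requested)
    return available[i - 1] if i else None


def page_backwards(available: List[date], start: date) -> Iterator[date]:
    """Client side: request one day before the date of the content just returned."""
    requested = start
    while True:
        found = resolve(available, requested)
        if found is None:
            return
        yield found  # the actual content date, e.g. taken from the response
        requested = found - timedelta(days=1)
```

With content on May 20, 23 and 26, starting from May 27 yields the 26th, the 23rd and then the 20th, so the gaps are skipped in one request each without the server having to pre-generate links.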

@bearND, do you see any issue with just having the app request the current day's feed (and other date-dependent content) using the current UTC date?

I think that would be acceptable.

Also, there should be links to previous days, whichever has actual content (that is, non-empty).

As I said in the meeting, such links are somewhat tricky to provide efficiently & consistently, especially for dates in the past. The only way to do this I can think of would be to pre-generate content for the entire article history / featured article dates. It is a lot simpler to follow a "this date or earlier" policy, and provide the actual date of the modification in the response (for example, in the last-modified header). The client can then page through available content by requesting a date that's one day earlier than the requested content. If that does not have an exact match (page does not exist, or no edit on that date for a news article), then the next earlier content would be returned.

We don't need the entire history and we only need to link it one way, backwards in time. We could have a cut-off date where we don't link back to prior days if the requested date is before the cutoff date. I propose a day sometime between 1/1/2016 and about a week before we first deploy the new endpoints.
The advantage of using a _links.previous.href is that it could skip empty days. This would help avoid unnecessary requests for smaller wikis, which may not have a lot of content for the feed.

The advantage of using a _links.previous.href is that it could skip empty days. This would help avoid unnecessary requests for smaller wikis, which may not have a lot of content for the feed.

Empty days can be skipped in my proposal as well.

The advantage of using a _links.previous.href is that it could skip empty days. This would help avoid unnecessary requests for smaller wikis, which may not have a lot of content for the feed.

Empty days can be skipped in my proposal as well.

Do you mean using previous day's content? Otherwise I somehow missed it.
I think supplying older content when the requested date has no content only works if the response also provides the datetime the content is for. In that case I still prefer to just have the link to a previous day which has some content, so that for the current day the client doesn't repeatedly get extra content from older days while that content is subject to modification throughout the day.

GWicke added a comment. Edited May 28 2016, 5:18 PM

Empty days can be skipped in my proposal as well.

Do you mean using previous day's content? Otherwise I somehow missed it.

Sorry if my explanation in T132597#2335088 wasn't entirely clear. Here is some pseudocode to illustrate:

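// Derive the previous day's URL from the Last-Modified header of the current response.
var lastModified = new Date(res.headers['last-modified']);
var prevDate = new Date(lastModified - 86400000); // one day earlier, in milliseconds
// getUTCMonth() is zero-based; getUTCDate() is the day of the month (getDay() would be the weekday).
var prevURL = '/feed/featured/' + prevDate.getUTCFullYear() + '/' + (prevDate.getUTCMonth() + 1) + '/' + prevDate.getUTCDate();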

We could also do this calculation server-side, and provide a "previous" link, without a guarantee that there is actually content for that precise date. Either way, the key part to make this work reliably is the fall-back to earlier content.

@GWicke Thank you for the clarification. I would like that server side and have it eventually also skip empty days. Maybe not needed for MVP but should be very simple to implement anyways. Just loop until you reach a hard-coded date or non-empty content.
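The server-side loop suggested here could look roughly like the sketch below. `hasContent` is an assumed predicate standing in for a storage lookup (synchronous here for simplicity), and the cutoff date is a hypothetical configuration value, not something agreed in this task.

```javascript
// Walk backwards one day at a time from `fromDate` until a day with
// content is found or the hard-coded cutoff is passed.
// `hasContent` is a hypothetical predicate: (isoDate) => boolean.
function findPreviousNonEmpty(fromDate, cutoffDate, hasContent) {
  const d = new Date(fromDate + 'T00:00:00Z');
  const cutoff = new Date(cutoffDate + 'T00:00:00Z');
  for (;;) {
    d.setUTCDate(d.getUTCDate() - 1);
    if (d < cutoff) return null; // no link before the cutoff
    const iso = d.toISOString().slice(0, 10);
    if (hasContent(iso)) return iso;
  }
}
```

Returning `null` past the cutoff would let the response simply omit the previous link.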

GWicke added a comment. Edited May 31 2016, 5:48 PM

@bearND, we should be able to get "this date or earlier" content in a single query, so implementing this behavior should be fairly straightforward. Providing the previous link based on the date math is also simple enough server-side & doesn't add much size overhead, so +1 from me on including that in the interest of client simplicity.

Edit: Updated the task description to include these points.

GWicke updated the task description. May 31 2016, 5:54 PM

@GWicke I'm not worried about the implementation of "this date or earlier". If we do that then we'd have to provide a date in the response so the client knows that it's from an earlier date.

GWicke added a comment. Edited May 31 2016, 6:00 PM

If we do that then we'd have to provide a date in the response so the client knows that it's from an earlier date.

@bearND, the link will take care of that, by linking to the previous date based on the content's actual edit date.

While I was looking at the PR for the first feeds endpoint implementation, one question started to bother me. The endpoint returns full 'content' of the featured article, as extract, thumbnail, etc. In RESTBase this content will be cached, and also cached in Varnish.

The featured endpoint would likely be regenerated with some change-prop rule. The rule could match on the time, or the featured (news etc) template/page update, or just on a Main_Page update.

However, a featured article would likely get a lot of attention during the day it's featured, so it will likely be modified a lot, and we want the users to see the latest version. How are we going to build a dependency like 'this article is edited && this article is a featured article for this day (or any day actually) => rerender featured article for the day'? We need full-featured dependency tracking for that.
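As a rough illustration of what such a dependency rule would have to compute, here is a sketch. It assumes a hypothetical reverse index (`featuredDatesByTitle`) from article title to the dates on which it was featured; nothing like this exists in the discussion so far, and the zero-padded URL shape is also an assumption.

```javascript
// Given the title of an edited article, list the feed URLs that would
// need re-rendering. `featuredDatesByTitle` is a hypothetical
// Map<string, string[]> of title -> 'YYYY-MM-DD' featured dates.
function feedsToRerender(editedTitle, featuredDatesByTitle) {
  const dates = featuredDatesByTitle.get(editedTitle) || [];
  return dates.map((d) => '/feed/featured/' + d.split('-').join('/'));
}
```

Maintaining that reverse index is the hard part; the lookup itself is trivial once it exists.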

Another approach could be to cache only the title of the featured article in RESTBase, and go to summary for the actual content; RB would then serve the latest content all the time. But this comes at a really high cost (two Cassandra round trips instead of one), and it doesn't solve the problem of Varnish purging.

@GWicke @mobrovac @Eevans What are your thoughts?

@Pchelolo, I think the main challenge for change propagation is that the title names / patterns differ between projects. Otherwise, date patterns are predictable & relatively straightforward to match. We should update any date, and not just the current date.

@Pchelolo Maybe it would help to have a separate endpoint that provides just the title and/or pageid? Not sure how critical that is.

Another thing that came up when discussing the random endpoint implementation with @Mhurd is that we now want to have a second "random" endpoint.

The first one is still the one which provides the summary data (description extract, thumbnail).
The second one would provide the full page content, similar to the mobile-sections.
Not sure how to best model this in our RESTBase structure.

For the summary content, I was thinking something like /summary/random or /page/random/summary.
Initially I thought the /page/random endpoint would be more appropriate to provide the full content but in MCS we have that split into lead and remaining sections, too.

To spin this further, it seems that we'd want to do something like a Unix command pipe / MW API generator, where you have something spit out a title or pageid and then combine it with a decorator which provides the extra content. In one case the decorator would provide the summary data, in the other case it would provide the lead and remaining sections of the page content.
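The pipe / generator idea can be sketched as plain function composition. Everything here is hypothetical: `titleSource` stands in for whatever produces a title (random, featured), and `fetchContent` is an assumed stub for the per-format content lookup.

```javascript
// Combine a title source with a decorator, like `source | decorator`.
function pipeTitle(titleSource, decorate) {
  return () => decorate(titleSource());
}

// Build a decorator for one format. `fetchContent` is a hypothetical
// function (format, title) => content, e.g. a summary or lead section.
function decoratorFor(format, fetchContent) {
  return (title) => ({ title, format, content: fetchContent(format, title) });
}
```

The same title source can then be paired with a summary decorator in one case and a lead-section decorator in the other, without duplicating the title logic.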

GWicke added a comment. Edited Jun 17 2016, 9:47 PM

@bearND and I discussed this a bit on IRC today. We are leaning towards a structure like this:

  • Have a single entry point /page/random/{format}, with valid values for format being summary and mobile-sections for now.
  • On request, the RB template for this entry point queries MCS (or some other backend service) to get a random_title, and returns a 302 redirect to /page/{format}/{random_title}.
  • Clients follow the redirect, and retrieve cached content as usual.
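The redirect step above could be sketched as follows. This is not the actual RESTBase template; `getRandomTitle` is a hypothetical stand-in for the backend call, and answering 404 for unknown formats is an assumption made for the sketch.

```javascript
// Formats named in this proposal; the list is expected to change.
const SUPPORTED_FORMATS = ['summary', 'mobile-sections'];

// Build the redirect response for /page/random/{format}.
// `getRandomTitle` is a hypothetical function returning a title string.
function randomRedirect(format, getRandomTitle) {
  if (!SUPPORTED_FORMATS.includes(format)) {
    return { status: 404, headers: {} }; // assumed behaviour for unknown formats
  }
  const title = getRandomTitle();
  return {
    status: 302,
    headers: { location: '/page/' + format + '/' + encodeURIComponent(title) },
  };
}
```

The content itself is never produced here, so caching stays with the regular /page/{format}/{title} end points.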

Some advantages we saw in this solution are:

  • We only need a single, simple entry point for returning different random response formats.
  • Documentation is fairly straightforward, as each of the supported formats can link to the respective end point for detailed per-format docs.
  • Caching of content remains limited to the actual content end points.
  • Caching of redirects could be introduced in the future, without a need to worry about handling content changes (except for page deletions, which are much rarer).

Possible downsides are:

  • Possibly, a minor latency penalty from the redirect. This might be partly or wholly compensated for by returning cached content from the edge.

@GWicke +1 for this plan

bearND updated the task description. Jun 29 2016, 8:35 PM

Thinking about this some more I think we would want to use mobile-sections-lead instead of mobile-sections under /page/random/{format}. Then we would need to return a link to the corresponding /page/mobile-sections-remaining endpoint.

bearND added a comment. Edited Jun 29 2016, 8:37 PM

Updated the description to reflect the change for the random endpoint: page/random -> page/random/summary

bearND updated the task description. Jul 7 2016, 12:11 AM

Updated description to change page/random/summary to page/random/title since that's all the app needs currently T139424.
We can implement the summary and/or the mobile-sections-lead format variants once a need arises.

In the same vein, we should consider changing the featured article endpoint to page/featured/{format}/{yyyy}/{mm}/{dd} instead of just page/featured/{yyyy}/{mm}/{dd}. Thoughts?

Bringing the discussion here back from T136960: Create first public endpoints for feeds (relevant comments are T136960#2433969, T136960#2434058 and T136960#2434395).

Actually, page/random/{format} where {format} is any of the URIs under /page/ makes sense to me. That way, we could have only one routine calculating the random title name and then have a proper redirect. Thoughts?

We can add more once we actually need them. I'm thinking we could have format = {title, summary, mobile-sections-lead}. For now, let's just focus on the ones we need right now, which is title for random. My patch for page/random/title is ready for review. I hope we can merge it before the deploy window today.

What I was thinking was this: MCS has one endpoint only - page/random/title - which simply returns the normalised title in the body. Then, in RESTBase we can expose various page/random/{format} endpoints, each of which would get the random title from MCS and then redirect to the appropriate endpoint. Example:

  • client sends a request to en.wp.o/api/rest_v1/page/random/html
  • RESTBase sends a request to MCS for page/random/title
  • MCS responds with Foobar as the response's body
  • RESTBase returns a 303 for en.wp.o/api/rest_v1/page/html/Foobar
  • the redirect is cached in Varnish for 60 seconds or so

{format} would include title, html, data-parsoid, mobile-sections, mobile-sections-lead, mobile-sections-remaining, summary and related.

As RESTBase already contains the page/title/{title} endpoint, the Android app would just need to adapt its input format to that (it returns a JSON comprising the name, rev and other info about a title) wrt T139424.

I'm thinking we could do the same also for the featured article of the day. page/featured/{format}/{yyyy}/{mm}/{dd} instead of just page/featured/{yyyy}/{mm}/{dd}.

For it to be feasible we'd need to come up with a specification of the information that is to be contained in the response so that we can apply the above logic to this endpoint as well. And that wouldn't be a bad idea. For page/featured/{date} MCS could return a list of things that need to be in the response, and then those could be assembled for each {format}.

bearND added a comment. Jul 7 2016, 5:33 PM
  • client sends a request to en.wp.o/api/rest_v1/page/random/html
  • RESTBase sends a request to MCS for page/random/title
  • MCS responds with Foobar as the response's body
  • RESTBase returns a 303 for en.wp.o/api/rest_v1/page/html/Foobar

+1

  • the redirect is cached in Varnish for 60 seconds or so

I think the values for the random endpoints should only be cached for 1 second. The apps will add affordances for the random article to be refreshed easily and quickly. Anything longer than a second and this will result in the user getting the same random page twice after trying to refresh random.

{format} would include title, html, data-parsoid, mobile-sections, mobile-sections-lead, mobile-sections-remaining, summary and related.

For random mobile-sections-remaining probably is not needed/useful (because you don't just want to get the remaining sections of a random page without the corresponding lead section + metadata), but for the featured article it would be.

As RESTBase already contains the page/title/{title} endpoint, the Android app would just need to adapt its input format to that (it returns a JSON comprising the name, rev and other info about a title) wrt T139424.

I'm not sure I completely understand this portion and what you want us to do here. The Android app doesn't use the page/title/{title} endpoint, and I'm not sure why it returns a JSON array with just a single item. It seems to me that that is always the case. Do you want MCS to provide the same content format in the response for page/random/title/{title} as page/title/{title}? IOW would you want MCS to add this extra JSON array and fill out as much of the metadata as possible? Do you want MCS to redirect to page/title/{title}?

I'm thinking we could do the same also for the featured article of the day. page/featured/{format}/{yyyy}/{mm}/{dd} instead of just page/featured/{yyyy}/{mm}/{dd}.

For it to be feasible we'd need to come up with a specification of the information that is to be contained in the response so that we can apply the above logic to this endpoint as well. And that wouldn't be a bad idea. For page/featured/{date} MCS could return a list of things that need to be in the response, and then those could be assembled for each {format}.

I was thinking MCS could just implement page/featured/title/{date} which returns the title. RB could provide the redirects to the other formats.

  • the redirect is cached in Varnish for 60 seconds or so

I think the values for the random endpoints should only be cached for 1 second. The apps will add affordances for the random article to be refreshed easily and quickly. Anything longer than a second and this will result in the user getting the same random page twice after trying to refresh random.

Yup, lowered it to 2 seconds for random endpoints, cf PR 636.
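To make the refresh concern concrete, here is a small simulation of a TTL cache in front of the random redirect. It is only a toy model, not how Varnish is configured; the clock is injected so the behaviour can be checked without waiting.

```javascript
// Toy single-entry TTL cache. `now` is an injected clock returning
// milliseconds, so tests can advance time manually.
function makeTtlCache(ttlMs, now) {
  let entry = null;
  return function get(produce) {
    const t = now();
    if (entry === null || t - entry.at >= ttlMs) {
      entry = { at: t, value: produce() }; // miss or expired: regenerate
    }
    return entry.value;
  };
}
```

With a 2000 ms TTL, a refresh after one second gets the same redirect again, while a refresh after the TTL has elapsed gets a new one.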

For random mobile-sections-remaining probably is not needed/useful (because you don't just want to get the remaining sections of a random page without the corresponding lead section + metadata), but for the featured article it would be.

It was a lapsus linguae on my part, mobile-sections-remaining makes no sense at all in this context.

As RESTBase already contains the page/title/{title} endpoint, the Android app would just need to adapt its input format to that (it returns a JSON comprising the name, rev and other info about a title) wrt T139424.

I'm not sure I completely understand this portion and what you want us to do here. The Android app doesn't use the page/title/{title} endpoint, and I'm not sure why it returns a JSON array with just a single item. It seems to me that that is always the case. Do you want MCS to provide the same content format in the response for page/random/title/{title} as page/title/{title}? IOW would you want MCS to add this extra JSON array and fill out as much of the metadata as possible? Do you want MCS to redirect to page/title/{title}?

MCS may continue to return {"title": "<random-title-here>"} (but with a caveat, see the comments in the aforementioned PR for more info). This is about the SLA between the Android app and RESTBase. The Android app shouldn't expect the output given by MCS for page/random/title; instead, it should expect a redirect that takes it to RESTBase's page/title/{title} (where {title} will be a random title name as returned by MCS).

So, when the Android app makes a request for en.wp.o/api/rest_v1/page/random/title, assuming the random title is Foobar, it should expect to be handed a redirect:

status: 303
headers:
  location: ../../page/title/Foobar

Then, if it follows the link in the location header, it will receive:

status: 200
body:
  items:
    - redirect: false
      comment: "/* History and etymology */"
      timestamp: "2016-06-27T04:07:51.000Z"
      user_text: "112.64.123.42"
      user_id: 0
      latest_tid: null
      nextrev_tid: null
      renames: null
      title: Foobar
      page_id: 11178
      rev: 727167735
      latest_rev: null
      tid: "b7f070a8-3c1c-11e6-85d1-ab0b0f37d30a"
      namespace: 0
      restrictions: null
      tags: null
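From the client's side, handling this exchange might look like the sketch below: resolve the relative location header against the request URL, then read the title from the first entry of items. The helper names are hypothetical; the only API used is the standard `URL` constructor.

```javascript
// Resolve a (possibly relative) Location header against the URL that
// was originally requested.
function resolveRedirect(requestUrl, location) {
  return new URL(location, requestUrl).href;
}

// Extract the title from a body shaped like { items: [ { title, ... } ] }.
// Returns null if the shape is not as expected.
function titleFromItems(body) {
  if (!body || !Array.isArray(body.items) || body.items.length === 0) {
    return null;
  }
  return body.items[0].title;
}
```

Most HTTP clients follow 303 redirects on their own, so in practice only the second helper may be needed.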

Hope this clarifies any concerns.

Ok, in this case I would prefer to have the same format in MCS and RB to make it easier to test from the Android app.
We only need the title, so I'd change the MCS code to emit the extra items array, whose reason for being I still don't understand.

body:
  items:
    - title: Foobar

Change 298243 had a related patch set uploaded (by BearND):
Insert items array into random/title response

https://gerrit.wikimedia.org/r/298243

Added a patch to insert the dummy items array. We could also implement the redirect to page/title/{title} in MCS if you prefer.

Change 298408 had a related patch set uploaded (by Mholloway):
Add mw-format-encoded title to random & featured article endpoints

https://gerrit.wikimedia.org/r/298408

Change 298408 merged by Mobrovac:
Add mw-format-encoded title to random & featured article endpoints

https://gerrit.wikimedia.org/r/298408

Change 298243 merged by jenkins-bot:
Insert items array into random/title response

https://gerrit.wikimedia.org/r/298243

/feed/featured and /page/random/{format} have been deployed and are now live in prod!

bearND renamed this task from RFC: Agree on endpoints for feeds to RFC: Agree on feed endpoints. Jul 22 2016, 5:06 PM
bearND closed this task as Resolved. Aug 11 2016, 4:42 PM
bearND claimed this task.

Documentation for the two new public endpoints is at https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/RESTBase_services_for_apps#....2Fpage.2Frandom.2F.7Bformat.7D

The non-public endpoints are mentioned in the MCS Git repo's README.md file as well.