
Have RESTBase request individual chunks of the feed endpoint from MCS
Closed, ResolvedPublic

Description

Currently, the feed endpoint exposed by the public REST API is derived from MCS's aggregated endpoint and is re-requested every 2 seconds. Unfortunately, it takes MCS a long time to compute the aggregated result. It would be much better for RESTBase to request the chunks individually and aggregate them itself: the computation load would then be spread over multiple MCS workers and complete more quickly.
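Conceptually, the fan-out could look something like this. This is a minimal sketch, not RESTBase's actual code; the part names and the injected `fetchPart` helper are illustrative assumptions:

```javascript
// Merge an array of { name, body } part results into a single feed object.
function mergeParts(results) {
    const feed = {};
    for (const { name, body } of results) {
        feed[name] = body;
    }
    return feed;
}

// Fan out one request per part via the supplied fetchPart(name) function.
// Promise.all resolves once every chunk has arrived, so the total latency
// is roughly that of the slowest single part, not the sum of all parts.
function aggregateFeed(fetchPart, partNames) {
    return Promise.all(partNames.map(name =>
        fetchPart(name).then(body => ({ name, body }))
    )).then(mergeParts);
}
```

Each part request can be served by whichever MCS worker picks it up, which is what spreads the load.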

As a second step, we should find a way for RESTBase to request only the chunks it really needs. For example, there is no sense in re-requesting the article of the day every two seconds, but we do need to do so for the random pieces of content contained in the feed.

Event Timeline

Do we want to expose the individual pieces of content publicly too? Since the feed payload is quite small, I don't see much value in exposing the individual pieces. What do you think?

Another question is how often can/should we update the stored titles:

  1. Article of the day / picture of the day / most viewed: once a day; but since we don't really know when each Wikipedia updates them, perhaps once per hour?
  2. In the news: I don't completely understand where this comes from, so I have no idea about the update strategy. Is once per hour good enough?

If updating the titles once per hour is good enough for all the content, we could add a bucket in RESTBase with a 1-hour TTL and refresh the titles from MCS. The actual returned content would be substituted with summaries on every request to RB, but we would contact MCS only about once an hour.
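As an in-memory sketch of what such a bucket would do (in RESTBase the store would be Cassandra with a native TTL, not a Map; the class and helper names here are hypothetical):

```javascript
// Minimal TTL bucket: entries silently expire ttlMs after being stored.
class TtlBucket {
    constructor(ttlMs, now = Date.now) {
        this.ttlMs = ttlMs;
        this.now = now;           // injectable clock, handy for testing
        this.entries = new Map();
    }

    get(key) {
        const entry = this.entries.get(key);
        if (!entry || this.now() - entry.storedAt > this.ttlMs) {
            return undefined;     // missing or expired
        }
        return entry.value;
    }

    set(key, value) {
        this.entries.set(key, { value, storedAt: this.now() });
    }
}

// Return stored titles if still fresh; otherwise go back to MCS,
// so MCS is contacted at most roughly once per TTL window.
function getTitles(bucket, key, fetchFromMcs) {
    const cached = bucket.get(key);
    if (cached !== undefined) {
        return Promise.resolve(cached);
    }
    return fetchFromMcs(key).then(titles => {
        bucket.set(key, titles);
        return titles;
    });
}
```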

Something like that, yes, @Pchelolo. The content is currently cached for only a couple of seconds in Varnish, while RB just acts as a proxy. My current idea is to cache the contents of the aggregated feed response from MCS in RB, but request the random article from MCS every time a request hits RB, while serving the rest of the content untouched. The article of the day, image of the day and in-the-news parts should be fine with a Cassandra TTL of at most 60 minutes. This may create edge cases and introduce inconsistencies, but let's address those separately; this simple change will significantly improve the user experience and alleviate some of the excess load MCS is currently under.
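The hybrid strategy above can be sketched as follows. Again purely illustrative, with hypothetical function parameters standing in for the RB storage read and the MCS call:

```javascript
// Serve the stored aggregated feed, but re-fetch only the `random` part
// on every request; everything else is passed through from storage untouched.
function serveFeed(getStoredFeed, fetchRandomFromMcs) {
    return Promise.all([getStoredFeed(), fetchRandomFromMcs()])
        .then(([stored, random]) =>
            Object.assign({}, stored, { random }));
}
```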

Change 308589 had a related patch set uploaded (by Mobrovac):
Feed endpoints: Allow the client to set dontThrow

https://gerrit.wikimedia.org/r/308589

PR #662 for RESTBase splits the request into its components and re-requests only the random title when the response falls out of Varnish's cache. It depends on the merge and deployment of Gerrit change 308589, though.

Change 308589 merged by Mobrovac:
Feed endpoints: Allow the client to request aggregated parts

https://gerrit.wikimedia.org/r/308589

Change 309004 had a related patch set uploaded (by Mobrovac):
Feed endpoints: Various bug fixes

https://gerrit.wikimedia.org/r/309004

Change 309004 merged by Mobrovac:
Feed endpoints: Various bug fixes

https://gerrit.wikimedia.org/r/309004

Both changes have been merged and deployed in production. Clients should see improved performance when loading the feed content. Resolving.

Could we use the purge mechanism for the most-read portion? It looks like ChangeProp should be easy to set up for this pair (PageViews API -> most-read portion) since it's all in RB.
If that is set up, we could set the TTL for this portion to 1 day or even longer. That one is probably the most resource-intensive portion of the aggregated feed.
The problem is that it's not always clear when exactly the day's PageViews API results are ready; having ChangeProp would mitigate this.

The article of the day and picture of the day tend to be known in advance, so refreshing once per day for the current day should be fine. Past results could be kept even longer.

The TTL for the 'In the news' portion should probably be a bit shorter than you had planned, somewhere around 5 minutes.

@Pchelolo FYI, the 'in the news' portion is scraped from the news pages on the respective wikis. Here's the source. Only a few wikis are currently supported, but we would like to expand this list in the future.


Hehe @bearND, how would ChangeProp know when to purge? We'd need an event from Analytics when the Cassandra backfill for the current day is complete. I don't think that's feasible/practical to do. @Milimetric, could you give us a little insight into how the PageViews API backfilling works, and whether it's possible to notify us somehow (post an event to EventBus, or just call some HTTP endpoint) when backfilling of some portion of the data has completed?

@Pchelolo, yeah actually, it shouldn't be too hard. The backfilling happens in an Oozie job [1], so as another workflow step there we could send an event. If you make the schema you need, we can think about it more. Also, maybe make a separate task for it.

[1] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra and for a specific example here: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/daily/workflow.xml#L395
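As a starting point for that schema discussion, a backfill-completion event could look roughly like this. This is purely a hypothetical sketch; the topic, every field name, and all values here are assumptions, not an existing EventBus schema:

```json
{
  "meta": {
    "topic": "analytics.pageview_backfill_complete",
    "dt": "2016-09-08T11:00:00Z",
    "domain": "analytics.wikimedia.org"
  },
  "granularity": "daily",
  "date": "2016-09-07",
  "dataset": "pageviews_per_article"
}
```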