New designs for most read include a graph of the last 5 days. (See the iOS widget for the design.)
Instead of requiring clients to make extra calls, we should return daily page views in an array for each result
Subject | Repo | Branch | Lines +/-
---|---|---|---
New: add last 5 days of pageviews to most-read response | mediawiki/services/mobileapps | master | +235 -35
@Fjalapeno Seeing is believing. j/k. more remembering/visualizing the details in this case.
I do remember seeing the Sparklines design before. I just wasn't sure that 5 values would be enough. It looks like 5 values are sufficient.
This requires changing the content format. We currently have the single integer value "views". How do you envision we transition to the array? What should it be called?
@bearND Yeah I would leave "views" as is for compatibility. Add the new property like "views_by_day" with the 5 values (an array of dictionaries with date and view count).
Duplication would be minimal - and I don't think confusing. Size would be negligible and gzip compression would take care of that anyways.
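To make the proposed shape concrete, here is a minimal sketch of adding "views_by_day" next to the existing "views" value. The helper name `withViewsByDay` is hypothetical, and the date/count pairing is only illustrative: it assumes the five counts quoted later in this thread belong to the five days ending 2016-10-19.

```javascript
// Hypothetical helper (not from the MCS codebase): returns a copy of a
// most-read entry with a "views_by_day" array added. The original "views"
// integer is left untouched for backwards compatibility.
function withViewsByDay(entry, daily) {
  return Object.assign({}, entry, {
    // Each item is a small dictionary holding a date and a view count.
    views_by_day: daily.map(({ date, views }) => ({ date, views }))
  });
}

// Illustrative data only; the date pairing is an assumption.
const entry = { views: 298864, rank: 8, title: 'AMGTV', pageid: 18746613 };
const daily = [
  { date: '2016-10-15Z', views: 323108 },
  { date: '2016-10-16Z', views: 339489 },
  { date: '2016-10-17Z', views: 313782 },
  { date: '2016-10-18Z', views: 300858 },
  { date: '2016-10-19Z', views: 299326 }
];

const extended = withViewsByDay(entry, daily);
console.log(JSON.stringify(extended, null, 2));
```

Old clients that only read "views" would keep working, while new clients could draw the graph from "views_by_day".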
I like this approach for the most part. I'd be inclined to actually skip sending the dates and just do something like an array of the current and last 4 days' pageview counts in reverse chronological order:
{
  "mostread": {
    "date": "2016-10-19Z",
    "articles": [
      [...]
      {
        "views": 298864,
        "views_last_five_days": [ 299326, 300858, 313782, 339489, 323108 ],
        "rank": 8,
        "title": "AMGTV",
        "pageid": 18746613,
        [...]
      },
      [...]
    ]
  }
}
(As you can see, the pageview counts seem not to match up 100% between different endpoints in the Pageview API, but I don't think that's a huge deal. They'll be close.)
Looks like we're getting rid of the views properties (https://gerrit.wikimedia.org/r/#/c/320269). So that's less confusing then. How can we convey succinctly that the array of views is in reverse chronological order?
Why is the value for views different from the first value in views_last_five_days?
@bearND @Mholloway @Pchelolo @GWicke @mobrovac @Milimetric
Hey all. In summary, we want the top non-bot pageviews on a given day and, for each of these entries, the last 5 days of pageviews, something like:
{
  "date": "2016-10-19Z",
  "articles": [
    {
      "title": "Dog",
      "viewsLastFiveDays": [ 299326, 300858, 313782, 339489, 323108 ],
      "rank": 8,
      ...
    },
    ...
  ]
}
Just a note on NOT including dates…
We have recently seen that the 2 APIs (top read and page view API) can sometimes be out of sync, where one API may have newer data than the other. (Specifically, Top Read has newer data than the page view API.)
@JoeWalsh just actually worked around this issue in the iOS app.
For this reason, it may be worthwhile to include the dates as a hash rather than just an array, because it may not be clear to the client which days of page view data it has. Even if the API is up to date, it is still a bit ambiguous because of the time delay for results rolling in and the fact that it is UTC.
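As a rough illustration of the date-keyed alternative, the sketch below turns a reverse-chronological array of counts into a hash keyed by UTC date. The function name `toViewsByDate` and the `YYYY-MM-DD` key format are assumptions for this example, not part of any agreed response format.

```javascript
const DAY_IN_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: `latestDate` is a 'YYYY-MM-DD' string for the newest
// day, and `counts` is assumed to be in reverse chronological order
// (newest first), as proposed earlier in this thread.
function toViewsByDate(latestDate, counts) {
  const latest = new Date(`${latestDate}T00:00:00Z`);
  const byDate = {};
  counts.forEach((count, i) => {
    // Step back one UTC day per array position.
    const day = new Date(latest.getTime() - i * DAY_IN_MS);
    byDate[day.toISOString().slice(0, 10)] = count;
  });
  return byDate;
}

const viewsByDate = toViewsByDate(
  '2016-10-19',
  [299326, 300858, 313782, 339489, 323108]
);
console.log(viewsByDate);
```

With explicit dates, a client can tell exactly which days it received, even when one of the upstream endpoints lags behind.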
@Fjalapeno I assume with top read you mean MCS most read endpoint. I wonder how most read can have newer content if it depends on the PageView API. How did @JoeWalsh work around the issue?
@bearND no it seems to be the Analytics endpoint itself - not the MCS.
Joe worked around it by - and this is what is weird - getting the current day from the MCS top read result, since the Pageview API was not updated yet but top read seemed to have already finished.
I think you are making a (reasonable) assumption that the Most read depends on the results of the PageView API. But the way that analytics ingests the page view data and makes it available doesn't need to work that way and it appears that it doesn't.
It came up in our work planning meeting today that this task could be implemented multiple ways. Here are a few of the approaches I'm aware of:
- Request the top pages for a day X. Filter out bots. For each remaining entry in X, request five days of pageviews
- Request the top pages for a day X. Filter out bots. Request the top pages for the past five days. For results that appear both in X and one or more of the previous days, aggregate the pageview historical data for each result
- Update the Pageview API
- Something else.
I'm not sure what #4 is but it sounded like there was a RESTBase alternative. I'm currently pursuing #1 and hoping that caching makes this approach practical.
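Approach #1 above could be sketched roughly as follows. This is not the actual patch: `fetchTop`, `fetchDailyViews`, and `isBot` are hypothetical injected functions standing in for the real Pageview API calls and bot filtering, and the `flaggedAsBot` field in the stub data is invented for the example.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Sketch of approach #1: get the top pages for `day`, drop bot-driven
// entries, then request five days of pageviews for each remaining entry.
async function mostReadWithHistory(day, deps) {
  const { fetchTop, fetchDailyViews, isBot } = deps;
  // Five days inclusive: the requested day plus the four days before it.
  const start = new Date(day.getTime() - 4 * DAY_MS);
  const top = await fetchTop(day);
  const kept = top.filter(article => !isBot(article));
  return Promise.all(kept.map(async article => {
    const views = await fetchDailyViews(article.title, start, day);
    return Object.assign({}, article, { viewsLastFiveDays: views });
  }));
}

// Stubbed dependencies so the sketch runs without network access.
const stubDeps = {
  fetchTop: async () => [
    { title: 'Dog', rank: 8 },
    { title: 'Hypothetical_bot_target', rank: 9, flaggedAsBot: true }
  ],
  fetchDailyViews: async () => [299326, 300858, 313782, 339489, 323108],
  isBot: article => article.flaggedAsBot === true
};

mostReadWithHistory(new Date(Date.UTC(2016, 9, 19)), stubDeps)
  .then(articles => console.log(JSON.stringify(articles)));
```

Note that this makes one extra request per remaining article, which is why caching matters for this approach.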
So, #1 seems like the way to go IMHO.
Well, I imagine #3 would be baking in the kind of logic you're doing in #1 for the top pages endpoint, like maybe with a ?last-five-days=true flag or something similar. That would be possible and maybe it would make sense if this is something you all need to rely on long-term. I can talk in a meeting if you want to brainstorm.
I am curious what MCS is and how it's updated more reliably than the Pageview API.
That sounds great. I think this would make this much easier for MCS.
MCS stands for Mobile Content Service. It's the node service that, amongst other things, hosts the implementation for the components of the aggregated feed endpoint. In this task we're talking about most-read.
The MCS most-read endpoint essentially just gets data from the Pageview API[1] and massages it for our use case, so it's impossible for it to be more up-to-date than the Pageview API is. So I'm confused about what could be going on with it appearing more up-to-date and what the workaround @Fjalapeno mentioned could be doing.
Is there a Phab task with discussion?
[1] https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/feed/most-read.js#L34-L61
@Mholloway correct me if I am wrong… the MCS gets its most read view data from:
https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia.org/all-access/2016/12/05
Which is not the same as:
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/Dog/daily/20161205/20161205
What appears to be happening is that the https://wikimedia.org/api/rest_v1/metrics/pageviews/top/ is updated before https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/
And so we used the views from the "top" API to supplement the missing data from the "per-article" API (which to be clear is available "eventually" just not as quickly)
@Milimetric are you able to explain why we could be seeing this behavior?
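The workaround described above might look something like the sketch below. It is not the iOS app's actual code: the function name `supplementWithTop` and the `{ date, views }` item shape are assumptions, and the sample numbers are borrowed from the AMGTV example earlier in this thread purely for illustration.

```javascript
// Hypothetical helper: `perArticle` is a list of { date, views } items from
// the "per-article" endpoint, oldest first, possibly missing the newest day.
// `topEntry` is { date, views } for that newest day, taken from the "top"
// endpoint, which tends to be updated earlier.
function supplementWithTop(perArticle, topEntry) {
  const alreadyPresent = perArticle.some(item => item.date === topEntry.date);
  if (alreadyPresent) {
    // Per-article data has caught up; prefer it and change nothing.
    return perArticle.slice();
  }
  return perArticle.concat([{ date: topEntry.date, views: topEntry.views }]);
}

// Per-article data that stops one day short (illustrative values).
const lagging = [
  { date: '2016-10-15', views: 323108 },
  { date: '2016-10-16', views: 339489 },
  { date: '2016-10-17', views: 313782 },
  { date: '2016-10-18', views: 300858 }
];
const filled = supplementWithTop(lagging, { date: '2016-10-19', views: 298864 });
console.log(filled);
```

Because the two endpoints can report slightly different counts for the same day, the supplemented value may shift a little once the per-article data arrives.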
Ah, I see. Yeah, we don't use the per-article pageviews endpoint at all in MCS yet, although I think @Niedzielski will be using it for the patch for this ticket. @Milimetric or one of the Services engineers would probably have a better idea on what's going on behind the scenes that could cause a discrepancy between those two.
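For reference, the two Pageview API URLs quoted above use different date formats (`YYYY/MM/DD` for "top", `YYYYMMDD` for "per-article"). A small sketch of building both, with hypothetical helper names `topUrl` and `perArticleUrl` and the `all-access` / `all-agents` segments copied from those example URLs:

```javascript
const PAGEVIEWS_BASE = 'https://wikimedia.org/api/rest_v1/metrics/pageviews';

function pad2(n) {
  return String(n).padStart(2, '0');
}

// "top" endpoint: the date is split into path segments, e.g. 2016/12/05.
function topUrl(domain, date) {
  const y = date.getUTCFullYear();
  const m = pad2(date.getUTCMonth() + 1);
  const d = pad2(date.getUTCDate());
  return `${PAGEVIEWS_BASE}/top/${domain}/all-access/${y}/${m}/${d}`;
}

// "per-article" endpoint: start and end dates are compact, e.g. 20161205.
function perArticleUrl(domain, title, start, end) {
  const compact = dt =>
    `${dt.getUTCFullYear()}${pad2(dt.getUTCMonth() + 1)}${pad2(dt.getUTCDate())}`;
  return `${PAGEVIEWS_BASE}/per-article/${domain}/all-access/all-agents/` +
    `${encodeURIComponent(title)}/daily/${compact(start)}/${compact(end)}`;
}

const dec5 = new Date(Date.UTC(2016, 11, 5));
console.log(topUrl('en.wikipedia.org', dec5));
console.log(perArticleUrl('en.wikipedia.org', 'Dog', dec5, dec5));
```

Using the UTC getters keeps the generated dates aligned with the UTC-based data regardless of the server's local time zone.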
Yes, but it's a fundamental limitation. This is the Oozie bundle of jobs that populates data for those endpoints [1]. These jobs run in parallel and the per-article one takes longer to execute due to the amount of data involved. If "n" is the number of articles, per-article needs to push O(n) data to Cassandra while top needs to push O(1). There's no real shortcut to that. Even if we were computing everything in real-time via stream processors, we would still have a lot of data to copy over the network and write to disk, and that takes a lot of time. So the overall time it takes us to update these endpoints might change, but the top one will always update quicker than the per-article one while we are serving from disks.
This makes the need for a "views for the last 5 days" parameter or something similar even more obvious. Pageviews are not our focus for this next quarter, we're very much deep in work on editing data, but it's worthwhile to capture this requirement and start prioritizing it now. Please set up a meeting.
[1] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/bundle.xml
Change 330836 had a related patch set uploaded (by Niedzielski):
New: add last 5 days of pageviews to most-read response
Change 330836 merged by jenkins-bot:
New: add last 5 days of pageviews to most-read response