Return array of last 5 days of page views for each most read article
Closed, Resolved · Public · 2 Estimated Story Points

Description

New designs for most read include a graph of the last 5 days. (See the iOS widget for the design.)

Instead of requiring clients to make extra calls, we should return daily page views in an array for each result

Event Timeline

@Fjalapeno would you add a screenshot or link to a design document?

@Fjalapeno Seeing is believing. j/k. more remembering/visualizing the details in this case.
I do remember seeing the Sparklines design before. I just wasn't sure that 5 values would be enough. It looks like 5 values are sufficient.

This requires changing the content format. We currently have the single integer value views. How do you envision we transition to the array? What should it be called?

  • Add an array with the remaining 4 earlier days page view numbers?
  • Add an array for all 5 values? That would duplicate data.

@bearND Yeah I would leave "views" as is for compatibility. Add the new property like "views_by_day" with the 5 values (an array of dictionaries with date and view count).

Duplication would be minimal - and I don't think confusing. Size would be negligible and gzip compression would take care of that anyways.

I like this approach for the most part. I'd be inclined to actually skip sending the dates and just do something like an array of the current and last 4 days' pageview counts in reverse chronological order:

{
  "mostread": {
    "date": "2016-10-19Z",
    "articles": [
      [...]
      {
        "views": 298864,
        "views_last_five_days": [ 299326, 300858, 313782, 339489, 323108 ],
        "rank": 8,
        "title": "AMGTV",
        "pageid": 18746613, 
        [...]
      },  
      [...]
    ]
  }
}

(As you can see, the pageview counts seem not to match up 100% between different endpoints in the Pageview API, but I don't think that's a huge deal. They'll be close.)

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/AMGTV/daily/2016101500/2016101900
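For illustration, here is a minimal sketch of how the five-value array above could be derived from the per-article endpoint. The URL shape follows the link above. The helper names (`per_article_url`, `views_newest_first`) are hypothetical, and the response shape (an `items` list whose entries carry `timestamp` and `views`) is an assumption about the Pageview API, not something stated in this task. No network call is made; the sample payload reuses the counts from the JSON example above.

```python
from datetime import date, timedelta
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, title, last_day, days=5):
    """Build a daily per-article URL covering `days` days ending on `last_day`."""
    first_day = last_day - timedelta(days=days - 1)
    return "/".join([
        BASE, project, "all-access", "all-agents", quote(title, safe=""),
        "daily",
        first_day.strftime("%Y%m%d00"),
        last_day.strftime("%Y%m%d00"),
    ])


def views_newest_first(response):
    """Return view counts in reverse chronological order.

    Assumes `response["items"]` entries have `timestamp` (YYYYMMDDHH)
    and `views` fields.
    """
    items = sorted(response["items"], key=lambda i: i["timestamp"], reverse=True)
    return [i["views"] for i in items]


# Sample payload in the assumed shape, using the counts quoted above.
SAMPLE = {"items": [
    {"timestamp": "2016101500", "views": 323108},
    {"timestamp": "2016101600", "views": 339489},
    {"timestamp": "2016101700", "views": 313782},
    {"timestamp": "2016101800", "views": 300858},
    {"timestamp": "2016101900", "views": 299326},
]}

url = per_article_url("en.wikipedia", "AMGTV", date(2016, 10, 19))
last_five = views_newest_first(SAMPLE)
```

Sorting by timestamp rather than trusting response order keeps the "newest first" contract explicit on the producing side.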

Looks like we're getting rid of the views properties (https://gerrit.wikimedia.org/r/#/c/320269). So that's less confusing then. How can we convey succinctly that the array of views is in reverse chronological order?
Why is the value for views different from the first value in views_last_five_days?

@bearND @Mholloway @Pchelolo @GWicke @mobrovac @Milimetric

Hey all. In summary, we want the top non-bot pageviews on a given day and, for each of these entries, the last 5 days of pageviews, something like:

{
  "date": "2016-10-19Z",
  "articles": [
    {
      "title": "Dog",
      "viewsLastFiveDays": [ 299326, 300858, 313782, 339489, 323108 ],
      "rank": 8,
      ...
    }, 
    ...
  ]
}

It came up in our work planning meeting today that this task could be implemented multiple ways. Here are a few of the approaches I'm aware of:

  1. Request the top pages for a day X. Filter out bots. For each remaining entry in X, request five days of pageviews
  2. Request the top pages for a day X. Filter out bots. Request the top pages for the past five days. For results that appear both in X and one or more of the previous days, aggregate the pageview historical data for each result
  3. Update the Pageview API
  4. Something else.

I'm not sure what #4 is but it sounded like there was a RESTBase alternative. I'm currently pursuing #1 and hoping that caching makes this approach practical.
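A rough sketch of approach #1, to make the shape of the work concrete. Everything here is hypothetical: `most_read_with_history`, the `fetch_views` callback (standing in for one per-article request per title), and the `is_excluded` predicate (standing in for the bot filtering) are illustrative names, not the MCS implementation.

```python
def most_read_with_history(top_articles, fetch_views, is_excluded, days=5):
    """Attach a per-article view history to each remaining top entry.

    top_articles: list of dicts with at least a "title" key.
    fetch_views:  callable (title, days) -> list of view counts;
                  hypothetical stand-in for a per-article request.
    is_excluded:  callable (article) -> bool; hypothetical stand-in
                  for the filtering step.
    """
    result = []
    for article in top_articles:
        if is_excluded(article):
            continue
        entry = dict(article)  # copy so the input list is not mutated
        entry["viewsLastFiveDays"] = fetch_views(article["title"], days)
        result.append(entry)
    return result
```

Note that this issues one history lookup per remaining article, which is why caching matters for this approach.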

Just a note on NOT including dates…

We have recently seen that the 2 APIs (top read and the page view API) can sometimes be out of sync, where one API has newer data than the other. (Specifically, top read has newer data than the page view API.)

@JoeWalsh just actually worked around this issue in the iOS app.

For this reason, it may be worthwhile to include the dates as a hash rather than just an array. Because it may not be clear to the client which days of page view data it has. Even if the API is up to date, it is still a bit ambiguous because of the time delay for results rolling in and the fact that it is UTC.
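To illustrate why keyed dates help, here is a small sketch of a client-side check. It assumes a hypothetical response shape in which views are keyed by ISO date string (for example `"2016-12-05"`); the function names are made up for this example. With a bare array, the same check is not possible because a missing day is indistinguishable from a shifted one.

```python
from datetime import date, timedelta


def expected_days(last_day, days=5):
    """ISO date strings for `days` days ending on `last_day`, newest first."""
    return [(last_day - timedelta(days=n)).isoformat() for n in range(days)]


def missing_days(views_by_day, last_day, days=5):
    """Return the expected dates that have no entry in `views_by_day`."""
    return [d for d in expected_days(last_day, days) if d not in views_by_day]
```

A client could call `missing_days(...)` with the date it believes is current and see exactly which days have not rolled in yet.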

@Fjalapeno I assume with top read you mean MCS most read endpoint. I wonder how most read can have newer content if it depends on the PageView API. How did @JoeWalsh work around the issue?

Niedzielski changed the task status from Open to Stalled. Dec 5 2016, 7:11 PM

@bearND no it seems to be the Analytics endpoint itself - not the MCS.

Joe worked around it by taking the current day's views from the MCS top read result (and this is what is weird): the Pageview API was not updated yet, but it seems that top read was already finished.

I think you are making a (reasonable) assumption that the Most read depends on the results of the PageView API. But the way that analytics ingests the page view data and makes it available doesn't need to work that way and it appears that it doesn't.

  • #1 would work.
  • I think #2 is a non-starter since it would result in 0 views entries for articles that are not in the most-read results for all 5 days.
  • I'm not sure what #3 is.
  • The idea I had during the meeting (#4) was similarly flawed as #2. So, let's forget about that one.
  • Another idea would use the $merge functionality if there was a new endpoint for providing the page views of the last 5 days of a given page title and date. I don't think one exists at this time. I just wanted to mention it in case we want to add sparklines somewhere else (in the page view).

So, #1 seems like the way to go IMHO.
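The gap with #2 can be shown in a few lines. This is only a sketch with a hypothetical helper name and input shape (one `{title: views}` mapping per day, newest first): any day on which the article did not make the top list has no value to aggregate, so it comes out as 0.

```python
def history_from_top_lists(title, daily_top_lists):
    """Build a view history for `title` using only daily top lists.

    daily_top_lists: list of {title: views} dicts, newest day first.
    Days where the title is absent from the top list yield 0, because
    the top list carries no data for articles outside it.
    """
    return [day.get(title, 0) for day in daily_top_lists]
```

An article that spikes on day X but was not in the top results on the previous four days would get a history with four zeros, even though it did have views on those days.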

Well, I imagine #3 would be baking in the kind of logic you're doing in #1 for the top pages endpoint, like maybe with a ?last-five-days=true flag or something similar. That would be possible and maybe it would make sense if this is something you all need to rely on long-term. I can talk in a meeting if you want to brainstorm.

I am curious what MCS is and how it's updated more reliably than the Pageview API.

Well, I imagine #3 would be baking in the kind of logic you're doing in #1 for the top pages endpoint, like maybe with a ?last-five-days=true flag or something similar. That would be possible and maybe it would make sense if this is something you all need to rely on long-term. I can talk in a meeting if you want to brainstorm.

That sounds great. I think this would make this much easier for MCS.

I am curious what MCS is and how it's updated more reliably than the Pageview API.

MCS stands for Mobile Content Service. It's the node service that, amongst other things, hosts the implementation for the components of the aggregated feed endpoint. In this task we're talking about most-read.

The MCS most-read endpoint essentially just gets data from the Pageview API[1] and massages it for our use case, so it's impossible for it to be more up-to-date than the Pageview API is. So I'm confused about what could be going on with it appearing more up-to-date and what the workaround @Fjalapeno mentioned could be doing.

Is there a Phab task with discussion?

[1] https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/feed/most-read.js#L34-L61

@Mholloway correct me if I am wrong… the MCS gets its most read view data from:
https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia.org/all-access/2016/12/05

Which is not the same as:
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/Dog/daily/20161205/20161205

What appears to be happening is that the https://wikimedia.org/api/rest_v1/metrics/pageviews/top/ is updated before https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/

And so we used the views from the "top" API to supplement the missing data from the "per-article" API (which to be clear is available "eventually" just not as quickly)
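As a sketch of that workaround (not the actual iOS code; the function name and the date-keyed dict shape are assumptions for illustration): take whatever the "per-article" data has, and only fall back to the "top" value for the day that is still missing.

```python
def supplement_with_top(per_article_by_day, top_day, top_views):
    """Fill in `top_day` from the "top" result if per-article data lacks it.

    per_article_by_day: {date_string: views} from the per-article data.
    top_day, top_views: the day and view count reported by the "top" result.
    Existing per-article values are never overwritten.
    """
    merged = dict(per_article_by_day)  # leave the caller's dict untouched
    merged.setdefault(top_day, top_views)
    return merged
```

Once the per-article data catches up, the fallback simply stops applying, since `setdefault` keeps the existing value.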

@Milimetric are you able to explain why we could be seeing this behavior?

Ah, I see. Yeah, we don't use the per-article pageviews endpoint at all in MCS yet, although I think @Niedzielski will be using it for the patch for this ticket. @Milimetric or one of the Services engineers would probably have a better idea on what's going on behind the scenes that could cause a discrepancy between those two.

Yes, but it's a fundamental limitation. This is the oozie bundle of jobs that populates data for those endpoints [1]. These jobs run in parallel and the per-article one takes longer to execute due to the amount of data involved. If "n" is the number of articles, per-article needs to push O(n) data to Cassandra while top needs to push O(1). There's no real short-cut to that. Even if we were computing everything in real-time via stream processors, we still have a lot of data to copy over the network and write to disk and that takes a lot of time. So the overall time it takes us to update these endpoints might change but the top one will always update quicker than the per-article one while we are serving from disks.

This makes the need for a "views for the last 5 days" parameter or something similar even more obvious. Pageviews are not our focus for this next quarter, we're very much deep in work on editing data, but it's worthwhile to capture this requirement and start prioritizing it now. Please set up a meeting.

[1] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/bundle.xml

Change 330836 had a related patch set uploaded (by Niedzielski):
New: add last 5 days of pageviews to most-read response

https://gerrit.wikimedia.org/r/330836

Change 330836 merged by jenkins-bot:
New: add last 5 days of pageviews to most-read response

https://gerrit.wikimedia.org/r/330836