
Analyze how wrong using out of date page view data would be
Closed, Resolved · Public

Description

We want to integrate page view information into the scoring algorithms we use for both the completion suggestions and our regular search results.

Our initial idea is to only update this page view information during normal document updates after a page edit. We need to analyze whether this page view data will be "good enough" or whether we need to do something more. Maybe any page not edited in the last 30 days needs its page view data updated, maybe not.

I'm guessing we could look at the distribution of how often pages are edited and compare that against how much page view data tends to change over that time. The page view information is all available in Hive from stat1002.

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description.
EBernhardson subscribed.

@dcausse I suppose we also need to decide what exactly we mean by 'page view data'. I was thinking we would store a value like average page views per day over the previous week?

Yes, I think an average is good. I don't know about the window (1 week / 1 month / more?); if the window is too small, maybe we won't have many pages with this data?
If we go with the option to set page views on doc update, we should take a large window (6 months maybe?) and, like you said, refresh this data if the page has not been edited in the last 30 days.
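The windowed average and 30-day staleness rule being discussed could be sketched like this (illustrative Python; the dict-of-dates input shape and function names are my assumptions, not anything decided in this thread):

```python
from datetime import date, timedelta

def average_daily_views(daily_counts, window_days=7):
    """Average page views per day over the trailing window.

    daily_counts: dict mapping date -> view count (hypothetical shape;
    real data would come from Hive / the pageview dumps).
    """
    latest = max(daily_counts)
    window = [daily_counts.get(latest - timedelta(days=d), 0)
              for d in range(window_days)]
    return sum(window) / window_days

def needs_refresh(last_edit, today, max_age_days=30):
    """True if the stored page view score is considered stale
    (page not edited in the last max_age_days)."""
    return (today - last_edit).days > max_age_days

# Toy example: a week of counts 100..106 averages to 103.0
counts = {date(2015, 10, 1) + timedelta(days=i): 100 + i for i in range(7)}
avg = average_daily_views(counts, window_days=7)  # → 103.0
```

A longer window (e.g. 6 months) would just mean a larger `window_days`, subject to how much webrequest data Hive actually retains.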

But I agree with you we should learn a bit more about how often pages are edited (I can extract a "small" csv file with pageTitle;timestamp for enwiki if that helps).

6 months might be out of our reach, but I'd have to check with Analytics to be sure. Hive currently has partitions for exactly 2 months' worth of webrequest data (July 16 → Sept 16).

Deskana subscribed.

Moving to "In progress" assuming that work on this is ongoing. However, this task has been open for a few weeks now. Is this still needed? Does the task need further definition? We should assess this.

@Deskana: work is imminent. @EBernhardson asked me to take this, but said it was lower priority, so I worked on finishing up other tasks before the offsite. I didn't work on it during the offsite, and this week has been very unproductive. I was planning on making it top priority after finishing up T114673, which @EBernhardson also asked me to look at, and which is also dragging.

I realized this may not be the best-described task. Would it be too much to ask to generalize this into something more like: analyze/decide the best way to aggregate hourly page view data into a 'page view score'?

@EBernhardson, I think figuring out how to use the page view data is a different task, which I or other people could work on.

The discussion includes using daily averages over the last week, so daily is the granularity I'm working with. Right now, I've fetched 30 days worth of daily page view data, identified daily outliers, and fetched edit data for those pages; I still need to correlate page view spikes with edits, since that seems to be the main method of updating page views, based on the discussion here. If we want to get more granular, at the hourly level, I can do that, too, esp. for a subset of "interesting" pages.
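The "identified daily outliers" step could look something like this (a sketch only; the threshold rule of "more than some multiple of the monthly median" is my assumption for illustration, not the exact method used in the analysis):

```python
import statistics

def daily_outliers(series, factor=3.0):
    """Flag days whose view count exceeds `factor` times the median
    of the whole series (e.g. 30 days of daily counts).

    `series` is a list of (day, views) pairs.
    """
    med = statistics.median(v for _, v in series)
    return [(day, v) for day, v in series if v > factor * med]

# 29 quiet days at 100 views plus one 900-view spike
views = [("2015-10-%02d" % d, 100) for d in range(1, 30)] + [("2015-10-30", 900)]
spikes = daily_outliers(views)  # → [("2015-10-30", 900)]
```

Each flagged day would then be joined against the edit data to see whether an edit (and hence a document update) happened close to the spike.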

In terms of how to use page views, I think @dcausse has ideas on that, and @Ironholds has at least thought about finding a proxy measure (since page views may be hard to work with).

Yeah, I was going to dig into edits as a proxy. I suspect it wouldn't be a great proxy, but that was my hackathon project (which got pushed aside because I actually spent the hackathon, you know, getting to my plane ;p)

Status update: this is still lower priority than general relevance (interviewing for a Relevance Engineer and Relevance Lab T114673), but I'll get to it.

Note to self: After talking to Erik and David, I'm going to:

  • generate a plot of median vs max for the month, to get a sense of how many outliers there are, and how big the spikes are (and report #/% of spikes)
  • generate a plot of week to week change in daily average over a week, with a one-week gap, two-week gap, and three-week gap, to get a sense of how fast things change week to week (and report #/% of significant changes by time span)
  • calculate the #/% of outlier spikes that do not have an edit the same day
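The first two planned measurements could be sketched as follows (illustrative Python; the exact ratio definitions and week boundaries here are my assumptions about what "median vs max" and "week-to-week change with a gap" mean):

```python
import statistics

def spike_ratio(daily):
    """Max/median ratio for a month of daily counts; a large ratio
    indicates a spiky page."""
    return max(daily) / statistics.median(daily)

def weekly_change(daily, gap_weeks=1):
    """Ratio between the daily average of the first week and that of a
    later week separated by `gap_weeks` full weeks (needs enough days
    of data to cover the gap)."""
    first = sum(daily[:7]) / 7
    start = 7 * (1 + gap_weeks)
    later = sum(daily[start:start + 7]) / 7
    return later / first

# A flat month with one 5x spike, and a series that doubles
# after a one-week gap
assert spike_ratio([100] * 29 + [500]) == 5.0
assert weekly_change([100] * 7 + [0] * 7 + [200] * 7, gap_weeks=1) == 2.0
```

The #/% of "significant" changes would then just be a count of pages whose ratio crosses some chosen threshold (e.g. a factor of 2).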

Bumping this out of the sprint given the fairly large number of cards in our sprint's backlog column. Feel free to move this back into the sprint and pick this up again if we have time.

I want to note here that this task does seem super important for the longer term. A few external partners (e.g. Internet Archive) commented on how using page views to influence rankings in search results is a good approach. But we just can't prioritise it right now given everything else that's going on.

I will write a proposed Q3 goal to prioritise using page view data (and other data) to influence result ranking, and we can discuss that.

Write up complete: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/How_Wrong_Would_Using_Out_of_Date_Page_View_Data_Be%3F

Summary:

  • We can't reliably catch day-by-day outliers by using the page view information that comes along with edits because not enough edits happen.
  • Weekly averages (rather than day-by-day counts) don't usually move that much (i.e., by more than a factor of 2). If we can capture daily or weekly page view stats, that should keep us reasonably up-to-date overall, esp. if these moderate swings don't affect scoring much.
  • We could gather daily statistics from the page view API and store the high mark over the last 3-7 days for the top 1K to 50K most-viewed articles. The ranking algorithm could use either the rolling daily average or the high mark (whichever is higher).
  • For "Trending" topics, looking at the top 1K page views every hour (unfortunately not currently available through the PageviewAPI) would be the best way to catch suddenly trending topics if we want to be more responsive, but it isn't clear that it's worth it.
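The third bullet's "rolling daily average or the high mark, whichever is higher" combination could be sketched like this (a hypothetical `page_view_score`; the 30-day averaging window is my assumption, chosen so the longer-term average and the recent high mark can actually differ):

```python
def page_view_score(daily_views, avg_days=30, high_mark_days=7):
    """Candidate 'page view score': the larger of a longer-term rolling
    daily average and the high mark over the most recent days.

    daily_views: list of daily counts, most recent last.
    """
    recent = daily_views[-avg_days:]
    rolling_avg = sum(recent) / len(recent)
    high_mark = max(daily_views[-high_mark_days:])
    return max(rolling_avg, high_mark)

# A steady page scores its average; a page with a recent spike
# scores its high mark
assert page_view_score([10] * 30) == 10
assert page_view_score([10] * 29 + [80]) == 80
```

Taking the max means a recent spike lifts the score immediately, while the rolling average keeps a steadily popular page from being penalized for one quiet week.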

@TJones Thank you, that analysis looks excellent. As a final step before calling this task "done", can you create any follow-up tasks that are required as a result of your analysis?

@Deskana Thanks! As I noted above, figuring out what to do with this information is a different task, and it's been so long since we first looked into characterizing the page view data that I'm not sure what the next steps are. @dcausse has been looking at how to use page view data in scoring, so maybe he has an idea of what to do next.

I see now (and am reminded) that there was discussion above about using hourly data. I think it's clear that the obvious pipelines (page views that come with edits, and the PageviewAPI) aren't currently up to delivering that.

@TJones Understood. Since this just modifies our thinking about the future, and our knowledge about whether we can use hourly/daily/weekly/monthly page views for things, I think we can consider this resolved by your write-up.