Create a method for 'Avg. daily views to pages that have uploaded files'
Closed, Resolved | Public | 5 Story Points

Description

Implementation details

Create a way to fetch the number of views to articles that embed files that were uploaded during the event, and display them as 'average daily views'.

The query should:

  • Fetch all files that were uploaded to Commons and to the local wikis specified in the event settings (the system currently only considers uploads to Commons; this new method needs to add the ability to fetch the list of files uploaded to the local wikis too)
  • For each of those uploaded files, get the articles they are used/embedded in. Article list should be unique (even if two files were uploaded to the same article, the article should be counted once)
  • For each unique article, fetch daily pageviews from the present back to a maximum of 30 days and total that number. Then, to get the per-article average:
    • If we have 30 days of pageviews, divide the total by 30.
    • If we have fewer than 30 days of pageviews (presumably because the article was created fewer than 30 days ago), divide the total by the number of days for which we have figures.
  • Next, total the individual per-page averages to get the grand total of "Avg. daily views to (all) files uploaded."
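
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Event Metrics implementation: the function names and the input shape (a mapping from article title to a list of daily view counts, oldest first, with `None` for days that have no figure) are hypothetical. Returning `None` when nothing is available is also an assumption, standing in for an "n/a" display.

```python
from typing import Dict, List, Optional

MAX_DAYS = 30  # the spec looks back at most 30 days


def per_article_average(daily_views: List[Optional[int]]) -> Optional[float]:
    """Average one article's views over the days that actually have figures."""
    recent = daily_views[-MAX_DAYS:]  # keep only the most recent 30 entries
    available = [v for v in recent if v is not None]
    if not available:
        return None  # no figures at all for this article
    # Divides by 30 when all 30 days are present, otherwise by the days we have.
    return sum(available) / len(available)


def avg_daily_views(views_by_article: Dict[str, List[Optional[int]]]) -> Optional[float]:
    """Sum the per-article averages; dict keys keep the article list unique."""
    averages = [per_article_average(v) for v in views_by_article.values()]
    known = [a for a in averages if a is not None]
    if not known:
        return None  # hypothetical stand-in for "n/a"
    return sum(known)
```

Keying the input by article title is what enforces the "counted once" rule: two files embedded in the same article still contribute a single entry.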

Deeper dive

Why we're doing this

Organizers, their sponsors and partners want to understand the impact of their work. One main way to do this for files uploaded is to see the number of pageviews those files get on the various article pages to which they are added. In our discussions, it has become clear that we can't get an accurate cumulative pageviews figure because we don't know the dates when specific files were added to specific articles. So instead, we will be providing a figure for "average daily pageviews".

Parameters

  • All filetypes: The figure will track images, video files, audio files and other upload types.
  • Uploads to Commons and local Wikipedias: We will track uploads to all wikis, so long as they are specified as wikis of interest for the event.
  • Pageviews on all wikis (not just those specified): The Main space articles whose pageviews we're counting can be on any wiki; the wikis do not need to be specified as wikis of interest in setup.
  • Reports this metric appears in: "Avg. daily views to files uploaded" appears in the Event Summary reports (CSV T205561, wikitext T206692 and onscreen T216447).
  • 30-day average: We're using a 30-day average to smooth out daily or weekly fluctuations.

Related Objects

Event Timeline

Nuria added a comment. Nov 9 2018, 6:45 PM

@MusikAnimal
All data is available in the Hive table mediacounts; you can hit as many files as needed with a Hive SQL query. See:

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts#Analytics_cluster

@MusikAnimal, echoing Nuria here, definitely ping me on IRC before you do anything as manual as that :) This is easy to query in Hive, and I'm here to help demystify Hadoop, just ping me.

jmatazzoni renamed this task from Put 'Avg. daily views to files uploaded' metric into 'Event Summary' reports to Createa method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports. Nov 28 2018, 7:53 PM
jmatazzoni renamed this task from Createa method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports to Create a method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports.
Mooeypoo updated the task description. Jan 11 2019, 10:45 PM
Mooeypoo updated the task description. Jan 11 2019, 10:49 PM
MusikAnimal renamed this task from Create a method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports to Create a method for putting 'Avg. daily views to pages that have files uploaded' metric into 'Event Summary' reports. Jan 16 2019, 12:16 AM
Nuria added a comment. Jan 16 2019, 4:01 PM

Before advertising this metric, it would be good to quantify how closely this approximation matches the "real" number. That can easily be done by comparing the results of this calculation with the data in the mediacounts table in Hive.

jmatazzoni updated the task description.
jmatazzoni renamed this task from Create a method for putting 'Avg. daily views to pages that have files uploaded' metric into 'Event Summary' reports to Create a method for 'Avg. daily views to pages that have uploaded files'. Feb 6 2019, 11:48 PM
jmatazzoni updated the task description.
jmatazzoni updated the task description.
MusikAnimal moved this task from Ready to In Development on the Community-Tech-Sprint board.

@dom_walden, if you approve this one, we can go ahead and check off on T206692 that 'Avg. daily views to files uploaded' is complete. (I see the figure in the report, but will wait for you to say it is accurate.)

From the description:

For articles that are "younger" than 30 days (were created less than 30 days ago) -- divide the monthly number by the number of days the article exists.

As we discussed, this poses a bit of a challenge. Instead we're getting the average over the past 30 days or the start of the event, whichever is shorter. When the articles were created during that time is not taken into account. Statistically I think this makes sense... as we're looking at the "big picture". Other popular tools that get pageviews of multiple articles, such as https://tools.wmflabs.org/massviews, work in the same way. Note however https://tools.wmflabs.org/pageviews is smart enough to only count since the article was created, so QA may have to do a little math. Hope this makes sense.

This comment was removed by jmatazzoni.

@MusikAnimal, I tried to rewrite the spec for creating average daily views as we discussed in the standup today, but I don't think I have it right. I've put the old and new versions below, so you can clearly see the changes I made.

[old version]

  • For each of those articles, fetch the monthly page views.
    • For articles that are older than 30 days, divide the monthly page views by 30.
    • For articles that are "younger" than 30 days (were created less than 30 days ago) -- divide the monthly number by the number of days the article exists.
    • If no days are available (i.e., if it's the first day), then display "n/a" for "not available".

[Here's the rewrite]

  • For each of those articles, fetch the daily page views:
    • If the event was fewer than 30 days ago, begin from the start-date of the event and count forward to the last update; use as many days as are available, total up the daily views and divide by the number of available days to get the average.
    • If the event was more than 30 days ago, use pageviews from the MOST RECENT 30 days only and divide by 30 to get the average.
    • If no days are available (i.e., if it's the first day), then display "n/a" for "not available".

But this can't be right, because it completely doesn't account for the case where the EVENT was more than 30 days ago but an individual ARTICLE was created yesterday. (The previous spec was about the age of the article, not the event.) Can you please take a stab at describing your scheme?

The previous spec was about the age of the article, not the event.

It's not so much the age of the event, since we always grab the past 30 days of data, or since the start of the event, whatever is shortest.

The question is do we want pageviews for the event as a whole, or do we need to respect the age of each individual article? Going by the age of the articles means we need to fire off a bunch of extra queries, which will slow things down. We should give thought to cost/benefit.

Statistically, it makes sense to me to go by the event as a whole. Here's an example; say the event was started 3 days ago (- means the article didn't exist at that time):

Article  Day 1  Day 2  Day 3
Foo      -      -      3
Bar      3      0      0
Baz      3      3      3
Total    6      3      6

(6 + 3 + 6) / 3 = average of 5 pageviews a day for the event as a whole. This is opposed to averaging the averages, which I guess is how you would do it if you went by article age: "Foo" has a daily average of 3, "Bar" has 1, and "Baz" has 3. (3 + 1 + 3) / 3 = 2.3 pageviews a day.

They are both correct, it's just a matter of which metric you want. I think the "event as a whole" is an easier concept to wrap your head around, and it is more efficient technically since we don't need to query for the article age.

MusikAnimal added a comment. Edited Feb 28 2019, 7:21 PM

Going by the age of the articles means we need to fire off a bunch of extra queries, which will slow things down. We should give thought to cost/benefit.

Hmm, actually the pageviews API conveniently gives no data for days that the article didn't exist (as opposed to 0), so maybe we don't need to query for the article age. In this example, the article was created August 29. I'm asking for pageviews from the 25th through September 1, and there are no entries for the 25th-28th:

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/user/Hanksy/daily/20150825/20150901

So I guess you can disregard the performance concern. I think the "event as a whole" metric, as I explained above, is still easier to understand, but it can skew the data if you want per-article granularity. Taking your example, if the event started > 30 days ago, and the first article only popped up yesterday with 30 pageviews, you get an average of 30 pageviews/day, as opposed to 1 if you went by the impact of the "event as a whole". Again they're both correct, depending on how you look at it.
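
As a rough sketch of how that behaviour can be used: the helper below builds the same kind of request URL as the example above and averages over however many daily entries come back, so days before the article existed simply do not count toward the denominator. The function names are hypothetical, and the code assumes the response is a JSON object with an `items` list whose entries each carry a `views` count. No network request is made here; fetching and decoding the JSON is left to the caller.

```python
from typing import Any, Dict, Optional
from urllib.parse import quote

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project: str, article: str, start: str, end: str) -> str:
    """Build a daily per-article URL; start and end are YYYYMMDD strings."""
    # Titles use underscores for spaces and must be percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{API_ROOT}/{project}/all-access/user/{title}/daily/{start}/{end}"


def average_from_response(payload: Dict[str, Any]) -> Optional[float]:
    """Average the views over the days actually present in the response."""
    items = payload.get("items", [])
    if not items:
        return None  # nothing returned for the requested range
    return sum(item["views"] for item in items) / len(items)
```

Because the denominator is the number of returned entries rather than the length of the requested range, this gives the per-article (article-age) average without a separate query for the creation date.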

If no days are available (i.e., if it's the first day), then display "n/a" for "not available".

I don't think we did this. I will fix this now.

@jmatazzoni Let me know which way you want to go with T206700#4990282. I'm assuming you'd prefer going by article age, as originally planned. It will take a little bit to implement this but it can be done.

This comment was removed by jmatazzoni.
MusikAnimal added a comment. Edited Feb 28 2019, 10:33 PM

Bear with me, I'm not a statistics whiz either. This stuff makes my head hurt!

But we're looking for a daily average. So even if you're not calculating per-page averages, you still need to calculate an average for each day, based on how many articles there were that day, right? That, I think, would be something more like this (pls excuse if my notation is wrong—not my area!):

(((3 + 3) / 2 articles) + ((0 + 3) / 2 articles) + ((3 + 0 + 3) / 3 articles)) / 3 days = 2.2

I think this would be the average per-article for each day, summed. Average pageviews per day would be total pageviews / number of days, no?

I think my formula that respects the article age was wrong though... the expectation is likely the sum of the averages (not average of averages):

Article  Day 1  Day 2  Day 3  Average
Foo      -      -      3      3
Bar      3      0      0      1
Baz      3      3      3      3
Total    6      3      6      7

And the "event as a whole" (again, using the sum of the averages for each article, except this time using the total number of days as the denominator):

Article  Day 1  Day 2  Day 3  Average
Foo      -      -      3      1
Bar      3      0      0      1
Baz      3      3      3      3
Total    6      3      6      5

In my opinion the latter still makes more sense. It's a simple division of the total pageviews and the number of days.

Pageviews Analysis is by no means an authority, but this is the behaviour it uses: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&start=2015-08-25&end=2015-09-01&pages=Hanksy|The_Pizza_Underground Note that "Hanksy" was created on August 29 (only 4 days within the time period) and has a total of 199 pageviews, so its daily average is 199 / 8 days = 25. "The Pizza Underground" (which existed for all days in the period) has a daily average of 288. Add that to "Hanksy"'s average and you get 313 overall.

jmatazzoni added a comment. Edited Feb 28 2019, 10:51 PM

In T206700#4992105, @MusikAnimal wrote:

The previous spec was about the age of the article, not the event.

It's not so much the age of the event, since we always grab the past 30 days of data, or since the start of the event, whatever is shortest.

The question is do we want pageviews for the event as a whole, or do we need to respect the age of each individual article? Going by the age of the articles means we need to fire off a bunch of extra queries, which will slow things down. We should give thought to cost/benefit.

Statistically, it makes sense to me to go by the event as a whole. Here's an example; say the event was started 3 days ago (- means the article didn't exist at that time):

Article  Day 1  Day 2  Day 3
Foo      -      -      3
Bar      3      0      0
Baz      3      3      3
Total    6      3      6

(6 + 3 + 6) / 3 = average of 5 pageviews a day for the event as a whole. This is opposed to averaging the averages, which I guess is how you would do it if you went by article age: "Foo" has a daily average of 3, "Bar" has 1, and "Baz" has 3. (3 + 1 + 3) / 3 = 2.3 pageviews a day.

They are both correct, it's just a matter of which metric you want. I think the "event as a whole" is an easier concept to wrap your head around, and it is more efficient technically since we don't need to query for the article age.

I can't believe this sentence is about to come from me, but I think there is an error in the way you're calculating the per-page method—which accounts for the very large difference between what should both be valid methods. Your formula went something like this:

((3/1) + (3/3) + (9/3))/3 = 2.3

But that is one step too many; what you have there is the pageviews for the average article. What we want is per day, which would just be the total of each article's per-day average:

(3/1) + (3/3) + (9/3) = 7

The alternative is what you proposed, which is to average each day's per-day total:

((3+3) + (0+3) + (3+0+3))/3 = 5

Did I get that right? If so, the difference is 7 versus 5, and then yes, both seem like valid methods.
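
For reference, the three figures in this exchange can be reproduced from the Foo/Bar/Baz example with a few lines of Python. This is only an arithmetic check; the variable names are made up for the illustration, and `None` marks days on which the article did not exist.

```python
from typing import Dict, List, Optional

views: Dict[str, List[Optional[int]]] = {
    "Foo": [None, None, 3],  # created on day 3
    "Bar": [3, 0, 0],
    "Baz": [3, 3, 3],
}
event_days = 3


def article_average(days: List[Optional[int]]) -> float:
    """Average over the days the article existed."""
    known = [v for v in days if v is not None]
    return sum(known) / len(known)


per_article = {name: article_average(days) for name, days in views.items()}

# Per-article method: total of each article's per-day average (3 + 1 + 3).
sum_of_averages = sum(per_article.values())

# The earlier, mistaken formula: one division too many (7 / 3 articles).
average_of_averages = sum_of_averages / len(per_article)

# "Event as a whole": all pageviews divided by the number of event days (15 / 3).
total_views = sum(v for days in views.values() for v in days if v is not None)
event_as_whole = total_views / event_days

print(sum_of_averages, round(average_of_averages, 1), event_as_whole)
```

The two valid methods differ only in the denominator applied to each article: the days the article existed, or the days of the event.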

jmatazzoni added a comment. Edited Feb 28 2019, 11:06 PM

Unfortunately you saw my erroneous post before I could delete it. Sorry for the confusion! In the end, it looks like we both figured it out and came to the same place. Whew--that was fun. Now the question is which method should we use?

In T206700#4992119, @MusikAnimal wrote:

...So I guess you can disregard the performance concern. I think the "event as a whole" metric, as I explained above, is still easier to understand, but it can skew the data if you want per-article granularity.

IF performance and level of effort are the same for both methods, then I think the per-article method might still be better. The reason is that we eventually want the Files Uploaded report (T212547), which includes a per-article Avg. Daily Pageviews figure. That report is currently off our list, but I hope to get back to it some day, and it seems like doing the method on a per-article basis would set that up.

So, as I say, that's if performance AND level of effort are the same. Are they? If the per-article way is harder, we can let the future take care of itself I suppose...

Pageviews on all wikis (not just those specified): The Main space articles whose pageviews we're counting can be on any wiki; the wikis do not need to be specified as wikis of interest in setup.

@MusikAnimal I haven't been able to follow all the discussion that has gone on in this task, but do we still want the above to be true?

It appears that for images uploaded to commons we also need to specify the wikis where the uploaded images are linked (or all wikis), in order to get a figure for 'Avg. daily views to pages that have uploaded files'.

Compare event with only commons (and wikidata): https://eventmetrics-dev.wmflabs.org/programs/108/events/223
with identical event for all wikis: https://eventmetrics-dev.wmflabs.org/programs/108/events/287

In T206700#4993697, @dom_walden wrote:

Pageviews on all wikis (not just those specified): The Main space articles whose pageviews we're counting can be on any wiki; the wikis do not need to be specified as wikis of interest in setup.

do we still want the above to be true?

That is required. The whole idea of uploading files to Commons is that those images, etc. are then available to all wikis. So we want to find all the pages where they are being placed.

Pageviews on all wikis (not just those specified): The Main space articles whose pageviews we're counting can be on any wiki; the wikis do not need to be specified as wikis of interest in setup.

This was not implemented, until now :) This fix is up for code review: https://github.com/wikimedia/eventmetrics/pull/206

In relation to T206700#4992838, I'm going to investigate using the per-article strategy we talked about (which was in the original spec). This will be part of a separate PR.

@MusikAnimal and @dom_walden, now that we've selected the per-article approach and talked it through so that we have a clear idea of how it should work, I've rewritten the Description again to reflect this new understanding. Here's what I wrote; please have a look:

  • For each unique article, fetch daily pageviews from the present back to a maximum of 30 days and total that number. Then, to get the per-article average:
    • If we have 30 days of pageviews, divide the total by 30.
    • If we have fewer than 30 days of pageviews (presumably because the article was created fewer than 30 days ago), divide the total by the number of days for which we have figures.
  • Next total the individual per-page averages to get the grand-total of "Avg. daily views to (all) files uploaded."

Does that sound right?

@jmatazzoni Sounds perfect. I've got a fix in with b6b5516, but #206 should be reviewed first.

All merged! Ready for QA :)

Constructed a number of events where the pages improved were the same as the pages which had uploaded files. Checked that 'Avg. daily views to pages that have uploaded files' = 'Avg. daily views to pages improved'. (e.g. https://eventmetrics-dev.wmflabs.org/programs/126/events/286)

Then looked at events with files uploaded to commons and saw they report the same page views as the pageviews tool, regardless of whether the specific wiki was included in the event. (e.g. https://eventmetrics-dev.wmflabs.org/programs/126/events/290)

I also took a UTC event and changed the date to the equivalent time in the "Asia/Jakarta" timezone; got the same results. This was to check that we are doing timezone normalisation. (https://eventmetrics-dev.wmflabs.org/programs/126/events/291)
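
The idea behind that timezone check can be illustrated with a small sketch. The event date here is hypothetical, and a fixed UTC+7 offset is used as a stand-in for "Asia/Jakarta" (which has no daylight saving time) so the example does not depend on a timezone database being installed. It is not the code Event Metrics uses.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+7 offset stands in for the "Asia/Jakarta" zone.
JAKARTA = timezone(timedelta(hours=7), name="WIB")


def to_utc(local: datetime) -> datetime:
    """Normalise an aware datetime to UTC before using it in queries."""
    return local.astimezone(timezone.utc)


# Hypothetical event start, first expressed in UTC...
utc_start = datetime(2019, 3, 1, 0, 0, tzinfo=timezone.utc)
# ...then as the equivalent wall-clock time in Jakarta (07:00 the same day).
jakarta_start = utc_start.astimezone(JAKARTA)

# After normalisation both describe the same instant, so results should match.
same_instant = to_utc(jakarta_start) == utc_start
```

If the two versions of the event normalise to the same UTC range, any difference in the reported figures would point to a normalisation bug rather than to the data.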

...I've got a fix in with b6b5516, but #206 should be reviewed first.

I could see no problems. Ready for the next commit, if that is the intention.

Sorry, I was referring to code review. Everything for this ticket is merged and ready for QA :)

Ah, OK. In that case, I will put this in Product Sign-Off.

jmatazzoni closed this task as Resolved. Mar 4 2019, 4:46 PM

In T206700#4998284, @dom_walden wrote:

Constructed a number of events where the pages improved were the same as the pages which had uploaded files. Checked that 'Avg. daily views to pages that have uploaded files' = 'Avg. daily views to pages improved'. (e.g. https://eventmetrics-dev.wmflabs.org/programs/126/events/286)

Ooooh. That's genius Dom. Such a satisfying proof! I put a screenshot in just because I like it so much.