Create a method for putting 'Avg. daily views to pages that have files uploaded' metric into 'Event Summary' reports
Open, Needs Triage · Public · 5 Story Points

Description

Implementation details

Create a way to fetch the number of views to articles that embed files uploaded during the event, and display the result as 'average daily views'.

The query should:

  • Fetch all files that were uploaded to Commons and to the local wikis specified in the event's settings (the system currently only considers uploads to Commons; this new method needs to add the ability to fetch the list of files uploaded to the local wikis too)
  • For each of those uploaded files, get the articles they are used/embedded in. The article list should be de-duplicated (even if two of the uploaded files are embedded in the same article, that article should be counted once)
  • For each of those articles, fetch the monthly page views.
    • For articles that are older than 30 days, divide the monthly page views by 30.
    • For articles that are "younger" than 30 days (were created less than 30 days ago), divide the monthly number by the number of days the article has existed.
  • Sum all the individual per-article daily averages into one figure and return it (a sketch of one possible approach follows this list)
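
Below is a minimal sketch, in Python, of how this could be wired together. It is not the Event Metrics implementation: the function names and the idea of passing in an already-gathered list of uploaded file titles are assumptions, and the GlobalUsage lookup only covers the Commons case (files uploaded only to a local wiki would need that wiki's imageusage API instead). The Commons API and Wikimedia Pageview REST API endpoints used here are real.

```python
import datetime
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
PAGEVIEWS = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
             "{project}/all-access/user/{title}/daily/{start}/{end}")

def pages_embedding(file_title):
    """Yield (wiki, page_title) for every page that embeds the file, on any wiki.

    Uses the GlobalUsage API on Commons; titles must include the "File:" prefix.
    """
    params = {"action": "query", "format": "json", "prop": "globalusage",
              "titles": file_title, "gulimit": "max"}
    data = requests.get(COMMONS_API, params=params).json()
    for page in data["query"]["pages"].values():
        for usage in page.get("globalusage", []):
            yield usage["wiki"], usage["title"]

def avg_daily_views(file_titles, today=None):
    """Sum per-article average daily views for all pages embedding the given files."""
    today = today or datetime.date.today()
    start = (today - datetime.timedelta(days=30)).strftime("%Y%m%d")
    end = today.strftime("%Y%m%d")
    # De-duplicate: an article embedding two of the event's files is counted once.
    pages = {p for title in file_titles for p in pages_embedding(title)}
    total = 0.0
    for wiki, page_title in pages:
        url = PAGEVIEWS.format(project=wiki, title=page_title.replace(" ", "_"),
                               start=start, end=end)
        items = requests.get(url).json().get("items", [])
        views = sum(item["views"] for item in items)
        # Days with pageview data serve as a rough proxy for article age here;
        # a real implementation would check the page creation date instead.
        days = min(30, len(items)) or 1
        total += views / days
    return total

# Example call (hypothetical file names):
# print(avg_daily_views(["File:Example_event_photo_1.jpg", "File:Example_event_photo_2.jpg"]))
```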

Deeper dive

Organizers, their sponsors and partners want to understand the impact of their work. One main way to do this for files uploaded is to see the number of pageviews those files get on the various article pages to which they are added. This figure will be reported in the Event Summary reports (T205561 and T206692); it will also be used in the to-be-defined Files Uploaded report.

In our discussions, it has become clear that we can't get an accurate cumulative pageviews figure (see Problems and Alternative Approaches, below). So instead, we will be providing a figure for "average daily pageviews".

Definition and parameters

  • All filetypes: The figure will track images, video files, audio files and other upload types.
  • Uploads to Commons and local Wikipedias: Previous stats have tracked uploads to Commons only, but it is not unusual for users to upload directly to a Wikipedia. So we will track uploads to all wikis specified for the event and include those in the metric.
  • Pageviews on all wikis (not just those specified): The hypothesis here is that over time, images uploaded will spread to more and more articles and, as articles are translated, more and more wikis. We want to gauge the full impact of the upload, so it would be antithetical to our purpose to count pageviews only on the wikis specified for the event.
  • Method: To make this metric as valid as possible by smoothing out daily or weekly fluctuations, I propose we do the following (a short worked example follows the list):
    1. Looking at the most recent day available, find the articles—on all wikis—on which the images from the event have been placed,
    2. Get the pageview count for all those articles over the past 30 days (it's OK that not all the images will have actually been on all those pages during that entire period).
    3. Average that 30-day figure and express it as a daily average.
    4. If page creation date is < 30 days ago, use the number of available days and average by that number.
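
To make the arithmetic concrete (the numbers below are made up): suppose the event's images ended up on three articles. Article A, created long ago, got 3,000 views over the past 30 days and contributes 3000 / 30 = 100; Article B got 600 views and contributes 600 / 30 = 20; Article C was created 12 days ago and got 240 views, so it is averaged over 12 days and contributes 240 / 12 = 20. The reported figure would be 100 + 20 + 20 = 140 average daily pageviews.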

Problems and Alternative Approaches

  • What date was the image added to a page? We are able to provide a figure for the number of "Pages with uploaded files." So far so good. To have an accurate picture of how many pageviews the image has received, however, we need to know what date the image was added to each page it is on. Apparently, this date is not recorded.
  • The data is in the Data Lake, but... The problem mentioned above would be irrelevant if we could simply get a count for how many times the image was requested. This number is apparently available in a stream called "mediacounts" that's in the Data Lake. But there is no easy way for us to get the information out of the Lake and into Tool Forge at scale. An API for this is planned, possibly for some time in the next year (T88775).
jmatazzoni updated the task description. Oct 10 2018, 10:19 PM

@Nettrom @Nuria @MusikAnimal do you have any insights on how we might best arrive at the metric we're looking for here? In particular, is what I say above under "Problems and Alternative Approaches" correct? I.e., is there no painless way to either keep track of the dates when images land on articles or to get the "mediacount" number we want?

This number is is apparently available in a stream called "mediacounts that's in the Data Lake [..] and the Data Lake contains personal information that may make accessing it from Event Metrics' home on Tool Forge.

I think you got some wrong cut and paste here. mediacounts are available in public files downloadable by everyone: https://dumps.wikimedia.org/other/mediacounts/daily/2018/
Mediacounts count how many times a particular file has been transferred to users (not played); figures are provided per file, not per wiki. Each file is about 2 GB decompressed.
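
For anyone who wants to poke at those dumps, here is a rough sketch in Python of pulling a few files' counts out of one daily dump. The column layout is an assumption taken from the mediacounts documentation on wikitech (file path in the first column, total transfer count in the third), and the file and URL names are illustrative, so verify against the docs before relying on this.

```python
import bz2
import csv

# Daily dumps are published as bzip2-compressed TSVs; the file name pattern below is illustrative.
DUMP = "mediacounts.2018-10-15.v00.tsv.bz2"

# Files are keyed by their upload.wikimedia.org path, not by page title (hypothetical path).
WANTED = {"/wikipedia/commons/a/a9/Example.jpg"}

def transfer_counts(dump_path, wanted):
    """Return {file_path: count} for the wanted files in one daily dump.

    Assumes column 0 is the file path and column 2 the total transfer count,
    per the wikitech mediacounts docs; check the docs before trusting this.
    """
    counts = {}
    with bz2.open(dump_path, "rt", newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if row and row[0] in wanted:
                counts[row[0]] = int(row[2])
    return counts

print(transfer_counts(DUMP, WANTED))
```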

do you have any insights on how we might best arrive at the metric we're looking for here?

There is a big difference between video/audio files and images. You might be able to tell whether the user has loaded the video but -unless this is instrumented in EventLogging- there is no way to tell whether it was played. An image, however, once viewed is "consumed". So really the notion of a "pageview" needs to be further defined for videos and audio, as you can download the content in your browser w/o consuming it at all.

Nettrom edited subscribers, added: nettrom_WMF; removed: Nettrom. Oct 10 2018, 11:29 PM
jmatazzoni added a comment (edited). Oct 11 2018, 1:39 AM

In T206700#4656851, @Nuria wrote:

This number is is apparently available in a stream called "mediacounts that's in the Data Lake [..] and the Data Lake contains personal information that may make accessing it from Event Metrics' home on Tool Forge.

I think you got some wrong cut and paste here. mediacounts are available in public files downloadable by everyone: https://dumps.wikimedia.org/other/mediacounts/daily/2018/
Mediacounts count how many times a particular file has been transferred to users (not played); figures are provided per file, not per wiki. Each file is about 2 GB decompressed.

do you have any insights on how we might best arrive at the metric we're looking for here?

There is a big difference between video/audio files and images. You might be able to tell whether the user has loaded the video but -unless this is instrumented in EventLogging- there is no way to tell whether it was played. An image, however, once viewed is "consumed". So really the notion of a "pageview" needs to be further defined for videos and audio, as you can download the content in your browser w/o consuming it at all.

Actually Nuria, the issue here is not plays. It is views. We want to know how many times uploaded files—primarily images—were viewed on the various articles to which they were added (at different times).

The question of plays for media files will be addressed elsewhere. As I say, images are the primary interest here. If we could get pageviews for images, then I'd be happy to leave video and audio files out of this metric entirely. My understanding from @Catrope and @Neil_P._Quinn_WMF is that such data for the image files themselves is in the Data Lake. But they saw no way to access it on request from Tool Forge (Roan and Neil will correct me if I misstated their comments).

jmatazzoni updated the task description. Oct 11 2018, 1:43 AM

I think we need to establish exactly what data you're looking for first. Sounds like @Neil_P._Quinn_WMF could do that. But if he said the data's in the Data Lake, it sounds like you might need to join pageviews data to metadata about the media files themselves. These are just articles on Commons, so we could theoretically create a stream or hourly-updated dataset with this information. But we probably wouldn't build single-purpose infrastructure for just this. Let's wait until we have the specifics and we can brainstorm the best way forward.

Nuria added a comment. Oct 11 2018, 3:01 AM

We want to know how many times uploaded files—primarily images—were viewed on the various articles to which they were added (at different times).

The files at https://dumps.wikimedia.org/other/mediacounts/daily/2018/ provide views per image file, they are public and accessible to anyone.

While the data is not in the most consumable format (an API endpoint would be ideal), you can certainly build a tool on top of it. In fact, an API on top of this data used to be available on Labs: https://phabricator.wikimedia.org/T116363

The files at https://dumps.wikimedia.org/other/mediacounts/daily/2018/ provide views per image file, they are public and accessible to anyone.

While the data is not in the most consumable format (an API endpoint would be ideal), you can certainly build a tool on top of it. In fact, an API on top of this data used to be available on Labs: https://phabricator.wikimedia.org/T116363

So those dumps do include views (or loads) for still images? If they do, I don't believe @Harej's API is serving that data.

Harej added a comment. Oct 11 2018, 6:03 PM

The dumps include data for still images. Due to resource limitations I don’t include that data, though I did try. At some point my duct tape-based API should be replaced with something more production-scale. (Currently I serve data out of Redis, which is a delight to work with but is very expensive to scale.)

So those dumps do include views (or loads) for still images?

Yes. The docs explain it all: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts

jmatazzoni updated the task description. Oct 16 2018, 6:52 PM
Nuria added a comment. Oct 16 2018, 6:56 PM

Workarounds: Avg. daily pageviews to uploaded files:

I want to highlight that no workaround is needed; downloads of image files are available without having to use any workaround based on pageview data.

jmatazzoni added a comment (edited). Oct 16 2018, 6:56 PM

@Mooeypoo, three questions:

  1. Have we determined for a fact that we can't, with reasonable effort, get the cumulative pageview numbers because of the difficulties described in the Description? (In particular, please see Nuria's suggested approach)
  2. If so, should I just rewrite this ticket so that I call it "Avg. daily pageviews to uploaded files" (see methodology proposed under Workarounds, in the Description)?
  3. If we redefine this as "Avg. daily pageviews," can I put it in for estimation?
Nuria added a comment. Oct 16 2018, 6:58 PM

Have we determined for a fact that we can't, with reasonable effort, get the cumulative pageview numbers because of the difficulties described in the Description?

Cumulative pageview numbers are available for image files here: https://dumps.wikimedia.org/other/mediacounts/daily/2018/

@Nuria Hi. As I understand it, we want to access this data from ToolForge. Parsing dumps is an option, but it's not really straightforward as we would want to combine this data with other SQL queries we run. Is an API for this data anywhere on the roadmap? Thanks. :)

Nuria added a comment. Oct 16 2018, 7:22 PM

Hi. As I understand it, we want to access this data from ToolForge. Parsing dumps is an option, but it's not really straightforward as we would want to combine this data with other SQL queries we run.

You just have to load the data into a data store on the cloud platform. Given the way counts are reported, that should not be hard to do; in fact, something to that effect is what @Harej had done earlier. So you are not parsing dumps; rather, you have ingested them into storage and you are querying that storage. There is probably some code from @Harej you can reuse. It is some work to do this, but (I think) not a lot.
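
A sketch of what that ingest-then-query approach could look like, again in Python, with SQLite standing in for whatever data store the tool would actually use. It assumes the same dump layout as the sketch earlier in this task (file path in column 0, total transfer count in column 2), which should be verified against the mediacounts docs; names are illustrative.

```python
import bz2
import csv
import sqlite3

def ingest_daily_dump(db_path, dump_path, day):
    """Load one daily mediacounts dump into a local SQLite store."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS mediacounts
                   (day TEXT, file_path TEXT, transfers INTEGER)""")
    with bz2.open(dump_path, "rt", newline="") as fh:
        rows = ((day, r[0], int(r[2]))
                for r in csv.reader(fh, delimiter="\t") if r)
        con.executemany("INSERT INTO mediacounts VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

def total_transfers(db_path, file_path, start_day, end_day):
    """Sum a file's transfer counts over a date range from the local store."""
    con = sqlite3.connect(db_path)
    (total,) = con.execute(
        "SELECT COALESCE(SUM(transfers), 0) FROM mediacounts "
        "WHERE file_path = ? AND day BETWEEN ? AND ?",
        (file_path, start_day, end_day)).fetchone()
    con.close()
    return total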

Looking at the most recent day available, find the articles—on all wikis—on which the images are placed,

I *think* (and I might be wrong on this) that having an accurate historic snapshot of this information (which is what you would need to report pageviews to images using the article as a proxy) is harder to do than the ingestion of dumps into storage described above.

Is an API for this data anywhere on the roadmap?

Yes it is, but it will not happen this quarter, our upcoming API releases are around editing data.

Mooeypoo added a comment (edited). Oct 16 2018, 7:50 PM

Let me try and provide some clarity from the technical perspective of the team, and see if that makes more sense:

  • There's no doubt that gathering the true, precise, and accurate views on the file itself is the ideal here. However, because there's no API endpoint for this, it would mean we technically need to do a lot of relatively hard work -- either build an API, or build a system that fetches those files into storage and parses them.
  • Since the product already has a fairly demanding spec relative to our resources, alongside another project we're doing, the above is quite a lot of work and is probably not feasible given the current spec, the resources allocated, the load-balancing issues we are already dealing with, and our deadlines.
  • Fetching a metric about page views for the pages the file is embedded in is technically easy. It is inaccurate to a LARGE extent, because it does not take into account when the file was added to the article, but it can give users a general sense of where the files their participants have uploaded currently live (and I mean that strictly: currently as of the day of the data fetch). This is inaccurate, and if used, should be explained properly. However, from conversations with @jmatazzoni it sounded like, while it's not ideal, it can answer the need.
  • It follows, then, that we have three main options here:
    1. We prioritize this feature as accurate, concentrate on providing a proper solution with an API we can consume, and therefore drop a bunch of other features from the roadmap.
    2. We accept a flawed metric that has caveats but that gives a good general sense of the achievement to the users; we can then plan to do a better representation of the metric as future followup work if the product is prioritized so.
    3. We drop this feature until an API is available. This seemed to be a non-option for the stakeholders.

This summarizes the concerns here. As far as I understand, #2 is what the product owners (@jmatazzoni can answer here) chose as the most feasible option. If that's the case, we need to make sure we're being very clear about what the data actually represents, because it won't show the pure and total number of views for the file.

Thanks @Mooeypoo, that sounds about right. It would be great if we could just get the number, but it sounds like building the API is a pretty big job. So we will have to wait for that. Meanwhile, the avg. daily number is imperfect but should be sufficient for our purposes in the interim.

Shall I just rewrite this ticket around the average-daily metric, and split off a different ticket for someday when the API is built?

Nuria added a comment. Oct 16 2018, 8:27 PM

Good summary.

we need to make sure we're being very clear about what the data actually represents, because it won't show the pure and total number of views for the file.

Right; as files get added to and removed from articles, the views might change significantly over short timespans. Example: if file1 gets added to article1, which gets 1000 pageviews a minute, the event organizer will see large numbers of views for file1. If a couple of days later file1 is removed from article1, the organizer will see the number of views for file1 drop drastically, and the prior number will no longer be visible or even accessible. This has the potential to be very confusing.

T88775 is the task we need to finish to make mediacounts available on the Pageview API. This would indeed be relatively hard to build from scratch on the cloud platform, and I would go so far as to rule out that possibility. We already have the Pageview API and it's relatively easy to load new data into it, as long as it doesn't pose a huge scalability concern. Now, as far as I can tell, this data is relatively small and well aggregated if we only care about daily or monthly totals. So, new proposal:

I can mentor someone through the process of adding this endpoint, and I feel somewhat confident that it could be done by the end of this quarter, and definitely by the end of next quarter. Here's the technology that would be involved, and I can help with all of it:

@MusikAnimal @Mooeypoo: Nuria continues to argue that the proposed workaround for "Avg. daily pageviews to uploaded files" is going to be misleading and that, at a minimum, we should find out just how misleading it is. I'm wondering how hard that would be to do.

We have the data dumps referenced above, right? Would it be possible, for example, to do something like this, as a test:

  1. Take a few categories, large and small (e.g., this and this and this)
  2. Pick a day (like yesterday)
  3. Use the dumps to determine the ACTUAL number of pageviews over the last 30 days and divide by 30 to get an average.
  4. Then use our proposed method to get the APPROXIMATE avg. daily pageviews
  5. Then compare, to get a sense of what type of deviation we're looking at.

Can we do that?
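
For concreteness, a rough sketch of how that comparison might be scripted. It leans on the helpers sketched earlier in this task (`avg_daily_views` for the approximate proxy metric and `total_transfers` for dump-based actuals) plus a hypothetical mapping from file titles to their upload paths; only the Commons categorymembers API call is a real endpoint.

```python
import datetime
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def files_in_category(category):
    """List file titles in a Commons category (single API page of results, for brevity)."""
    params = {"action": "query", "format": "json", "list": "categorymembers",
              "cmtitle": category, "cmtype": "file", "cmlimit": "max"}
    data = requests.get(COMMONS_API, params=params).json()
    return [m["title"] for m in data["query"]["categorymembers"]]

def compare(category, db_path, upload_path_for_title):
    """Return (actual, approximate) average daily figures for one category of files.

    total_transfers() and avg_daily_views() are the helpers sketched in earlier
    comments; upload_path_for_title maps "File:..." titles to upload paths.
    """
    files = files_in_category(category)
    end = datetime.date.today() - datetime.timedelta(days=1)
    start = end - datetime.timedelta(days=29)
    # ACTUAL: mean daily transfers of the files themselves, from the ingested dumps.
    actual = sum(
        total_transfers(db_path, upload_path_for_title[f],
                        start.isoformat(), end.isoformat())
        for f in files) / 30
    # APPROXIMATE: the proposed proxy, daily views of the articles embedding the files.
    approximate = avg_daily_views(files)
    return actual, approximate
```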

jmatazzoni updated the task description. Oct 26 2018, 9:09 PM
jmatazzoni added a comment (edited). Oct 26 2018, 9:16 PM

I've rewritten the Description for this task so that it uses the flawed but, we think, reasonable "avg. daily pageviews" workaround. With that, I'm putting it in for estimation.

jmatazzoni updated the task description. Oct 26 2018, 9:20 PM
jmatazzoni renamed this task from Put 'Views to uploaded files' metric into 'Event Summary' reports to Put 'Avg. daily views to files uploaded' metric into 'Event Summary' reports. Oct 26 2018, 9:25 PM
jmatazzoni updated the task description. Oct 30 2018, 11:55 PM
jmatazzoni updated the task description. Oct 31 2018, 12:01 AM
jmatazzoni set the point value for this task to 5.

In an attempt to save some pain that T206700#4692752 may inadvertently cause, if you need to do one-off analysis using the mediacounts dataset, you can query the Hive table wmf.mediacounts (please ping me and I'm happy to show you how). You don't need to download and trudge through the dumps for one-off analysis.

In an attempt to save some pain that T206700#4692752 may inadvertently cause, if you need to do one-off analysis using the mediacounts dataset, you can query the Hive table wmf.mediacounts (please ping me and I'm happy to show you how). You don't need to download and trudge through the dumps for one-off analysis.

It won't be one-off, but if we could programmatically query the Hive database that'd be pretty neat! That's assuming the response time is suitable for automation.

Not everyone here has access to the analytics cluster, but it's safe to assume only staff will have access to the VPS instance, if that means anything.

In an attempt to save some pain that T206700#4692752 may inadvertently cause, if you need to do one-off analysis using the mediacounts dataset, you can query the Hive table wmf.mediacounts (please ping me and I'm happy to show you how). You don't need to download and trudge through the dumps for one-off analysis.

It won't be one-off, but if we could programmatically query the Hive database that'd be pretty neat! That's assuming the response time is suitable for automation.

Not everyone here has access to the analytics cluster, but it's safe to assume only staff will have access to the VPS instance, if that means anything.

The investigation that I was referring to, and that @jmatazzoni was suggesting, was a one-off. We don't allow querying the Hive database programmatically; it's too big and complicated a system to depend on from a front-end. For programmatic querying, we have another task open, to load the mediacounts dataset into the Pageview API. I urge someone on your team to work with us to get that done. It's really quite easy and would be a fun cross-team collaboration. I can walk whoever works on it through every step.

For programmatic querying, we have another task open, to load the mediacounts dataset into the Pageview API. I urge someone on your team to work with us to get that done. It's really quite easy and would be a fun cross-team collaboration. I can walk whoever works on it through every step.

That sounds great! I'd love to help :) Are there some example patches I can go off of? Also I assume I'll have no way of testing that it works.

@MusikAnimal: the task is T88775. If you want to do it together, the example is just the existing API collection. Let's go over the pieces of AQS that you'd need to gently copy/paste, with the unique devices module as an example:

https://github.com/wikimedia/analytics-aqs

To define the endpoints, you'd need a file like:

https://github.com/wikimedia/analytics-aqs/blob/master/sys/unique-devices.yaml, and this is configured in-depth for public consumption here: https://github.com/wikimedia/analytics-aqs/blob/master/v1/unique-devices.yaml

And to define the logic behind those endpoints, you need something like this:

https://github.com/wikimedia/analytics-aqs/blob/master/sys/unique-devices.js

And sprinkle some tests, load data, stuff that I'll mostly do, and you're done! See? Not so bad.

Nuria added a comment. Nov 7 2018, 7:23 PM

@MusikAnimal, so we are clear, @Milimetric's guidelines above are for starting to develop an API over the data we have on Hive, from which the files are derived.

For a one-off calculation of how "approximate" this "approximation" is, that is not needed.

aezell added a comment. Nov 8 2018, 9:51 PM

Thanks for the clarification @Nuria.

@MusikAnimal I think we should do the one-off to get a ballpark sense of the numbers involved and then make sure @jmatazzoni is open to our investing the time to build the API. We need to ensure it fits in scope-wise with the rest of the project.

I see, we want to do a one-off to find out if this metric is meaningful enough for us to include? I've no problem with that. I was only clarifying that for Event Metrics it's something we'd want to query programmatically.

I'm not necessarily committing CommTech time to building the API, for the record :) Assuming it's "easy" as stated I think it'd be a fun side project. The API is something I'd make use of in https://tools.wmflabs.org/mediaviews/ and XTools, and obviously Event Metrics is a candidate as well.

Nuria added a comment (edited). Nov 8 2018, 10:09 PM

@MusikAnimal understood; just get your credentials to query the cluster. To see how "approximate" this approximation is, rather than looking at the raw data files, you can query the tables that already host the same data from which the files are derived.

I have heard skepticism expressed re. the idea that creating the API will be easy. That is something you engineers need to discuss among yourselves.

The one-off test is designed to find out whether the "Avg. daily views to files uploaded" workaround metric is valid. If we thought that we could get the actual cumulative pageviews number via the API with a level of effort similar to the workaround concept, then the one-off test would be unnecessary. But if we're not sure about that, then the one-off test would give us a valuable data point. How time-consuming is it to do the test I outline above?

MusikAnimal added a comment (edited). Nov 9 2018, 6:22 PM

Indeed, sorry I missed the comment about the test! I have access to stat1007 so I think I can do this. Unless you kind analytics folks have a better idea, I think we're looking at some very tedious work, since I have to query for each file individually. @jmatazzoni Would a tiny tiny category not suffice? https://commons.wikimedia.org/wiki/Category:Photographs_in_the_Museum_of_Fine_Arts,_Houston seems like the only feasible one to do by hand that you mentioned, but it's still 100+ images. Lots of busy work!

@MusikAnimal I did not know you had to do each file individually. Yes, of course, a smaller category will be fine.

Nuria added a comment. Nov 9 2018, 6:45 PM

@MusikAnimal
All the data is available in the Hive table mediacounts; you can hit as many files as needed with a Hive SQL query. See:

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts#Analytics_cluster
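
A sketch of the kind of query that suggests, wrapped so it could be run from a stats host. The column names (`base_name`, `total`) and the partition fields (`year`, `month`) are assumptions taken from the linked wikitech page, and the file path is made up; check the docs before running anything like this.

```python
import subprocess

# mediacounts keys files by their upload.wikimedia.org path, not by page title (hypothetical path).
FILES = ["/wikipedia/commons/a/a9/Example.jpg"]

query = """
SELECT base_name, SUM(total) AS transfers
FROM wmf.mediacounts
WHERE year = 2018 AND month = 10    -- partition pruning keeps the scan cheap
  AND base_name IN ({files})
GROUP BY base_name
""".format(files=", ".join("'{}'".format(f) for f in FILES))

# On an analytics host, `hive -e` (or beeline) runs a single query from the shell.
result = subprocess.run(["hive", "-e", query], capture_output=True, text=True)
print(result.stdout)
```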

@MusikAnimal, echoing Nuria here, definitely ping me on IRC before you do anything as manual as that :) This is easy to query in Hive, and I'm here to help demystify Hadoop; just ping me.

jmatazzoni renamed this task from Put 'Avg. daily views to files uploaded' metric into 'Event Summary' reports to Createa method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports. Nov 28 2018, 7:53 PM
jmatazzoni renamed this task from Createa method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports to Create a method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports.
Mooeypoo updated the task description. Fri, Jan 11, 10:45 PM
Mooeypoo updated the task description. Fri, Jan 11, 10:49 PM
MusikAnimal renamed this task from Create a method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports to Create a method for putting 'Avg. daily views to pages that have files uploaded' metric into 'Event Summary' reports. Wed, Jan 16, 12:16 AM
Nuria added a comment. Wed, Jan 16, 4:01 PM

Before advertising this metric, it would be good to quantify how close this approximation is to the "real" number; that can easily be done by comparing the results of this calculation with the data in the mediacounts table on Hive.