Create a method for putting 'Avg. daily views to pages that have files uploaded' metric into 'Event Summary' reports
Open, Needs Triage · Public · 5 Story Points

Description

Implementation details

Create a way to fetch the number of views to articles that embed files uploaded during the event, and display the result as 'average daily views'.

The query should:

  • Fetch all files that were uploaded to Commons and to the local wikis specified in the event's settings (the system currently only considers uploads to Commons; this new method needs to add the ability to fetch the list of files uploaded to the local wikis too)
  • For each of those uploaded files, get the articles they are used/embedded in. The article list should be de-duplicated (even if two of the uploaded files are embedded in the same article, that article should be counted once)
  • For each of those articles, fetch the monthly page views.
    • For articles that are older than 30 days, divide the monthly page views by 30.
    • For articles that are "younger" than 30 days (were created less than 30 days ago), divide the monthly number by the number of days the article has existed.
  • Sum all the individual per-article daily averages into one figure and return it (a sketch of one possible approach follows this list)
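
Below is a minimal sketch, in Python, of how this could be wired together. It is not the Event Metrics implementation: the function names and the idea of passing in an already-gathered list of uploaded file titles are assumptions, and the GlobalUsage lookup only covers the Commons case (files uploaded only to a local wiki would need that wiki's imageusage API instead). The Commons API and Wikimedia Pageview REST API endpoints used here are real.

```python
import datetime
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
PAGEVIEWS = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
             "{project}/all-access/user/{title}/daily/{start}/{end}")

def pages_embedding(file_title):
    """Yield (wiki, page_title) for every page that embeds the file, on any wiki.

    Uses the GlobalUsage API on Commons; titles must include the "File:" prefix.
    """
    params = {"action": "query", "format": "json", "prop": "globalusage",
              "titles": file_title, "gulimit": "max"}
    data = requests.get(COMMONS_API, params=params).json()
    for page in data["query"]["pages"].values():
        for usage in page.get("globalusage", []):
            yield usage["wiki"], usage["title"]

def avg_daily_views(file_titles, today=None):
    """Sum per-article average daily views for all pages embedding the given files."""
    today = today or datetime.date.today()
    start = (today - datetime.timedelta(days=30)).strftime("%Y%m%d")
    end = today.strftime("%Y%m%d")
    # De-duplicate: an article embedding two of the event's files is counted once.
    pages = {p for title in file_titles for p in pages_embedding(title)}
    total = 0.0
    for wiki, page_title in pages:
        url = PAGEVIEWS.format(project=wiki, title=page_title.replace(" ", "_"),
                               start=start, end=end)
        items = requests.get(url).json().get("items", [])
        views = sum(item["views"] for item in items)
        # Days with pageview data serve as a rough proxy for article age here;
        # a real implementation would check the page creation date instead.
        days = min(30, len(items)) or 1
        total += views / days
    return total

# Example call (hypothetical file names):
# print(avg_daily_views(["File:Example_event_photo_1.jpg", "File:Example_event_photo_2.jpg"]))
```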

Deeper dive

Organizers, their sponsors and partners want to understand the impact of their work. One main way to do this for files uploaded is to see the number of pageviews those files get on the various article pages to which they are added. This figure will be reported in the Event Summary reports (T205561 and T206692); it will also be used in the to-be-defined Files Uploaded report.

In our discussions, it has become clear that we can't get an accurate cumulative pageviews figure (see Problems and Alternative Approaches, below). So instead, we will be providing a figure for "average daily pageviews".

Definition and parameters

  • All filetypes: The figure will track images, video files, audio files and other upload types.
  • Uploads to Commons and local Wikipedias: Previous stats have tracked uploads to Commons only, but it is not unusual for users to upload directly to a Wikipedia. So we will track uploads to all wikis specified for the event and include those in the metric.
  • Pageviews on all wikis (not just those specified): The hypothesis here is that over time, images uploaded will spread to more and more articles and, as articles are translated, more and more wikis. We want to gauge the full impact of the upload, so it would be antithetical to our purpose to count pageviews only on the wikis specified for the event.
  • Method: To make this metric as valid as possible by smoothing out daily or weekly fluctuations, I propose we do the following (a short worked example follows the list):
    1. Looking at the most recent day available, find the articles—on all wikis—on which the images from the event have been placed,
    2. Get the pageview count for all those articles over the past 30 days (it's OK that not all the images will have actually been on all those pages during that entire period).
    3. Average that 30-day figure and express it as a daily average.
    4. If page creation date is < 30 days ago, use the number of available days and average by that number.
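
To make the arithmetic concrete (the numbers below are made up): suppose the event's images ended up on three articles. Article A, created long ago, got 3,000 views over the past 30 days and contributes 3000 / 30 = 100; Article B got 600 views and contributes 600 / 30 = 20; Article C was created 12 days ago and got 240 views, so it is averaged over 12 days and contributes 240 / 12 = 20. The reported figure would be 100 + 20 + 20 = 140 average daily pageviews.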

Problems and Alternative Approaches

  • What date was the image added to a page? We are able to provide a figure for the number of "Pages with uploaded files." So far so good. To have an accurate picture of how many pageviews the image has received, however, we need to know what date the image was added to each page it is on. Apparently, this date is not recorded.
  • The data is in the Data Lake, but... The problem mentioned above would be irrelevant if we could simply get a count for how many times the image was requested. This number is apparently available in a stream called "mediacounts" that's in the Data Lake. But there is no easy way for us to get the information out of the Lake and into Tool Forge at scale. An API for this is planned, possibly for some time in the next year (T88775).
jmatazzoni updated the task description. Oct 10 2018, 10:19 PM

@Nettrom @Nuria @MusikAnimal do you have any insights on how we might best arrive at the metric we're looking for here? In particular, is what I say above under "Problems and Alternative Approaches" correct? I.e., is there no painless way to either keep track of the dates when images land on articles or to get the "mediacount" number we want?

This number is is apparently available in a stream called "mediacounts that's in the Data Lake [..] and the Data Lake contains personal information that may make accessing it from Event Metrics' home on Tool Forge.

I think you got some wrong cut and paste here. mediacounts are available in public files downloadable by everyone: https://dumps.wikimedia.org/other/mediacounts/daily/2018/
Mediacounts count how many times a particular file has been transferred to users (not played); figures are provided per file, not per wiki. Each file is about 2 GB decompressed.
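
For anyone who wants to poke at those dumps, here is a rough sketch in Python of pulling a few files' counts out of one daily dump. The column layout is an assumption taken from the mediacounts documentation on wikitech (file path in the first column, total transfer count in the third), and the file and URL names are illustrative, so verify against the docs before relying on this.

```python
import bz2
import csv

# Daily dumps are published as bzip2-compressed TSVs; the file name pattern below is illustrative.
DUMP = "mediacounts.2018-10-15.v00.tsv.bz2"

# Files are keyed by their upload.wikimedia.org path, not by page title (hypothetical path).
WANTED = {"/wikipedia/commons/a/a9/Example.jpg"}

def transfer_counts(dump_path, wanted):
    """Return {file_path: count} for the wanted files in one daily dump.

    Assumes column 0 is the file path and column 2 the total transfer count,
    per the wikitech mediacounts docs; check the docs before trusting this.
    """
    counts = {}
    with bz2.open(dump_path, "rt", newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if row and row[0] in wanted:
                counts[row[0]] = int(row[2])
    return counts

print(transfer_counts(DUMP, WANTED))
```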

do you have any insights on how we might best arrive at the metric we're looking for here?

There is a big difference between video/audio files and images. You might be able to tell whether the user has loaded the video but -unless this is instrumented in EventLogging- there is no way to tell whether it was played. An image, however, once viewed is "consumed". So really the notion of a "pageview" needs to be further defined for videos and audio, as you can download the content in your browser w/o consuming it at all.

Nettrom edited subscribers, added: nettrom_WMF; removed: Nettrom. Oct 10 2018, 11:29 PM
jmatazzoni added a comment (edited). Oct 11 2018, 1:39 AM

In T206700#4656851, @Nuria wrote:

This number is is apparently available in a stream called "mediacounts that's in the Data Lake [..] and the Data Lake contains personal information that may make accessing it from Event Metrics' home on Tool Forge.

I think you got some wrong cut and paste here. mediacounts are available in public files downloadable by everyone: https://dumps.wikimedia.org/other/mediacounts/daily/2018/
Mediacounts count how many times a particular file has been transferred to users (not played); figures are provided per file, not per wiki. Each file is about 2 GB decompressed.

do you have any insights on how we might best arrive at the metric we're looking for here?

There is a big difference between video/audio files and images. You might be able to tell whether the user has loaded the video but -unless this is instrumented in EventLogging- there is no way to tell whether it was played. An image, however, once viewed is "consumed". So really the notion of a "pageview" needs to be further defined for videos and audio, as you can download the content in your browser w/o consuming it at all.

Actually Nuria, the issue here is not plays. It is views. We want to know how many times uploaded files—primarily images—were viewed on the various articles to which they were added (at different times).

The question of plays for media files will be addressed elsewhere. As I say, images are the primary interest here. If we could get pageviews for images, then I'd be happy to leave video and audio files out of this metric entirely. My understanding from @Catrope and @Neil_P._Quinn_WMF is that such data for the image files themselves is in the Data Lake. But they saw no way to access it on request from Tool Forge (Roan and Neil will correct me if I misstated their comments).

jmatazzoni updated the task description. Oct 11 2018, 1:43 AM

I think we need to establish exactly what data you're looking for first. Sounds like @Neil_P._Quinn_WMF could do that. But if he said the data's in the Data Lake, it sounds like you might need to join pageviews data to metadata about the media files themselves. These are just articles on Commons, so we could theoretically create a stream or hourly-updated dataset with this information. But we probably wouldn't build single-purpose infrastructure for just this. Let's wait until we have the specifics and we can brainstorm the best way forward.

Nuria added a comment. Oct 11 2018, 3:01 AM

We want to know how many times uploaded files—primarily images—were viewed on the various articles to which they were added (at different times).

The files at https://dumps.wikimedia.org/other/mediacounts/daily/2018/ provide views per image file, they are public and accessible to anyone.

While the data is not in the most consumable format (an API endpoint would be ideal), you can certainly build a tool on top of it. In fact, an API on top of this data used to be available on Labs: https://phabricator.wikimedia.org/T116363

The files at https://dumps.wikimedia.org/other/mediacounts/daily/2018/ provide views per image file, they are public and accessible to anyone.

While the data is not in the most consumable format (an API endpoint would be ideal), you can certainly build a tool on top of it. In fact, an API on top of this data used to be available on Labs: https://phabricator.wikimedia.org/T116363

So those dumps do include views (or loads) for still images? If they do, I don't believe @Harej's API is serving that data.

Harej added a comment. Oct 11 2018, 6:03 PM

The dumps include data for still images. Due to resource limitations I don’t include that data, though I did try. At some point my duct tape-based API should be replaced with something more production-scale. (Currently I serve data out of Redis, which is a delight to work with but is very expensive to scale.)

So those dumps do include views (or loads) for still images?

Yes. The docs explain it all: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts

jmatazzoni updated the task description. Oct 16 2018, 6:52 PM
Nuria added a comment. Oct 16 2018, 6:56 PM

Workarounds: Avg. daily pageviews to uploaded files:

I want to highlight that no workaround is needed; downloads of image files are available without having to use any workaround based on pageview data.

jmatazzoni added a comment (edited). Oct 16 2018, 6:56 PM

@Mooeypoo, three questions:

  1. Have we determined for a fact that we can't, with reasonable effort, get the cumulative pageview numbers because of the difficulties described in the Description? (In particular, please see Nuria's suggested approach)
  2. If so, should I just rewrite this ticket so that I call it "Avg. daily pageviews to uploaded files" (see methodology proposed under Workarounds, in the Description)?
  3. If we redefine this as "Avg. daily pageviews," can I put it in for estimation?
Nuria added a comment. Oct 16 2018, 6:58 PM

Have we determined for a fact that we can't, with reasonable effort, get the cumulative pageview numbers because of the difficulties described in the Description?

Cumulative pageview numbers are available for image files here: https://dumps.wikimedia.org/other/mediacounts/daily/2018/

@Nuria Hi. As I understand it, we want to access this data from ToolForge. Parsing dumps is an option, but it's not really straightforward as we would want to combine this data with other SQL queries we run. Is an API for this data anywhere on the roadmap? Thanks. :)

Nuria added a comment. Oct 16 2018, 7:22 PM

Hi. As I understand it, we want to access this data from ToolForge. Parsing dumps is an option, but it's not really straightforward as we would want to combine this data with other SQL queries we run.

You just have to load the data into a data store on the cloud platform. Given the way counts are reported, that should not be hard to do; in fact, something to that effect is what @Harej had done earlier. So you are not parsing dumps; rather, you have ingested them into storage and you are querying that storage. There is probably some code from @Harej you can reuse. It is some work to do this, but (I think) not a lot.
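
A sketch of what that ingest-then-query approach could look like, again in Python, with SQLite standing in for whatever data store the tool would actually use. It assumes the same dump layout as the sketch earlier in this task (file path in column 0, total transfer count in column 2), which should be verified against the mediacounts docs; names are illustrative.

```python
import bz2
import csv
import sqlite3

def ingest_daily_dump(db_path, dump_path, day):
    """Load one daily mediacounts dump into a local SQLite store."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS mediacounts
                   (day TEXT, file_path TEXT, transfers INTEGER)""")
    with bz2.open(dump_path, "rt", newline="") as fh:
        rows = ((day, r[0], int(r[2]))
                for r in csv.reader(fh, delimiter="\t") if r)
        con.executemany("INSERT INTO mediacounts VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

def total_transfers(db_path, file_path, start_day, end_day):
    """Sum a file's transfer counts over a date range from the local store."""
    con = sqlite3.connect(db_path)
    (total,) = con.execute(
        "SELECT COALESCE(SUM(transfers), 0) FROM mediacounts "
        "WHERE file_path = ? AND day BETWEEN ? AND ?",
        (file_path, start_day, end_day)).fetchone()
    con.close()
    return total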

Looking at the most recent day available, find the articles—on all wikis—on which the images are placed,

I *think* (and I might be wrong on this) that having an accurate historic snapshot of this information (which is what you would need to report pageviews to images using the article as a proxy) is harder to do than the ingestion of dumps into storage described above.

Is an API for this data anywhere on the roadmap?

Yes it is, but it will not happen this quarter, our upcoming API releases are around editing data.

Mooeypoo added a comment (edited). Oct 16 2018, 7:50 PM

Let me try and provide some clarity from the technical perspective of the team, and see if that makes more sense:

  • There's no doubt that gathering the true, precise, and accurate views on the file itself is the ideal here. However, because there's no API endpoint for this, it would mean we technically need to do a lot of relatively hard work -- either build an API, or build a system that fetches those files into storage and parses them.
  • Since the product already has a fairly demanding spec relative to our resources, alongside another project we're doing, the above is quite a lot of work and is probably not feasible given the current spec, the resources allocated, the load-balancing issues we are already dealing with, and our deadlines.
  • Fetching a metric about page views for the pages the file is embedded in is technically easy. It is inaccurate to a LARGE extent, because it does not take into account when the file was added to the article, but it can give users a general sense of where the files their participants have uploaded currently live (and I mean that strictly: currently as of the day of the data fetch). This is inaccurate, and if used, should be explained properly. However, from conversations with @jmatazzoni it sounded like, while it's not ideal, it can answer the need.
  • It follows, then, that we have three main options here:
    1. We prioritize this feature as accurate, concentrate on providing a proper solution with an API we can consume, and therefore drop a bunch of other features from the roadmap.
    2. We accept a flawed metric that has caveats but that gives a good general sense of the achievement to the users; we can then plan to do a better representation of the metric as future followup work if the product is prioritized so.
    3. We drop this feature until an API is available. This seemed to be a non-option for the stakeholders.

This summarizes the concerns here. As far as I understand, #2 is what the product owners (@jmatazzoni can answer here) chose as the most feasible option. If that's the case, we need to make sure we're being very clear about what the data actually represents, because it won't show the pure and total number of views for the file.

Thanks @Mooeypoo, that sounds about right. It would be great if we could just get the number, but it sounds like building the API is a pretty big job. So we will have to wait for that. Meanwhile, the avg. daily number is imperfect but should be sufficient for our purposes in the interim.

Shall I just rewrite this ticket around the average-daily metric, and split off a different ticket for someday when the API is built?

Nuria added a comment. Oct 16 2018, 8:27 PM

Good summary.

we need to make sure we're being very clear about what the data actually represents, because it won't show the pure and total number of views for the file.

Right; as files get added to and removed from articles, the views might change significantly over short timespans. Example: if file1 gets added to article1, which gets 1000 pageviews a minute, the event organizer will see large numbers of views for file1. If a couple of days later file1 is removed from article1, the organizer will see the number of views for file1 drop drastically, and the prior number will no longer be visible or even accessible. This has the potential to be very confusing.

T88775 is the task we need to finish to make mediacounts available on the Pageview API. This would indeed be relatively hard to build from scratch on the cloud platform, and I would go so far as to rule out that possibility. We already have the Pageview API and it's relatively easy to load new data into it, as long as it doesn't pose a huge scalability concern. Now, as far as I can tell, this data is relatively small and well aggregated if we only care about daily or monthly totals. So, new proposal:

I can mentor someone through the process of adding this endpoint, and I feel somewhat confident that it could be done by the end of this quarter, and definitely by the end of next quarter. Here's the technology that would be involved, and I can help with all of it:

@MusikAnimal @Mooeypoo: Nuria continues to argue that the proposed workaround for "Avg. daily pageviews to uploaded files" is going to be misleading and that, at a minimum, we should find out just how misleading it is. I'm wondering how hard that would be to do.

We have the data dumps referenced above, right? Would it be possible, for example, to do something like this, as a test:

  1. Take a few categories, large and small (e.g., this and this and this)
  2. Pick a day (like yesterday)
  3. Use the dumps to determine the ACTUAL number of pageviews over the last 30 days and divide by 30 to get an average.
  4. Then use our proposed method to get the APPROXIMATE avg. daily pageviews
  5. Then compare, to get a sense of what type of deviation we're looking at.

Can we do that?
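
For concreteness, a rough sketch of how that comparison might be scripted. It leans on the helpers sketched earlier in this task (`avg_daily_views` for the approximate proxy metric and `total_transfers` for dump-based actuals) plus a hypothetical mapping from file titles to their upload paths; only the Commons categorymembers API call is a real endpoint.

```python
import datetime
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def files_in_category(category):
    """List file titles in a Commons category (single API page of results, for brevity)."""
    params = {"action": "query", "format": "json", "list": "categorymembers",
              "cmtitle": category, "cmtype": "file", "cmlimit": "max"}
    data = requests.get(COMMONS_API, params=params).json()
    return [m["title"] for m in data["query"]["categorymembers"]]

def compare(category, db_path, upload_path_for_title):
    """Return (actual, approximate) average daily figures for one category of files.

    total_transfers() and avg_daily_views() are the helpers sketched in earlier
    comments; upload_path_for_title maps "File:..." titles to upload paths.
    """
    files = files_in_category(category)
    end = datetime.date.today() - datetime.timedelta(days=1)
    start = end - datetime.timedelta(days=29)
    # ACTUAL: mean daily transfers of the files themselves, from the ingested dumps.
    actual = sum(
        total_transfers(db_path, upload_path_for_title[f],
                        start.isoformat(), end.isoformat())
        for f in files) / 30
    # APPROXIMATE: the proposed proxy, daily views of the articles embedding the files.
    approximate = avg_daily_views(files)
    return actual, approximate
```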

jmatazzoni updated the task description. Oct 26 2018, 9:09 PM
jmatazzoni added a comment (edited). Oct 26 2018, 9:16 PM

I've rewritten the Description for this task so that it uses the flawed but, we think, reasonable "avg. daily pageviews" workaround. With that, I'm putting it in for estimation.

jmatazzoni updated the task description. Oct 26 2018, 9:20 PM
jmatazzoni renamed this task from Put 'Views to uploaded files' metric into 'Event Summary' reports to Put 'Avg. daily views to files uploaded' metric into 'Event Summary' reports. Oct 26 2018, 9:25 PM
jmatazzoni updated the task description. Oct 30 2018, 11:55 PM
jmatazzoni updated the task description. Oct 31 2018, 12:01 AM
jmatazzoni set the point value for this task to 5.

In an attempt to save some pain that T206700#4692752 may inadvertently cause, if you need to do one-off analysis using the mediacounts dataset, you can query the Hive table wmf.mediacounts (please ping me and I'm happy to show you how). You don't need to download and trudge through the dumps for one-off analysis.

In an attempt to save some pain that T206700#4692752 may inadvertently cause, if you need to do one-off analysis using the mediacounts dataset, you can query the Hive table wmf.mediacounts (please ping me and I'm happy to show you how). You don't need to download and trudge through the dumps for one-off analysis.

It won't be one-off, but if we could programmatically query the Hive database that'd be pretty neat! That's assuming the response time is suitable for automation.

Not everyone here has access to the analytics cluster, but it's safe to assume only staff will have access to the VPS instance, if that means anything.

In an attempt to save some pain that T206700#4692752 may inadvertently cause, if you need to do one-off analysis using the mediacounts dataset, you can query the Hive table wmf.mediacounts (please ping me and I'm happy to show you how). You don't need to download and trudge through the dumps for one-off analysis.

It won't be one-off, but if we could programmatically query the Hive database that'd be pretty neat! That's assuming the response time is suitable for automation.

Not everyone here has access to the analytics cluster, but it's safe to assume only staff will have access to the VPS instance, if that means anything.

The investigation that I was referring to, and that @jmatazzoni was suggesting, was a one-off. We don't allow querying the Hive database programmatically; it's too big and complicated a system to depend on from a front-end. For programmatic querying, we have another task open, to load the mediacounts dataset into the Pageview API. I urge someone on your team to work with us to get that done. It's really quite easy and would be a fun cross-team collaboration. I can walk whoever works on it through every step.

For programmatic querying, we have another task open, to load the mediacounts dataset into the Pageview API. I urge someone on your team to work with us to get that done. It's really quite easy and would be a fun cross-team collaboration. I can walk whoever works on it through every step.

That sounds great! I'd love to help :) Are there some example patches I can go off of? Also I assume I'll have no way of testing that it works.

@MusikAnimal: the task is T88775. If you want to do it together, the example is just the existing API collection. Let's go over the pieces of AQS that you'd need to gently copy/paste, with the unique devices module as an example:

https://github.com/wikimedia/analytics-aqs

To define the endpoints, you'd need a file like:

https://github.com/wikimedia/analytics-aqs/blob/master/sys/unique-devices.yaml, and this is configured in-depth for public consumption here: https://github.com/wikimedia/analytics-aqs/blob/master/v1/unique-devices.yaml

And to define the logic behind those endpoints, you need something like this:

https://github.com/wikimedia/analytics-aqs/blob/master/sys/unique-devices.js

And sprinkle some tests, load data, stuff that I'll mostly do, and you're done! See? Not so bad.

Nuria added a comment. Nov 7 2018, 7:23 PM

@MusikAnimal, so we are clear, @Milimetric's guidelines above are for starting to develop an API over the data we have on Hive, from which the files are derived.

For a one-off calculation of how "approximate" this "approximation" is, that is not needed.

aezell added a comment. Nov 8 2018, 9:51 PM

Thanks for the clarification @Nuria.

@MusikAnimal I think we should do the one-off to get a ballpark sense of the numbers involved and then make sure @jmatazzoni is open to our investing the time to build the API. We need to ensure it fits in scope-wise with the rest of the project.

I see, we want to do a one-off to find out if this metric is meaningful enough for us to include? I've no problem with that. I was only clarifying that for Event Metrics it's something we'd want to query programmatically.

I'm not necessarily committing CommTech time to building the API, for the record :) Assuming it's "easy" as stated I think it'd be a fun side project. The API is something I'd make use of in https://tools.wmflabs.org/mediaviews/ and XTools, and obviously Event Metrics is a candidate as well.

Nuria added a comment (edited). Nov 8 2018, 10:09 PM

@MusikAnimal understood; just get your credentials to query the cluster. To see how "approximate" this approximation is, rather than looking at the raw data files, you can query the tables that already host the same data from which the files are derived.

I have heard skepticism expressed re. the idea that creating the API will be easy. That is something you engineers need to discuss among yourselves.

The one-off test is designed to find out whether the "Avg. daily views to files uploaded" workaround metric is valid. If we thought that we could get the actual cumulative pageviews number via the API with a level of effort similar to the workaround concept, then the one-off test would be unnecessary. But if we're not sure about that, then the one-off test would give us a valuable data point. How time-consuming is it to do the test I outline above?

MusikAnimal added a comment (edited). Nov 9 2018, 6:22 PM

Indeed, sorry I missed the comment about the test! I have access to stat1007 so I think I can do this. Unless you kind analytics folks have a better idea, I think we're looking at some very tedious work, since I have to query for each file individually. @jmatazzoni Would a tiny tiny category not suffice? https://commons.wikimedia.org/wiki/Category:Photographs_in_the_Museum_of_Fine_Arts,_Houston seems like the only feasible one to do by hand that you mentioned, but it's still 100+ images. Lots of busy work!

@MusikAnimal I did not know you had to do each file individually. Yes, of course, a smaller category will be fine.

Nuria added a comment. Nov 9 2018, 6:45 PM

@MusikAnimal
All the data is available in the Hive table mediacounts; you can hit as many files as needed with a Hive SQL query. See:

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts#Analytics_cluster
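
A sketch of the kind of query that suggests, wrapped so it could be run from a stats host. The column names (`base_name`, `total`) and the partition fields (`year`, `month`) are assumptions taken from the linked wikitech page, and the file path is made up; check the docs before running anything like this.

```python
import subprocess

# mediacounts keys files by their upload.wikimedia.org path, not by page title (hypothetical path).
FILES = ["/wikipedia/commons/a/a9/Example.jpg"]

query = """
SELECT base_name, SUM(total) AS transfers
FROM wmf.mediacounts
WHERE year = 2018 AND month = 10    -- partition pruning keeps the scan cheap
  AND base_name IN ({files})
GROUP BY base_name
""".format(files=", ".join("'{}'".format(f) for f in FILES))

# On an analytics host, `hive -e` (or beeline) runs a single query from the shell.
result = subprocess.run(["hive", "-e", query], capture_output=True, text=True)
print(result.stdout)
```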

@MusikAnimal, echoing Nuria here, definitely ping me on IRC before you do anything as manual as that :) This is easy to query in Hive, and I'm here to help demystify Hadoop; just ping me.

jmatazzoni renamed this task from Put 'Avg. daily views to files uploaded' metric into 'Event Summary' reports to Createa method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports. Nov 28 2018, 7:53 PM
jmatazzoni renamed this task from Createa method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports to Create a method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports.
Mooeypoo updated the task description. Fri, Jan 11, 10:45 PM
Mooeypoo updated the task description. Fri, Jan 11, 10:49 PM
MusikAnimal renamed this task from Create a method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports to Create a method for putting 'Avg. daily views to pages that have files uploaded' metric into 'Event Summary' reports. Wed, Jan 16, 12:16 AM
Nuria added a comment. Wed, Jan 16, 4:01 PM

Before advertising this metric, it would be good to quantify how close this approximation is to the "real" number; that can easily be done by comparing the results of this calculation with the data in the mediacounts table on Hive.