Organizers, their sponsors and partners want to understand the impact of their work. One main way to do this for files uploaded is to see the number of pageviews those files get on the various article pages to which they are added.
In our discussions, however, it appears there are some problems that may stop us from getting the cumulative pageviews figure we want here (see Issues and approaches, below). The goal of this task, then, is to figure out how to provide the best metric we can and to add it, as a start, to the Event Summary reports (T205561 and T206692).
=Parameters for the metric
- **All filetypes:** The figure will track images, video files, audio files and other upload types.
- **Uploads to Commons and Wikipedias:** Previous stats have tracked uploads to Commons only, but it is not unusual for users to upload directly to a Wikipedia. So we will track uploads to all wikis specified for the event and include those in the metric.
- **Pageviews on all wikis (not just those specified):** The hypothesis here is that over time, uploaded images will spread to more and more articles and, as articles are translated, to more and more wikis. We want to gauge the full impact of the upload, so it would defeat our purpose to count pageviews only on the wikis specified for the event.
=Issues and approaches
- **What date was the image added to a page?** We are able to provide a figure for the number of "Pages with uploaded files." So far so good. To have an accurate picture of how many pageviews the image has received, however, we need to know what date the image was added to each page it is on. Apparently, this date is not recorded.
- **The data is in the Data Lake, but...** The problem mentioned above would be irrelevant if we could simply get a count of how many times the image was requested. This number is apparently available in a stream called "[[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts | mediacounts ]]" that's in the Data Lake. But there is no easy way for us to get the information out of the Lake, and the Data Lake contains personal information that may make accessing it from Event Metrics' home on Toolforge difficult.
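To illustrate why the missing date matters, here is a minimal Python sketch with invented numbers. The function name `views_since_added` and the sample data are hypothetical; they only show how strongly a cumulative total depends on the date the image was added to the page.

```python
from datetime import date

def views_since_added(daily_views, added_on):
    """Sum pageviews for days on or after the date the image was added."""
    return sum(v for day, v in daily_views.items() if day >= added_on)

# Invented daily pageviews for one article (not real data).
daily_views = {
    date(2018, 10, 1): 100,
    date(2018, 10, 2): 120,
    date(2018, 10, 3): 80,
    date(2018, 10, 4): 200,
}

# The same article gives very different totals depending on the add date.
total_if_added_oct_1 = views_since_added(daily_views, date(2018, 10, 1))  # 500
total_if_added_oct_4 = views_since_added(daily_views, date(2018, 10, 4))  # 200
```

Without a recorded add date, there is no way to pick the right starting day, so any cumulative figure could be overstated by an unknown amount.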
=Workarounds
- **Avg. daily pageviews to uploaded files:** We'd like to provide a cumulative pageview count. But the problems above suggest that this will be difficult or impossible. Failing that, a figure for "Avg. daily pageviews to uploaded files" would provide partners and organizers with a general sense of the magnitude of their audience and would be an acceptable figure.
- To make this metric as valid as possible by smoothing out daily or weekly fluctuations, I propose we do the following:
# Looking at the most recent 24-hour period available, find the articles (on all wikis) on which the images are placed.
# Get the pageview count for all those articles over the past 30 days (it's OK that not all the images will have actually been on all those pages during that entire period).
# Average that 30-day figure and express it as a daily average.
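The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the Event Metrics implementation: the function name `avg_daily_pageviews`, the input shape (a mapping from a (wiki, article) pair to a list of daily counts), and the sample numbers are all hypothetical. Fetching the article list and the daily counts is left out.

```python
def avg_daily_pageviews(views_by_page, days=30):
    """Return the average daily pageviews across all pages over `days` days.

    views_by_page maps a (wiki, article) pair to a list of daily pageview
    counts for the window. Missing days are treated as zero views.
    """
    total = 0
    for counts in views_by_page.values():
        total += sum(counts[-days:])  # keep at most the last `days` entries
    return total / days

# Hypothetical articles that currently use the event's uploaded files.
views_by_page = {
    ("en.wikipedia", "Example_article"): [10] * 30,
    ("fr.wikipedia", "Article_exemple"): [5] * 30,
}

result = avg_daily_pageviews(views_by_page)  # (300 + 150) / 30 = 15.0
```

Keying the input by (wiki, article) means an article that uses several of the event's files is counted once. That seems to match the intent of step 2, though the task does not say so explicitly.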