Statistics for views of individual Wikimedia images
Open, NormalPublic

Assigned To
None
Authored By
Krinkle, Nov 24 2018

Description

I'd really like to be able to see how often a contributed file on Wikimedia Commons was viewed, specifically for photographs and other images.

Source

The webrequest data collected in production with varnishkafka does already include all file urls from upload.wikimedia.org (for all file types, and all wikis).

It is aggregated for all file types by file name (e.g. hits on thumbnails and transcoded versions count toward the original). See also https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts and https://dumps.wikimedia.org/other/mediacounts/.
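As a rough illustration of what "count toward the original" means, here is a minimal Python sketch that folds thumbnail and transcoded URLs back onto the original file path before counting. The path layouts in the comments are an assumption about how upload.wikimedia.org URLs are structured, and `base_file_path` and `count_by_original` are hypothetical helper names; the real Mediacounts job handles more cases than this.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed path layouts on upload.wikimedia.org (not taken from this task):
#   original:   /<project>/<lang>/<h>/<hh>/<name>
#   thumbnail:  /<project>/<lang>/thumb/<h>/<hh>/<name>/<rendition>
#   transcoded: /<project>/<lang>/transcoded/<h>/<hh>/<name>/<rendition>
_DERIVED = re.compile(
    r"^/(?P<project>[^/]+)/(?P<lang>[^/]+)/(?:thumb|transcoded)/"
    r"(?P<h>[0-9a-f])/(?P<hh>[0-9a-f]{2})/(?P<name>[^/]+)/[^/]+$"
)


def base_file_path(url: str) -> str:
    """Map a thumbnail or transcoded URL to the path of its original file."""
    path = urlparse(url).path
    match = _DERIVED.match(path)
    if match is None:
        # Already an original, or a layout this sketch does not know about.
        return path
    return "/{project}/{lang}/{h}/{hh}/{name}".format(**match.groupdict())


def count_by_original(urls):
    """Count requests per original file, folding derived renditions in."""
    return Counter(base_file_path(u) for u in urls)
```

With this, a hit on `.../thumb/a/ab/Example.jpg/120px-Example.jpg` and a hit on `.../a/ab/Example.jpg` both land under `/wikipedia/commons/a/ab/Example.jpg`.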

Index and query

As far as I know, these are not currently indexed or made queryable in production. But if they were, it would make a very neat addition to the Analytics Query Service (AQS; the metrics at https://wikimedia.org/api/rest_v1/).

There is currently some level of aggregation happening in Toolforge somewhere, which is what powers https://tools.wmflabs.org/mediaviews/, but this seems currently limited to playable files only (e.g. file names ending with audio/video file extensions). At T149642#3028288, it was hinted at supporting images, but this doesn't currently seem to be the case. I'm not sure whether this is a filter or a technical limitation.

The aggregation is exposed at https://tools.wmflabs.org/mediaviews-api/api/2, but I couldn't find its source. It seems similar to https://github.com/harej/mediaplaycounts, although the code hosted there does appear to support images. That tool doesn't seem to be operational currently (it would presumably be at https://tools.wmflabs.org/mediaplaycounts/).

Frontend

There are a lot of ideas of how such an API could be used, including:

I think for the purposes of this task, any one of these would suffice to close, the rest can be done later.

Event Timeline

Krinkle created this task.Nov 24 2018, 1:54 AM
Restricted Application added subscribers: MusikAnimal, Aklapper. · View Herald TranscriptNov 24 2018, 1:54 AM

Thanks for creating this task!

There was talk of adding image mediacounts to the API at T206700. The actual task for this is at T88775, and a high-level overview of implementation details at T206700#4729565. I don't know if you're interested in helping build it, but when I have the time I thought I'd give it a go.

After the data is consumable via the API, I will happily integrate it into all the Pageviews apps :)

Will merge others into this, but keep in mind this nice analysis about the storage in Cassandra implications: T88775#4751882

Milimetric triaged this task as High priority.Nov 29 2018, 5:55 PM
Milimetric moved this task from Incoming to Analytics Query Service on the Analytics board.

Thanks for moving this to High priority @Milimetric. I see the title refers to "images" and "Commons." I'd like to ask:

  1. I assume that instead of images what we really mean here is "files." E.g., presumably this will also give us a count of pageviews (not plays) to video or audio files?
  2. My understanding is that it is actually pretty common for users to upload images, etc., directly to individual wikis. Can this track that as well?

To answer both of these questions, all files served on all wikis are first uploaded to upload.wikimedia.org. That's what's counted in the Mediacounts dataset, and I recommend a close reading of the first two small sections of https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts. There are corner cases listed there that explain more exactly how and what is being counted. But essentially, yes, images, audio, and movie files, counting the most common way that user agents would transfer them.

Tgr added a comment.Dec 1 2018, 12:29 AM

When using Media viewer to view images, some images are prefetched for better user experience, but need not yet have been shown to the user. Currently, those prefetched images are getting counted, as there is as of now no way to detect whether an image was actually shown to the user or not. (Analytics/Data Lake/Traffic/Mediacounts#Corner cases)

FWIW, there is a way to detect that - virtual media views (T89088) were developed for that specific purpose (and MediaViewer sets different headers, at least on reasonably modern browsers). It just wasn't implemented in the mediacounts logic.

Tgr added a comment.Dec 1 2018, 12:31 AM

Also, I wouldn't call them corner cases - if I remember the stats correctly, for file types supported by Media Viewer, preloads would comprise over half of the requests.


Thanks very much for this info, I think we missed this or deprioritized it after it was set up. I'll change the docs and track work to improve the data here: T211030

This is important but orthogonal to the work of building the endpoint; I think that should still go ahead.

Nuria added a comment.EditedDec 3 2018, 6:06 PM

@Tgr I imagine that special beacon was implemented in 2015 for 'virtual mediaviews' due to scaling concerns with the old eventlogging backend. Those concerns no longer exist and as such it makes sense to migrate counts to the current eventlogging infrastructure. This is so we do not have 3 ways to do the same thing (i.e. counting requests/pageviews/virtual pageviews/whatever event...).

Is there a team that now owns the mediacounts data and can take care of moving media counts to eventlogging beacon? Once that is done and we have checked on quality of data, putting an API on top of it is easy and something Analytics can do.

Tgr added a comment.Dec 3 2018, 8:14 PM

@Tgr I imagine that special beacon was implemented in 2015 for 'virtual mediaviews' due to scaling concerns with the old eventlogging backend.

That was one reason, also conceptually they are more like views than like events so we assumed it's easier for analysts to work with them if they end up in the webrequest table (which I guess was also more of a concern while EventLogging went into a different storage backend). You had the same conversation around Popups virtual views, I think.

Is there a team that now owns the mediacounts data and can take care of moving media counts to eventlogging beacon?

Not that I know. The potential candidates would be Multimedia, Reading Infrastructure and Analytics I suppose?

Nuria added a comment.Dec 3 2018, 9:16 PM

Reading Infrastructure and Analytics I suppose?

Our team can support any team in reading with migration of the beacon to EL infrastructure but I think it should be Reading driving the project: the teams instrumenting should be the teams that own the features as instrumentation needs to adapt and follow feature changes.

Still, if we got all the media requests from Media Viewer into EventLogging, we would not have all media requests for mediawiki in general. To do that, we'd have to go around instrumenting any place that fetches and renders media (in core, extensions, etc.). I think if we want thorough and sensical mediacounts until all that work happens, we need to handle the filtering on Hadoop.

Nuria added a comment.Dec 4 2018, 10:10 PM

@Milimetric Right. I see your point: there needs to be some parsing of the firehose of requests because not all media consumption can be "eventy-fied" (true for images if not for videos). That being said, my comment about migrating the beacon data to eventlogging still stands; we do not want to have two ways of doing the exact same thing.

jmatazzoni renamed this task from Statistics for views of individual Wikimedia Commons images to Statistics for views of individual Wikimedia images.Dec 5 2018, 5:43 PM

! In T210313#4787451, @Milimetric wrote:

To answer both of these questions, all files served on all wikis are first uploaded to upload.wikimedia.org. That's what's counted in the Mediacounts dataset...

Based on this answer, I've changed the name of the ticket, removing the word Commons.

Tgr added a comment.Jan 20 2019, 10:31 PM

The aggregation is exposed at https://tools.wmflabs.org/mediaviews-api/api/2, but I couldn't find its source. It seems similar to https://github.com/harej/mediaplaycounts, although the code hosted there does appear to support images. That tool doesn't seem to be operational currently (it would presumably be at https://tools.wmflabs.org/mediaplaycounts/).

At a glance harej/mediaplaycounts is the library for fetching media counts and harej/mediaplaycounts-app is the API built on it, which is exposed at https://tools.wmflabs.org/mediaviews-api/api/2. The API works fine with images.

... https://tools.wmflabs.org/mediaviews-api/api/2. The API works fine with images.

Are we sure? This is for January 20's Today's Featured Picture on enwiki (file lives on Commons): https://tools.wmflabs.org/mediaviews-api/api/2/file_playcount/date_range/Jan_Vermeer_van_Delft_-_Lady_Standing_at_a_Virginal_-_National_Gallery,_London.jpg/20190101/20190121 Nothing but zeros, which is the same thing I get if I put in a nonexistent file.
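For anyone who wants to repeat this check on other files, here is a minimal Python sketch that builds the same kind of date-range URL. The `file_playcount/date_range/<file>/<start>/<end>` shape is copied from the URL above; `date_range_url` is a hypothetical helper, and nothing is assumed here about the response format.

```python
from urllib.parse import quote

# Base URL as given in this task.
API_BASE = "https://tools.wmflabs.org/mediaviews-api/api/2"


def date_range_url(file_name: str, start: str, end: str) -> str:
    """Build a date-range query URL; start and end are YYYYMMDD strings."""
    # File names use underscores instead of spaces. Commas are kept
    # unescaped to match the URL quoted above; everything else that is
    # not URL-safe gets percent-encoded.
    name = quote(file_name.replace(" ", "_"), safe=",")
    return f"{API_BASE}/file_playcount/date_range/{name}/{start}/{end}"


if __name__ == "__main__":
    print(date_range_url(
        "Jan Vermeer van Delft - Lady Standing at a Virginal - "
        "National Gallery, London.jpg",
        "20190101",
        "20190121",
    ))
```

Comparing the result for a known, heavily viewed image against a deliberately nonexistent file name is a quick way to tell "no data for images" apart from "no views".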

Tgr added a subscriber: Harej.Jan 22 2019, 6:54 PM

You are right, I didn't look at the output, just that it gives an OK response. Image views are in the same file, so probably a simple fix though? Maybe @Harej remembers if that was an intentional limitation or a bug.

Harej added a comment.Jan 22 2019, 7:36 PM

Maybe @Harej remembers if that was an intentional limitation or a bug.

I wanted to expand mediaplaycounts-api to include static images but I ran into trouble scaling the service to include the extra data. So it should be possible, as long as you don't assume your dataset can fit entirely within a single Redis instance.

mforns lowered the priority of this task from High to Normal.Mar 11 2019, 3:27 PM

Getting this sort of stats is important for our Videowiki efforts.

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Videowiki

Hope we see the tool back up soon.

I think @Doc_James is referring to T207208#5157588. Sorry, I pointed you to the wrong task; this one is about still imagery.

Harej added a comment.May 18 2019, 5:41 PM

Hypothetically T207208 is a parent task.

MusikAnimal moved this task from Backlog to On hold on the Tool-Pageviews board.May 24 2019, 8:50 PM
fdans added a subscriber: fdans.Jul 22 2019, 2:24 PM

@MusikAnimal @Doc_James @Tgr the following are the endpoints we're planning to roll out from the webrequest-based data we have currently:

https://wikitech.wikimedia.org/wiki/Analytics/AQS/Media_metrics#Endpoints_in_AQS

We'd love it if you could take a look and see if there's a better metric definition for your usecases given the data that we have. The available fields the mediacounts dataset will have are:

file name
number of bytes
count
referer wiki, if available, otherwise "external", "internal" or "unknown"
file classification (audio, video, image, document...)
file extension (svg, jpg, tif...)
transcoding (is an oga file being previewed as jpg?)
agent type (user or spider)
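
To make the field list above easier to reason about, here is a minimal Python sketch of one row as a record type, plus one example aggregation. All field names and types are assumptions chosen for illustration (the list above only gives informal descriptions), and `views_by_referer` is a hypothetical helper, not one of the planned AQS endpoints.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class MediacountsRow:
    """One aggregated row; names and types are illustrative assumptions."""
    file_name: str
    bytes_sent: int             # "number of bytes"
    count: int
    referer: str                # a wiki, or "external", "internal", "unknown"
    classification: str         # audio, video, image, document...
    extension: str              # svg, jpg, tif...
    transcoding: Optional[str]  # assumed: target format if transcoded, else None
    agent_type: str             # "user" or "spider"


def views_by_referer(rows: Iterable[MediacountsRow], file_name: str) -> Dict[str, int]:
    """Sum user (non-spider) counts for one file, grouped by referer."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        if row.file_name == file_name and row.agent_type == "user":
            totals[row.referer] += row.count
    return dict(totals)
```

A per-referer breakdown like this is one way to answer "how much is this image viewed on my wiki" from the fields listed.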
Tnegrin added a subscriber: Tnegrin.

Adding the SDC folks as we probably want to think about how this integrates with structured data. Or not.

Tgr added a comment.Jul 22 2019, 2:46 PM

Cool!

Is the project domain in the URL the referer, or the hosting project? In the latter case, having similar endpoints (file, site, top) for querying Commons-hosted images by referer would be very valuable IMO (I don't have a usecase personally but I imagine wiki editors will want to know how much the images are viewed on their site, as opposed to anywhere in Wikimedia properties).

Ramsey-WMF moved this task from Untriaged to Tracking on the Multimedia board.

This looks very promising! I'd like to go over it with @SandraF_WMF when she gets back first week of August to consider the GLAM perspective.

In addition to what Amanda mentions above, in regards to this little bit:

referer wiki, if available, otherwise "external", "internal" or "unknown"

Is there a way for us to expose more information about external embeds than just the big old bucket of "external"? It would be very helpful to get as much external use information as possible. Thanks!

@Ramsey-WMF: we shouldn't make granular referer information public but you can always access the raw data. Talk to Product Analytics or jump on the Hadoop cluster and take a look.

Nuria added a comment.Aug 30 2019, 9:20 AM

From data from 2018-08-28 (sampled 1/128), out of 23 million requests for files (to upload.wikimedia.org), about 137K are recorded as views from media-viewer. I cannot see mediaviewer requests being tagged by special headers, pinging @Tgr in case I am missing something.
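
For scale, the share implied by those two figures can be worked out with a short Python sketch. It uses only the numbers quoted above; the ratio does not depend on whether the 23 million is the sampled count or the scaled-up estimate, since the 1/128 sampling applies to both numerator and denominator alike. The names are hypothetical.

```python
# Figures quoted in the comment above (both approximate).
FILE_REQUESTS = 23_000_000
MEDIA_VIEWER_VIEWS = 137_000


def share(part: int, whole: int) -> float:
    """Fraction of `whole` that `part` represents."""
    return part / whole


if __name__ == "__main__":
    fraction = share(MEDIA_VIEWER_VIEWS, FILE_REQUESTS)
    # Prints the share as a percentage with two decimals.
    print(f"{fraction:.2%} of file requests recorded as media-viewer views")
```

That comes to well under 1% of all file requests, which covers all file types and all request sources, not just the file types Media Viewer supports.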