Statistics for views of individual Wikimedia images
Closed, Resolved | Public | 0 Estimated Story Points

Assigned To
None
Authored By
Krinkle
Nov 24 2018, 1:54 AM

Description

I'd really like to be able to see how often a contributed file on Wikimedia Commons was viewed, specifically for photographs and other images.

Source

The webrequest data collected in production with varnishkafka does already include all file urls from upload.wikimedia.org (for all file types, and all wikis).

It is aggregated for all file types by file name (e.g. hits on thumbnails and transcoded versions count toward the original). See also https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts and https://dumps.wikimedia.org/other/mediacounts/.
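
A minimal sketch of what "count toward the original" could look like in practice. The URL layouts in the comments and the helper name `base_file_path` are assumptions for illustration; the actual Mediacounts job may normalize paths differently and covers more cases (for example archived versions).

```python
import re
from collections import Counter

# Assumed path layouts on upload.wikimedia.org (illustrative, not taken from this task):
#   original:   /wikipedia/commons/a/ab/Example.jpg
#   thumbnail:  /wikipedia/commons/thumb/a/ab/Example.jpg/120px-Example.jpg
#   transcoded: /wikipedia/commons/transcoded/a/ab/Example.ogv/Example.ogv.360p.webm
_DERIVED = re.compile(
    r"^(?P<prefix>/[^/]+/[^/]+)/(?:thumb|transcoded)"
    r"(?P<file>/[0-9a-f]/[0-9a-f]{2}/[^/]+)/[^/]+$"
)


def base_file_path(path: str) -> str:
    """Map a thumbnail or transcoded path to the path of the original file.

    Paths that do not match the assumed derived layouts are returned unchanged.
    """
    match = _DERIVED.match(path)
    if match is None:
        return path
    return match.group("prefix") + match.group("file")


def aggregate(paths):
    """Count requests per original file, folding derived versions into it."""
    return Counter(base_file_path(p) for p in paths)
```

With this, a hit on the original and a hit on one of its thumbnails would both be counted under the original's path.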

Index and query

As far as I know, these are not currently indexed or made queryable in production. If they were, they would make a very neat addition to the Analytics Query Service (AQS; the metrics at https://wikimedia.org/api/rest_v1/).

There is currently some level of aggregation happening in Toolforge somewhere, which is what powers https://tools.wmflabs.org/mediaviews/, but this seems currently limited to playable files only (e.g. file names ending with audio/video file extensions). At T149642#3028288, it was hinted at supporting images, but this doesn't currently seem to be the case. I'm not sure whether this is a filter or a technical limitation.

The aggregation is exposed at https://tools.wmflabs.org/mediaviews-api/api/2, but I couldn't find the source of it. It seems similar to https://github.com/harej/mediaplaycounts, and the code hosted there actually seems to have support for images, but that tool doesn't seem to be operational currently (it would presumably be at https://tools.wmflabs.org/mediaplaycounts/).

Frontend

There are a lot of ideas of how such an API could be used, including:

I think for the purposes of this task, any one of these would suffice to close, the rest can be done later.

Event Timeline

Thanks for creating this task!

There was talk of adding image mediacounts to the API at T206700. The actual task for this is at T88775, and a high-level overview of implementation details at T206700#4729565. I don't know if you're interested in helping build it, but when I have the time I thought I'd give it a go.

After the data is consumable via the API, I will happily integrate it into all the Pageviews apps :)

Will merge others into this, but keep in mind this nice analysis of the storage implications in Cassandra: T88775#4751882

Thanks for moving this to High priority @Milimetric. I see the title refers to "images" and "Commons." I'd like to ask:

  1. I assume that instead of images what we really mean here is "files." E.g., presumably this will also give us a count of pageviews (not plays) to video or audio files?
  2. My understanding is that it is actually pretty common for users to upload images, etc., directly to individual wikis. Can this track that as well?

To answer both of these questions, all files served on all wikis are first uploaded to upload.wikimedia.org. That's what's counted in the Mediacounts dataset, and I recommend a close reading of the first two small sections of https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts. There are corner cases listed there that explain more exactly how and what is being counted. But essentially, yes, images, audio, and movie files, counting the most common way that user agents would transfer them.

When using Media viewer to view images, some images are prefetched for better user experience, but may not yet have been shown to the user. Currently, those prefetched images are getting counted, as there is as of now no way to detect whether an image was actually shown to the user or not. (Analytics/Data Lake/Traffic/Mediacounts#Corner cases)

FWIW, there is a way to detect that - virtual media views (T89088) were developed for that specific purpose (and MediaViewer sets different headers, at least on reasonably modern browsers). It just wasn't implemented in the mediacounts logic.

Also, I wouldn't call them corner cases - if I remember the stats correctly, for file types supported by Media Viewer preloads would comprise over half of the requests.

Thanks very much for this info, I think we missed this or deprioritized it after it was set up. I'll change the docs and track work to improve the data here: T211030

This is important but orthogonal to the work of building the endpoint, I think that should still go ahead.

@Tgr I imagine that the special beacon was implemented in 2015 for 'virtual mediaviews' due to scaling concerns with the old eventlogging backend. Those concerns no longer exist, and as such it makes sense to migrate counts to the current eventlogging infrastructure. This is so we do not have 3 ways to do the same thing (i.e. counting requests/pageviews/virtual pageviews/whatever event...).

Is there a team that now owns the mediacounts data and can take care of moving media counts to eventlogging beacon? Once that is done and we have checked on quality of data, putting an API on top of it is easy and something Analytics can do.

@Tgr imagine that special beacon was implemented in 2015 for 'virtual mediaviews' due to scaling concerns with the old eventlogging backend.

That was one reason, also conceptually they are more like views than like events so we assumed it's easier for analysts to work with them if they end up in the webrequest table (which I guess was also more of a concern while EventLogging went into a different storage backend). You had the same conversation around Popups virtual views, I think.

Is there a team that now owns the mediacounts data and can take care of moving media counts to eventlogging beacon?

Not that I know. The potential candidates would be Multimedia, Reading Infrastructure and Analytics I suppose?

Reading Infrastructure and Analytics I suppose?

Our team can support any team in reading with migration of the beacon to EL infrastructure but I think it should be Reading driving the project: the teams instrumenting should be the teams that own the features as instrumentation needs to adapt and follow feature changes.

Still, if we got all the media requests from Media Viewer into EventLogging, we would not have all media requests for mediawiki in general. To do that, we'd have to go around instrumenting any place that fetches and renders media (in core, extensions, etc.). I think if we want thorough and sensical mediacounts until all that work happens, we need to handle the filtering on Hadoop.

@Milimetric Right. I see your point, there needs to be some parsing of the firehose of requests because not all media consumption can be "eventy-fied" (true for images if not for videos). That being said, my comment about the beacon data being migrated to eventlogging still stands: we do not want to have two ways of doing the exact same thing.

jmatazzoni renamed this task from Statistics for views of individual Wikimedia Commons images to Statistics for views of individual Wikimedia images. Dec 5 2018, 5:43 PM

In T210313#4787451, @Milimetric wrote:

To answer both of these questions, all files served on all wikis are first uploaded to upload.wikimedia.org. That's what's counted in the Mediacounts dataset...

Based on this answer, I've changed the name of the ticket, removing the word Commons.

The aggregation is exposed at https://tools.wmflabs.org/mediaviews-api/api/2, but I couldn't find the source of it. It seems similar to https://github.com/harej/mediaplaycounts, and the code hosted there actually seems to have support for images, but that tool doesn't seem to be operational currently (it would presumably be at https://tools.wmflabs.org/mediaplaycounts/).

At a glance harej/mediaplaycounts is the library for fetching media counts and harej/mediaplaycounts-app is the API built on it, which is exposed at https://tools.wmflabs.org/mediaviews-api/api/2. The API works fine with images.

... https://tools.wmflabs.org/mediaviews-api/api/2. The API works fine with images.

Are we sure? This is for January 20's Today's Featured Picture on enwiki (file lives on Commons): https://tools.wmflabs.org/mediaviews-api/api/2/file_playcount/date_range/Jan_Vermeer_van_Delft_-_Lady_Standing_at_a_Virginal_-_National_Gallery,_London.jpg/20190101/20190121 Nothing but zeros, which is the same thing I get if I put in a nonexistent file.
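
For reference, the query above can be rebuilt programmatically. This is only a sketch: the URL pattern is copied from the link in this comment, while the helper name `date_range_url` and the choice to percent-encode the file name (leaving commas as they appear in the link) are assumptions, and the response format is not documented in this task.

```python
from urllib.parse import quote

# Base URL as given in this task.
BASE = "https://tools.wmflabs.org/mediaviews-api/api/2"


def date_range_url(file_name: str, start: str, end: str) -> str:
    """Build a file_playcount/date_range URL for dates given as YYYYMMDD.

    The file name is percent-encoded, keeping commas literal so the result
    matches the link above; whether the API expects this is an assumption.
    """
    return f"{BASE}/file_playcount/date_range/{quote(file_name, safe=',')}/{start}/{end}"


url = date_range_url(
    "Jan_Vermeer_van_Delft_-_Lady_Standing_at_a_Virginal_-_National_Gallery,_London.jpg",
    "20190101",
    "20190121",
)
print(url)
```

Comparing the result for a real image against the result for a deliberately nonexistent file name is a quick way to repeat the check described here.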

You are right, I didn't look at the output, just that it gives an OK response. Image views are in the same file, so probably a simple fix though? Maybe @Harej remembers if that was an intentional limitation or a bug.

Maybe @Harej remembers if that was an intentional limitation or a bug.

I wanted to expand mediaplaycounts-api to include static images but I ran into trouble scaling the service to include the extra data. So it should be possible, as long as you don't assume your dataset can fit entirely within a single Redis instance.

mforns lowered the priority of this task from High to Medium. Mar 11 2019, 3:27 PM

Getting this sort of stats is important for our Videowiki efforts.

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Videowiki

Hope we see the tool back up soon.

I think @Doc_James is referring to T207208#5157588. Sorry, I pointed you to the wrong task, this one is about still imagery.

Hypothetically T207208 is a parent task.

@MusikAnimal @Doc_James @Tgr the following are the endpoints we're planning to roll out from the webrequest-based data we have currently:

https://wikitech.wikimedia.org/wiki/Analytics/AQS/Media_metrics#Endpoints_in_AQS

We'd love it if you could take a look and see if there's a better metric definition for your usecases given the data that we have. The available fields the mediacounts dataset will have are:

file name
number of bytes
count
referer wiki, if available, otherwise "external", "internal" or "unknown"
file classification (audio, video, image, document...)
file extension (svg, jpg, tif...)
transcoding (is an oga file being previewed as jpg?)
agent type (user or spider)
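
To make the field list above easier to discuss, here is one way a single row could be modeled. The class name, attribute names and types are hypothetical; only the list of fields and the three non-wiki referer values come from the comment above.

```python
from dataclasses import dataclass

# Referer values used when no referring wiki is available, as listed above.
NON_WIKI_REFERERS = frozenset({"external", "internal", "unknown"})


@dataclass(frozen=True)
class MediaRequestRow:
    """One aggregated row; attribute names are illustrative, not the real schema."""
    file_name: str
    bytes_sent: int        # number of bytes
    count: int             # number of requests
    referer: str           # referring wiki, or "external" / "internal" / "unknown"
    classification: str    # audio, video, image, document...
    extension: str         # svg, jpg, tif...
    transcoded: bool       # e.g. an oga file being previewed as jpg
    agent_type: str        # "user" or "spider"


def referer_is_wiki(row: MediaRequestRow) -> bool:
    """True when the referer field names a wiki rather than a fallback bucket."""
    return row.referer not in NON_WIKI_REFERERS
```

A per-wiki breakdown would then group rows where `referer_is_wiki` is true by `referer`, and report the remaining rows under their fallback bucket.
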
Tnegrin added a subscriber: Tnegrin.

Adding the SDC folks as we probably want to think about how this integrates with structured data. Or not.

Cool!

Is the project domain in the URL the referer, or the hosting project? In the latter case, having similar endpoints (file, site, top) for querying Commons-hosted images by referer would be very valuable IMO (I don't have a usecase personally but I imagine wiki editors will want to know how much the images are viewed on their site, as opposed to anywhere in Wikimedia properties).

This looks very promising! I'd like to go over it with @SandraF_WMF when she gets back first week of August to consider the GLAM perspective.

In addition to what Amanda mentions above, in regards to this little bit:

referer wiki, if available, otherwise "external", "internal" or "unknown"

Is there a way for us to expose more information about external embeds than just the big old bucket of "external"? It would be very helpful to get as much external use information as possible. Thanks!

@Ramsey-WMF: we shouldn't make granular referer information public but you can always access the raw data. Talk to Product Analytics or jump on the Hadoop cluster and take a look.

In data from 2018-08-28 (sampled 1/128), out of 23 million requests for files (to upload.wikimedia.org), about 137K are recorded as views from media-viewer. I cannot see mediaviewer requests being tagged by special headers; pinging @Tgr in case I am missing something.
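
The arithmetic behind these figures, as a small sketch. The inputs are the rounded numbers quoted in this comment; multiplying by 128 to undo the sampling is an assumption about how the sample relates to full traffic.

```python
SAMPLING_RATE = 128                   # data was sampled 1/128
sampled_file_requests = 23_000_000    # requests to upload.wikimedia.org in the sample
sampled_media_viewer = 137_000        # of those, recorded as views from media-viewer


def share(part: int, whole: int) -> float:
    """Fraction of `whole` represented by `part`."""
    return part / whole


def scale_up(sampled: int, rate: int = SAMPLING_RATE) -> int:
    """Rough estimate of the unsampled total, assuming uniform sampling."""
    return sampled * rate


mv_share = share(sampled_media_viewer, sampled_file_requests)
print(f"media-viewer share of file requests: {mv_share:.2%}")
print(f"estimated daily file requests: {scale_up(sampled_file_requests):,}")
print(f"estimated daily media-viewer views: {scale_up(sampled_media_viewer):,}")
```

The share is independent of the sampling rate, so under these numbers media-viewer views come to roughly 0.6% of file requests.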

@Tgr so as of today the mediaviewer sends pings to media/beacon, and I am not sure what happens from there. Now, there are no special headers for those requests to be found.

Ping to @fdans, as part of the media requests API we need to provide a measure of how many of those requests might be preloads for mediaviewer.

The way things are instrumented right now I do not think there are any headers on mediaviewer preload requests; as such, those are indistinguishable from regular image requests embedded in a page.

Requests to media/beacon contain the image URI and the viewing duration in the query parameters. Currently those are not used in any way.
MediaViewer preload requests can be recognized from the CORS headers, although that's not super distinctive. Since the loading happens via an <img> tag, adding headers would be a nontrivial change.
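
If those beacon requests were to be used, extracting the two values could look roughly like this. The query parameter names `uri` and `duration`, the example URL and the helper name are all assumptions; the comment above only states that the image URI and the viewing duration are passed as query parameters.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qs, urlsplit


class BeaconView(NamedTuple):
    image_uri: str
    duration: int  # unit is not stated in this task


def parse_beacon(url: str,
                 uri_param: str = "uri",
                 duration_param: str = "duration") -> Optional[BeaconView]:
    """Extract image URI and viewing duration from a beacon request URL.

    Parameter names are hypothetical defaults. Returns None if either value
    is missing or the duration is not an integer.
    """
    query = parse_qs(urlsplit(url).query)
    try:
        return BeaconView(query[uri_param][0], int(query[duration_param][0]))
    except (KeyError, ValueError):
        return None


# Hypothetical beacon request, for illustration only.
view = parse_beacon(
    "https://example.org/media/beacon?duration=1500&uri=https%3A%2F%2Fexample.org%2Fa.jpg"
)
print(view)
```

Counting such parsed views per `image_uri` would give a measure of images actually shown, as opposed to preloaded.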

Requests to media/beacon contain the image URI and the viewing duration in the query parameters. Currently those are not used in any way.

Understood, we filed a ticket for those to be migrated to events: T239630: Mediaviewer views should be reworked to be an eventlogging event

Nuria set the point value for this task to 0.

https://tools.wmflabs.org/mediaviews/ has been revived, making use of the new media request APIs :) Please create a task with Tool-Pageviews if you encounter issues.

Thanks! @MusikAnimal pinging Analytics so they know this has been done.

@BerndFiedlerWMDE Yes, it is the quotes and it is a known problem. Moved the issue to a different ticket, T247333: Image files with quotes do not resolve on the mediarequest API

Should this task stay reopened? It's confusing with the current title and description.