We currently don't have any way to track the number of times our videos are viewed. This was done by https://tools.wmflabs.org/commons-video-clicks/ so far, but it appears that tool is malfunctioning. This is a child task of the greater effort to collect and report these metrics from MediaWiki. Once that is underway, this task can use those metrics to produce a standard dataset and maybe expose it via the Pageview API.
Description
Details
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| Add DAGs for video metric aggregations + update test var JSONs | repos/data-engineering/airflow-dags!1488 | andrewtavis-wmde | wlb-video-metric-dags | main |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T108522 Record and report metrics for audio and video playback |
| Open | | None | T198628 Count the number of video plays |
Event Timeline
https://tools.wmflabs.org/mediaviews/ is an alternative tool using @Harej's mediaviews API, which is populated from https://dumps.wikimedia.org/other/mediacounts/. I think the commons-video-clicks tool probably just isn't using the new API endpoint, which was changed only a week or so ago.
It would be awesome to have this data exposed in the pageviews API. One thing that is possible with harej's API is to query by category (though I still haven't added support for this in Mediaviews). You may wish to consider something like this for the Pageviews API, too.
Yes, how would this new standard dataset differ from https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts ?
(Or if this is about work to remedy the shortcomings listed at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts#Selected_requests and get a more precise estimate of how many videos are actually being played / viewed by users, then yes, that would surely be worth a separate task.)
I think the difference is that this current dataset is strictly media downloads, which can happen any number of ways (clicking play on a video widget, submitting an HTTP request directly to the server). This is a superset of a more precise click metric which would capture clicking "play" as an event. Is this correct?
Yes, almost. Clicking "play" and also deciding what it means that someone "viewed" the video. Like, does watching 1/2 of it count? What about skipping through it? Those questions have standard answers on sites like YouTube, but our purposes may be different, so our answers may be different.
Copying over the task description from the duplicate T386916: Add view count for videos on Commons from @jan-david.franke_WMDE:
Feature summary (what you would like to be able to do and where):
For Wiki Loves Broadcast, a campaign based on the cooperation with public broadcasters in Germany, Switzerland, the UK, the Czech Republic, Ukraine and so on, we require metrics for videos on Commons – i.e. a tool that tracks video file requests (rather than including instances where a video file is merely loaded on a page).
Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
Through the efforts of Wiki Loves Broadcast, hundreds of videos have already been released by public broadcasters in Germany and beyond (e.g. https://commons.wikimedia.org/wiki/Category:Videos_by_Terra_X). Our partners keep asking us for reliable metrics on how often these videos were viewed in the Wikimedia projects.
For years, we have been using mvc.toolforge.org developed by Amir Sarabadani which is based on the mediaviews API, thinking that it would show the actual view count of videos uploaded to Commons (and in many cases integrated in Wikipedia articles). We have since learned that it actually measures something a lot closer to page views (of the articles these videos have been integrated in) and is therefore not fit for purpose for our cooperation with public broadcasters.
Metrics for video views are essential to Wiki Loves Broadcast, however. They are the key incentive we have in convincing public broadcasters to openly license their work by proving to them that sharing their work freely can help them broaden their audiences and fulfil their public mandate.
Through internal discussions at WMDE we believe that this metric could be derived from the base webrequest data where video open events are recorded. The accuracy of this data is not the best as sometimes video opens are fired more than once. The data would allow for distinct devices (user agent x IP hashes) that have opened a given video within a given period of interest (day, month, etc).
Benefits (why should this be implemented?):
If we were to have a reliable metric on video views, not only would this significantly help the Wiki Loves Broadcast campaign, but it would bolster all GLAM cooperations. The need in the community (as this Wish shows, for instance) and in the movement at large exists.
I'd like to note one part of the above:
Through internal discussions at WMDE we believe that this metric could be derived from the base webrequest data where video open events are recorded. The accuracy of this data is not the best as sometimes video opens are fired more than once. The data would allow for distinct devices (user agent x IP hashes) that have opened a given video within a given period of interest (day, month, etc).
This is a suggested stopgap solution so that the stakeholders involved can have base "unique viewers per period" metrics in the interim while a more expansive video play metric is developed. Given the merge of T386916 into this task, the above could be made into a subtask for this task with the explicit work to be done being detailed in the description.
andrewtavis-wmde opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1488
Add DAGs for video metric aggregations + update test var JSONs
andrewtavis-wmde merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1488
Add DAGs for video metric aggregations + update test var JSONs
Note that the MR mentioned above, #1488 on airflow-dags, is a DAG that collects video view metrics specifically for videos with categories that are needed for the Wiki Loves Broadcast project. Metrics are being collected from April 1st 2025 on, with the next step being going through the process of getting them checked to hopefully release them publicly.
@AndrewTavis_WMDE just a note that browsers can preload video, so request data does not necessarily correlate to an actual view. While the JS player does not currently preload on pageview (I think?), this is actually something that I think we should change (it's a status quo left over from the old implementation).
Thanks for checking in here, @TheDJ 👋 Can we check the conditions with you? As I understood it, the approach we'd taken gets around the issue of videos being preloaded by browsers. We were aware that the original methods used for this didn't meet requirements. The crux of the approach is a query on wmf.webrequest with the following conditions, which we checked against the requests made when the actual thumbnail is clicked:
SELECT DISTINCT
    regexp_extract(uri_path, '/([^/]+)$', 1) AS video_filename,
    ip AS ip,
    user_agent AS user_agent
FROM wmf.webrequest
WHERE uri_host = 'upload.wikimedia.org'
    AND content_type LIKE 'video/%'
    AND webrequest_source = 'upload'
    AND agent_type = 'user'
We're then counting unique IP x user agent combinations on a daily basis to get the rough number of actors viewing the video in a day, which is then aggregated over a month. This is not exact by any means, but we figured that the approach of "count someone twice only if they click the video more than once in a month rather than a day" made sense. The video filename is also matched with the categories that are related to Wiki Loves Broadcast.
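The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code from the merged DAG: the function names, the row layout `(day, video_filename, ip, user_agent)`, and the use of SHA-256 to combine IP and user agent are all assumptions made for the example. It reads the description as "deduplicate actors per video per day, then sum the daily counts within each month".

```python
import hashlib
from collections import defaultdict


def actor_key(ip, user_agent):
    """Combine IP and user agent into one opaque key (hypothetical scheme)."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


def monthly_unique_actor_views(rows):
    """Sum daily distinct-actor counts per video over each month.

    rows: iterable of (day, video_filename, ip, user_agent), where day is
    an ISO date string such as "2025-04-01" (assumed layout).
    """
    daily_actors = defaultdict(set)  # (day, video) -> set of actor keys
    for day, video, ip, user_agent in rows:
        daily_actors[(day, video)].add(actor_key(ip, user_agent))

    monthly = defaultdict(int)  # (YYYY-MM, video) -> summed daily uniques
    for (day, video), actors in daily_actors.items():
        monthly[(day[:7], video)] += len(actors)
    return dict(monthly)


example_rows = [
    ("2025-04-01", "Example.webm", "198.51.100.7", "UA-1"),
    ("2025-04-01", "Example.webm", "198.51.100.7", "UA-1"),  # same actor, same day
    ("2025-04-02", "Example.webm", "198.51.100.7", "UA-1"),  # same actor, next day
    ("2025-04-02", "Example.webm", "203.0.113.9", "UA-2"),
]
print(monthly_unique_actor_views(example_rows))
```

Under this reading, an actor who opens the same video on two different days contributes two to the monthly figure, while repeated requests within one day collapse to one.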
Can you give examples of these requests that we are filtering? If they are directly on the video, are you only checking requests that have either no Range header or a Range header that begins with 0-? Videos are often progressive downloads, so you'd be counting each chunk being downloaded if you don't account for that, whereas Ranges that begin with 0 at least ensure you only count the first chunk of the download.
Though I guess when you collate to per day, it doesn't matter too much, if not too many people use the same ip + browser combo.
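The Range check suggested above could look something like this. It is a sketch under the assumption that the raw `Range` request header value is available for each request; the helper name `is_first_chunk` is hypothetical, and multi-range or malformed headers are conservatively treated as "not the first chunk".

```python
import re

# Matches a single byte range such as "bytes=0-", "bytes=0-1023" or "bytes=-500".
_SINGLE_RANGE = re.compile(r"^bytes=(\d*)-(\d*)$")


def is_first_chunk(range_header):
    """Return True if the request has no Range header or its range starts at byte 0."""
    if not range_header:
        return True  # a plain request for the whole file
    match = _SINGLE_RANGE.match(range_header.strip())
    if match is None:
        return False  # multi-range or malformed header: do not count it
    start = match.group(1)
    # An empty start means a suffix range ("last N bytes"), which is not the first chunk.
    return start != "" and int(start) == 0
```

Counting only requests for which `is_first_chunk` is true would avoid counting every chunk of a progressive download, though it does nothing to distinguish a click from a browser preload.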
As far as I can tell from your notes, there is nothing that guards you against preload requests, other than them currently not happening. You cannot tell server-side whether someone clicked on a video or whether the video was preloaded by the browser. It's just that we currently don't allow preloading.
This is an example from the current Commons main page for instance: It has "preload=none". This delays video playback when people start the video. Ideally this is not set, the browser is capable of making its own assessment. Especially on the File page I've been thinking about removing that setting, as doing so will make playback a lot faster to start in ideal conditions.
<video id="mwe_player_0" poster="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Cat_lapping_water_off_ground_in_slow_motion.gk.webm/450px--Cat_lapping_water_off_ground_in_slow_motion.gk.webm.jpg" controls="" preload="none" data-mw-tmh="" class="mw-file-element" width="450" height="253" data-durationhint="24" data-mwtitle="Cat_lapping_water_off_ground_in_slow_motion.gk.webm" data-mwprovider="local">
<source src="https://upload.wikimedia.org/wikipedia/commons/transcoded/a/ac/Cat_lapping_water_off_ground_in_slow_motion.gk.webm/Cat_lapping_water_off_ground_in_slow_motion.gk.webm.480p.vp9.webm" type="video/webm; codecs=&quot;vp9, opus&quot;" data-transcodekey="480p.vp9.webm" data-width="854" data-height="480">
<source src="https://upload.wikimedia.org/wikipedia/commons/transcoded/a/ac/Cat_lapping_water_off_ground_in_slow_motion.gk.webm/Cat_lapping_water_off_ground_in_slow_motion.gk.webm.720p.vp9.webm" type="video/webm; codecs=&quot;vp9, opus&quot;" data-transcodekey="720p.vp9.webm" data-width="1280" data-height="720">
[....]
</video>
I don't think this should block you right now, but it is a big fat exception that needs to be documented, in case it ever does change.
Additional note: if you also count the thumbnail poster downloads of videos, then by combining these numbers you can give a very rough estimate of the click-through rate for the impressions (inclusions of videos in pages). In industry, this is called the "play rate" metric.
Posters being:
https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Cat_lapping_water_off_ground_in_slow_motion.gk.webm/450px--Cat_lapping_water_off_ground_in_slow_motion.gk.webm.jpg (there can be multiple distinct ones of these per video due to the ability to choose a timestamp for a poster [not applied in this example])
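A rough sketch of the play rate idea follows. It assumes poster URLs have the shape shown above, with the video filename as the path segment just before the thumbnail name, so that several distinct posters of one video map back to the same file. Both helper names are hypothetical, and how plays and poster downloads are counted is left open.

```python
from urllib.parse import unquote, urlparse


def video_from_poster_url(url):
    """Return the video filename a poster thumbnail belongs to, or None.

    Assumes the .../thumb/<x>/<xy>/<video filename>/<thumbnail name> layout
    seen in the example URL above.
    """
    parts = urlparse(url).path.split("/")
    if "thumb" not in parts or len(parts) < 2:
        return None
    return unquote(parts[-2])


def play_rate(plays, poster_impressions):
    """Plays divided by poster impressions; None when there are no impressions."""
    if poster_impressions <= 0:
        return None
    return plays / poster_impressions
```

For instance, `play_rate(1, 4)` gives `0.25`, meaning one play for every four poster downloads. Since posters can be cached and plays are themselves only approximated from requests, the result is best read as a coarse indicator.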
Thanks for all the information, @TheDJ :) Bringing @Ben.buchenau in as well as we've been discussing this. Does seem like we're reliant on preload=none for this. Is there a way for us to monitor if and when this change is made? I guess we'd see the spike in the data.
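Besides watching for a spike in the data, one possible way to notice the change would be to periodically inspect rendered page HTML for `<video>` elements that no longer carry `preload="none"`. The sketch below uses only Python's standard `html.parser`; the class and function names are hypothetical, and fetching the page HTML is deliberately left out.

```python
from html.parser import HTMLParser


class VideoPreloadScanner(HTMLParser):
    """Collect the preload attribute value of every <video> start tag."""

    def __init__(self):
        super().__init__()
        self.preload_values = []

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            # A missing attribute is recorded as None.
            self.preload_values.append(dict(attrs).get("preload"))


def videos_without_preload_none(html):
    """Count <video> elements whose preload attribute is not exactly "none"."""
    scanner = VideoPreloadScanner()
    scanner.feed(html)
    return sum(1 for value in scanner.preload_values if value != "none")


sample = '<video preload="none"></video><video controls=""></video>'
print(videos_without_preload_none(sample))
```

A non-zero count on pages that previously returned zero would be a hint that the player markup has changed and the request-based counts need re-checking.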
I wouldn't know the exact process needed, but once preload=none is removed, the only way forward without an actual video view solution (which, to my knowledge, isn't on a roadmap) would be to filter out video open events that occur too close in time to the page being navigated to. We can consider this a bit later, once the change to preload the videos is imminent.
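As a very rough sketch of that filtering idea: given the pageview timestamps of one actor and the timestamp of a video request from the same actor, a request that follows a pageview too closely could be treated as a likely preload. Everything here is an assumption for illustration, including the function name, the two-second threshold, and the premise that pageviews and video requests can be joined per actor at all.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def likely_user_initiated(video_ts, pageview_ts_sorted, min_gap=timedelta(seconds=2)):
    """Guess whether a video request was a click rather than a preload.

    pageview_ts_sorted: ascending list of pageview datetimes for the same actor.
    min_gap: hypothetical threshold; requests sooner than this after the most
    recent pageview are treated as preloads.
    """
    index = bisect_right(pageview_ts_sorted, video_ts)
    if index == 0:
        return True  # no earlier pageview known for this actor
    return video_ts - pageview_ts_sorted[index - 1] >= min_gap


pageviews = [datetime(2025, 4, 1, 12, 0, 0)]
print(likely_user_initiated(datetime(2025, 4, 1, 12, 0, 1), pageviews))
print(likely_user_initiated(datetime(2025, 4, 1, 12, 0, 10), pageviews))
```

Any such threshold would misclassify fast clickers and slow preloads, so this could only ever be a heuristic.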
This is something we at VideoWiki would love to see. Accurate metrics for number of plays of videos and duration of video played. Andrew do you plan to look at that second bit?
Would be nice to be able to break down videos by the Wikipedia they are coming from, and would also be nice if we could figure out views coming from third-party wikis via InstantCommons.
Hey @Doc_James 👋 Would be nice if we could have these kinds of high-level metrics, but that's out of the scope of the current project, which is just trying to get metrics that are as good as possible within the limits of the available data. Exact instrumentation of a video play, including duration watched, would likely require WMF data engineering to get involved. This is a known issue but, as WMDE understood it, isn't something that can be prioritized right now, so we chose to go with the current approach.
Okay thanks, we are funding Yaron to help with this work and hopefully make some headway with more video specific metrics.
Hi @AndrewTavis_WMDE, yes I'm supporting Yaron in his work here. He has a proof of concept metric implementation. We just have to get it incorporated and deployed, but that may take some doing.
Thanks for the note here, @Milimetric! Please let us know when all of this is finished up and we'll switch our process over to the new metrics.
Hi @Ahoelzl. We have new interest in this work and volunteers to do it. I'm also tagging Experiment Platform because we can support this work with it. cc @phuedx.
A volunteer has the ability and know-how to develop an instrument that could track video counts in JS. We could use the web_base schema for this and collect minimal context and actions to accomplish the dataset they need. On our side we'd need to create a pipeline to aggregate and publish the data. We could configure Wikistats with a new metric to visualize the data.
So I'm happy to shepherd this work and do the pipeline part. But I wanted to check with all of you that it's ok and not getting in the way of other plans you have to support this type of thing.
Okay so we have video and audio views live and functional on Basque Wikipedia.
https://commons-play-listener.toolforge.org/dashboard/?website=eu.wikipedia.org
Basically one just needs an interface admin to activate it, but rather than activating it one wiki at a time, I'm wondering what the process would be to roll it out more broadly?