Description
On Commons, .webm and .ogv files currently don't have view counts, as Vimeo or YouTube have. It would be useful to see play counts and other data for videos.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Milimetric | T77541 Per-file view stats
Duplicate | | None | T88775 Add mediacounts to pageview API
Event Timeline
@VictorGrigas: Please associate a project to this task. Not even sure what this refers to - Commons?
I understand that it's indeed about Commons, and guess that Multimedia and Analytics might be relevant projects.
As @Tgr points out (and @ezachte confirmed on the talk page there), this proposal is related: https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
The raw counts are at https://dumps.wikimedia.org/other/mediacounts/ since January 2015.
There is a zip file for each day with the top 1000 most requested files; for videos, the relevant files are mediacounts.2015-07-20.v00.sorted_keyXX.ogg.csv, where XX is 17/18/19/20 (more explanation in the CSV files).
There is no API to query these huge files.
Once https://phabricator.wikimedia.org/T44259 is done, publishing these counts as well might be a nice follow-up.
I can query the files directly on a limited ad hoc basis.
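An ad hoc query like that can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual tooling: the column layout assumed here (file path first, total request count in the third column) should be checked against the explanation shipped with the CSV files.

```python
import csv
import io

# Hypothetical sketch: filter one day's mediacounts TSV for video files and
# report request totals. Column layout is an assumption; verify against the
# notes in the published CSVs.
def video_rows(tsv_text, extensions=(".webm", ".ogv")):
    result = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if row and row[0].lower().endswith(extensions):
            # row[2] is assumed to be the total request count
            result.append((row[0], int(row[2])))
    # Most-requested first
    return sorted(result, key=lambda r: r[1], reverse=True)

# Illustrative sample lines (made up, in the assumed layout)
sample = (
    "/wikipedia/commons/a/ab/Example.webm\t123456\t42\n"
    "/wikipedia/commons/c/cd/Photo.jpg\t999\t7\n"
)
print(video_rows(sample))  # → [('/wikipedia/commons/a/ab/Example.webm', 42)]
```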
The concern here is data size; we do have counts for media files that could be published to AQS.
If you're looking for a project that might use this API, the Event Metrics tool would love to be able to get an accurate count for pageviews to all articles on which a given file is placed.
Just so I'm clear: in addition to getting audio and video "plays," would this API provide the pageview counts I'm looking for?
This might just solve a problem we're having over here in DE. There are great content providers hesitating to contribute to Wikimedia Commons because they cannot implement reporting.
Let's fix this, all of you amazing coding folk.
Did some analysis in terms of data size and storage:
- Our Cassandra instances (2 per host) each have 2.9 TB of usable space.
- We currently use ~720 GB per instance (this accounts for all keyspaces, replication included).
- 98% of those 720 GB is used for pageview-per-article daily data.
From a data-structure perspective, one row of pageview-per-article is a primary key (project, page-title, day), and 16 long values (denormalization of (desktop, mobile web, mobile app, all) x (user, spider, bot, all)). If we were to load mediacounts in cassandra, we would have a very similar structure: a primary-key (base-name, day), and ~18 long values (details to be discussed). Datastructures being very similar, we can safely extrapolate storage needed for mediacounts based on the one needed for pageview-per-article.
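The two row shapes being compared can be sketched as follows; the field names are illustrative only, not the actual AQS/Cassandra schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch of the row shapes described above (not the real schema).
@dataclass
class PageviewRow:
    key: Tuple[str, str, str]  # primary key: (project, page-title, day)
    counts: List[int]          # 16 longs: (desktop, mobile web, mobile app, all)
                               #           x (user, spider, bot, all)

@dataclass
class MediacountsRow:
    key: Tuple[str, str]       # primary key: (base-name, day)
    counts: List[int]          # ~18 longs, exact breakdown to be discussed

row = MediacountsRow(key=("Example.webm", "2018-10-01"), counts=[0] * 18)
print(len(row.counts))  # → 18
```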
In terms of rows, we counted how many primary keys were loaded daily for pageview-per-article in October 2018; the average is 73.7 million. Similarly, we counted how many primary keys there would be for mediacounts daily over the same period; the average is 22.2 million. This tells us that adding mediacounts daily data to AQS would increase our data volume by ~30%.
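The ~30% figure follows directly from the two daily averages:

```python
# Back-of-envelope check of the growth estimate, using the October 2018 averages.
pageview_rows_per_day = 73.7e6     # avg daily primary keys, pageview-per-article
mediacounts_rows_per_day = 22.2e6  # avg daily primary keys, mediacounts
growth = mediacounts_rows_per_day / pageview_rows_per_day
print(f"{growth:.0%}")  # → 30%
```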
Year | Cumulative storage without mediacounts | Cumulative storage with mediacounts
---|---|---
2015 | 110 GB (half a year of pageview data) | 171 GB (10 months of mediacounts data)
2016 | 330 GB | 464 GB
2017 | 550 GB | 758 GB
2018 | 770 GB | 1051 GB
2019 | 990 GB | 1344 GB
2020 | 1210 GB | 1638 GB
2021 | 1430 GB | 1931 GB
Assuming we put a limit at ~2 TB of usage per instance (keeping a big contingency space), we would have enough storage until about the end of 2021.
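That runway estimate can be reproduced with a quick extrapolation. The ~293 GB yearly increment is read off the table above (the year-over-year difference of the "with mediacounts" column) and is an assumption of steady growth, not a measured value.

```python
# Hypothetical extrapolation against an assumed ~2 TB per-instance cap, using
# the yearly increment implied by the table (~293 GB/yr with mediacounts).
cap_gb = 2000
storage_gb = 171  # end-of-2015 cumulative storage with mediacounts, from the table
year = 2015
while storage_gb + 293 <= cap_gb:
    year += 1
    storage_gb += 293
print(year)  # → 2021, i.e. enough storage until about the end of 2021
```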
I really would like us to try this one of these days :)