
Add mediacounts to pageview API
Closed, Duplicate (Public)

Description

On Commons, .webm and .ogv files currently don't have view counts like those on Vimeo or YouTube. It would be useful to see the plays and other data for videos.

Event Timeline

VictorGrigas raised the priority of this task from to Needs Triage.
VictorGrigas updated the task description.
VictorGrigas subscribed.
Aklapper triaged this task as Lowest priority. (Edited Feb 6 2015, 12:50 PM)

@VictorGrigas: Please associate a project to this task. Not even sure what this refers to - Commons?

Tbayer added projects: Analytics, Multimedia.
Tbayer set Security to None.
Tbayer subscribed.

I understand that it's indeed about Commons, and guess that Multimedia and Analytics might be relevant projects.

Tbayer added a subscriber: Tgr.

As @Tgr points out (and @ezachte confirmed on the talk page there), this proposal is related: https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts

The raw counts are at https://dumps.wikimedia.org/other/mediacounts/ since January 2015.

There is a zip file for each day with the top 1000 most requested files. For files relevant to videos, see mediacounts.2015-07-20.v00.sorted_keyXX.ogg.csv, where XX is 17, 18, 19 or 20 (more explanation is in the CSV files).

There is no API to query these huge files.
Once https://phabricator.wikimedia.org/T44259 is done, publishing these counts as well might be a nice follow-up.

I can query the files directly on a limited ad hoc basis.

Milimetric renamed this task from "Video view counts?" to "Add mediacounts to pageview API". (Dec 9 2015, 5:51 PM)
Milimetric moved this task from Incoming to Event Platform on the Analytics board.

The concern here is data size; we do have counts for media files that could be published to AQS.

Milimetric raised the priority of this task from Lowest to Medium. (Aug 9 2018, 3:45 PM)

If you're looking for a project that might use this API, the Event Metrics tool would love to be able to get an accurate count for pageviews to all articles on which a given file is placed.

Just so I'm clear, in addition to getting audio and video "plays," would this api provide the pageview counts I'm looking for?

This might just solve a problem we're having over here in DE. There are great content providers hesitating to contribute to Wikimedia Commons because they cannot implement reporting.

Let's fix this, all of you amazing coding folk.

Did some analysis in terms of data size and storage:

  • Our Cassandra instances (2 per host) each have 2.9Tb usable space.
  • We currently use ~720Gb per instance (this accounts for all keyspaces, replication included).
  • 98% of those 720Gb is used for pageview-per-article daily data

From a data-structure perspective, one row of pageview-per-article is a primary key (project, page-title, day) and 16 long values (denormalization of (desktop, mobile web, mobile app, all) x (user, spider, bot, all)). If we were to load mediacounts in Cassandra, we would have a very similar structure: a primary key (base-name, day) and ~18 long values (details to be discussed). Since the data structures are very similar, we can safely extrapolate the storage needed for mediacounts from the storage needed for pageview-per-article.
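To make the 16-value denormalization concrete, here is a minimal Python sketch that enumerates the (access method) x (agent type) combinations described above. The column naming scheme (for example `desktop_user`) and the key field names are assumptions for illustration only; they are not taken from the actual AQS schema.

```python
from itertools import product

# Access methods and agent types as listed in the comment above.
ACCESS_METHODS = ["desktop", "mobile_web", "mobile_app", "all"]
AGENT_TYPES = ["user", "spider", "bot", "all"]

# Primary keys as described above (field names are hypothetical spellings).
PAGEVIEW_KEY = ("project", "page_title", "day")
MEDIACOUNTS_KEY = ("base_name", "day")


def value_columns():
    """Return one hypothetical column name per (access, agent) pair."""
    return [f"{access}_{agent}" for access, agent in product(ACCESS_METHODS, AGENT_TYPES)]


columns = value_columns()
print(len(columns))  # 4 access methods x 4 agent types = 16
```

The ~18 values for mediacounts are not enumerated here because the comment leaves their details open.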

In terms of rows, we counted how many primary keys were loaded daily for pageview-per-article in October 2018, and the average is 73.7 million. Similarly, we counted how many primary keys there would be for mediacounts daily for the same period, and the average is 22.2 million. This tells us that adding mediacounts daily data to AQS would incur a growth of our data of ~30%.
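The ~30% figure follows directly from the two daily averages. A minimal sketch of the arithmetic, assuming storage scales with the number of rows and ignoring the difference between 16 and ~18 values per row (the function name is made up for illustration):

```python
# Average daily primary keys in October 2018, from the comment above.
PAGEVIEW_KEYS_PER_DAY = 73.7e6
MEDIACOUNTS_KEYS_PER_DAY = 22.2e6


def relative_growth(new_rows, existing_rows):
    """Fraction by which the daily row count grows when new_rows are added."""
    return new_rows / existing_rows


growth = relative_growth(MEDIACOUNTS_KEYS_PER_DAY, PAGEVIEW_KEYS_PER_DAY)
print(f"{growth:.1%}")  # 30.1%
```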

| Year | Cumulative storage without mediacounts | Cumulative storage with mediacounts |
| ---- | ---- | ---- |
| 2015 | 110Gb (half a year of pageview data) | 171Gb (10 months of mediacounts data) |
| 2016 | 330Gb | 464Gb |
| 2017 | 550Gb | 758Gb |
| 2018 | 770Gb | 1051Gb |
| 2019 | 990Gb | 1344Gb |
| 2020 | 1210Gb | 1638Gb |
| 2021 | 1430Gb | 1931Gb |

Assuming we put a limit at ~2Tb of usage per instance (keeping a large contingency margin), we would have enough storage until about the end of 2021.
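As a rough cross-check of that conclusion, the sketch below takes the cumulative figures from the table and finds the last year that stays within the ~2Tb limit. Beyond 2021 it assumes growth continues at the last observed yearly increment, which is an assumption of this sketch and not part of the original analysis; the function name is also hypothetical.

```python
LIMIT_GB = 2000  # the ~2Tb per-instance limit mentioned above

# Cumulative storage per instance in Gb, copied from the table above.
WITHOUT_MEDIACOUNTS = {2015: 110, 2016: 330, 2017: 550, 2018: 770,
                       2019: 990, 2020: 1210, 2021: 1430}
WITH_MEDIACOUNTS = {2015: 171, 2016: 464, 2017: 758, 2018: 1051,
                    2019: 1344, 2020: 1638, 2021: 1931}


def last_year_within(cumulative, limit):
    """Return the last year whose end-of-year total stays within limit.

    Years past the table are extrapolated linearly using the last
    observed yearly increment (an assumption of this sketch).
    """
    years = sorted(cumulative)
    step = cumulative[years[-1]] - cumulative[years[-2]]
    year, total = years[0], cumulative[years[0]]
    if total > limit:
        return None
    while True:
        next_total = cumulative.get(year + 1, total + step)
        if next_total > limit:
            return year
        year, total = year + 1, next_total


print(last_year_within(WITH_MEDIACOUNTS, LIMIT_GB))     # 2021
print(last_year_within(WITHOUT_MEDIACOUNTS, LIMIT_GB))  # 2023
```

Under the same linear assumption, pageview data alone would stay within the limit for roughly two more years than the combined data.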
I would really like us to try this one of these days :)