Description
On Commons, .webm and .ogv files currently don't have view counts, as Vimeo or YouTube have. It would be useful to see play counts and other data for videos.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Milimetric | T77541 Per-file view stats
Duplicate | | None | T88775 Add mediacounts to pageview API
Event Timeline
@VictorGrigas: Please associate a project to this task. Not even sure what this refers to - Commons?
I understand that it's indeed about Commons, and guess that Multimedia and Analytics might be relevant projects.
As @Tgr points out (and @ezachte confirmed on the talk page there), this proposal is related: https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
The raw counts are at https://dumps.wikimedia.org/other/mediacounts/ since January 2015.
There is a zip file for each day with the top 1000 most requested files; for videos, the relevant files are mediacounts.2015-07-20.v00.sorted_keyXX.ogg.csv, where XX is 17/18/19/20 (more explanation in the CSV files).
There is no API to query these huge files.
Once https://phabricator.wikimedia.org/T44259 is done, publishing these counts as well might be a nice follow-up.
I can query the files directly on a limited ad hoc basis.
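An ad hoc query like that can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual tooling: the column layout assumed here (file path first, total request count in the third column) should be checked against the explanation shipped with the CSV files.

```python
import csv
import io

# Hypothetical sketch: filter one day's mediacounts TSV for video files and
# report request totals. Column layout is an assumption; verify against the
# notes in the published CSVs.
def video_rows(tsv_text, extensions=(".webm", ".ogv")):
    result = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if row and row[0].lower().endswith(extensions):
            # row[2] is assumed to be the total request count
            result.append((row[0], int(row[2])))
    # Most-requested first
    return sorted(result, key=lambda r: r[1], reverse=True)

# Illustrative sample lines (made up, in the assumed layout)
sample = (
    "/wikipedia/commons/a/ab/Example.webm\t123456\t42\n"
    "/wikipedia/commons/c/cd/Photo.jpg\t999\t7\n"
)
print(video_rows(sample))  # → [('/wikipedia/commons/a/ab/Example.webm', 42)]
```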
The concern here is data size; we do have counts for media files that could be published to AQS.
If you're looking for a project that might use this API, the Event Metrics tool would love to be able to get an accurate count for pageviews to all articles on which a given file is placed.
Just so I'm clear: in addition to getting audio and video "plays," would this API provide the pageview counts I'm looking for?
This might just solve a problem we're having over here in DE. There are great content providers hesitating to contribute to Wikimedia Commons because they cannot implement reporting.
Let's fix this, all of you amazing coding folk.
Did some analysis in terms of data size and storage:
- Our Cassandra instances (2 per host) each have 2.9 TB of usable space.
- We currently use ~720 GB per instance (this accounts for all keyspaces, replication included).
- 98% of those 720 GB is used for pageview-per-article daily data.
From a data-structure perspective, one row of pageview-per-article is a primary key (project, page-title, day), and 16 long values (denormalization of (desktop, mobile web, mobile app, all) x (user, spider, bot, all)). If we were to load mediacounts in cassandra, we would have a very similar structure: a primary-key (base-name, day), and ~18 long values (details to be discussed). Datastructures being very similar, we can safely extrapolate storage needed for mediacounts based on the one needed for pageview-per-article.
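The two row shapes being compared can be sketched as follows; the field names are illustrative only, not the actual AQS/Cassandra schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch of the row shapes described above (not the real schema).
@dataclass
class PageviewRow:
    key: Tuple[str, str, str]  # primary key: (project, page-title, day)
    counts: List[int]          # 16 longs: (desktop, mobile web, mobile app, all)
                               #           x (user, spider, bot, all)

@dataclass
class MediacountsRow:
    key: Tuple[str, str]       # primary key: (base-name, day)
    counts: List[int]          # ~18 longs, exact breakdown to be discussed

row = MediacountsRow(key=("Example.webm", "2018-10-01"), counts=[0] * 18)
print(len(row.counts))  # → 18
```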
In terms of rows, we counted how many primary keys were loaded daily for pageview-per-article in October 2018; the average is 73.7 million. Similarly, we counted how many primary keys there would be for mediacounts daily over the same period; the average is 22.2 million. This tells us that adding mediacounts daily data to AQS would increase our data volume by ~30%.
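The ~30% figure follows directly from the two daily averages:

```python
# Back-of-envelope check of the growth estimate, using the October 2018 averages.
pageview_rows_per_day = 73.7e6     # avg daily primary keys, pageview-per-article
mediacounts_rows_per_day = 22.2e6  # avg daily primary keys, mediacounts
growth = mediacounts_rows_per_day / pageview_rows_per_day
print(f"{growth:.0%}")  # → 30%
```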
Year | Cumulative storage without mediacounts | Cumulative storage with mediacounts
---|---|---
2015 | 110 GB (half a year of pageview data) | 171 GB (10 months of mediacounts data)
2016 | 330 GB | 464 GB
2017 | 550 GB | 758 GB
2018 | 770 GB | 1051 GB
2019 | 990 GB | 1344 GB
2020 | 1210 GB | 1638 GB
2021 | 1430 GB | 1931 GB
Assuming we put a limit at ~2 TB of usage per instance (keeping a big contingency space), we would have enough storage until about the end of 2021.
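That runway estimate can be reproduced with a quick extrapolation. The ~293 GB yearly increment is read off the table above (the year-over-year difference of the "with mediacounts" column) and is an assumption of steady growth, not a measured value.

```python
# Hypothetical extrapolation against an assumed ~2 TB per-instance cap, using
# the yearly increment implied by the table (~293 GB/yr with mediacounts).
cap_gb = 2000
storage_gb = 171  # end-of-2015 cumulative storage with mediacounts, from the table
year = 2015
while storage_gb + 293 <= cap_gb:
    year += 1
    storage_gb += 293
print(year)  # → 2021, i.e. enough storage until about the end of 2021
```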
I really would like us to try this one of these days :)