Copyright detection (acoustic fingerprint matching) for audio files
Open, Medium, Public
Assigned To

None

Authored By

	Dispenser
	Apr 14 2016, 12:44 AM

Description

With Wikipedia Zero becoming a piracy host (T129845), we should consider implementing an acoustic fingerprint matching system (AcoustID, Echoprint, Gracenote MusicID, Shazam) to quickly alert admins, much like what CorenSearchBot does for text.

See T132650#3273150 for candidates

Related Objects

Status	Assigned	Task
Resolved	kaldari	T120288 Enable MP3 uploads on Wikimedia Commons and TMH playback
Resolved	kaldari	T162395 Add .mp3 to the list of accepted file types on Wikimedia Commons uploads
Open	None	T134802 Improve the curator workflow for reviewing new files
Open	None	T120453 Copyright violation detection tool for Commons
Open	None	T132650 Copyright detection (acoustic fingerprint matching) for audio files

Event Timeline

Dispenser created this task.Apr 14 2016, 12:44 AM

Restricted Application added a subscriber: Aklapper.Apr 14 2016, 12:44 AM

What projects is this task supposed to be in?

Gunnex subscribed.Apr 14 2016, 7:19 AM

Peachey88 subscribed.Apr 14 2016, 12:10 PM

zhuyifei1999 subscribed.Apr 14 2016, 2:53 PM

Riley_Huntley added projects: Commons, Wikimedia-Site-requests.Apr 14 2016, 11:28 PM

Riley_Huntley subscribed.

Restricted Application added subscribers: Poyekhali, JEumerus, Steinsplitter, Matanya.Apr 14 2016, 11:28 PM

Not sure which projects this would fit into. It could be written as a bot or an extension. However, it would probably be wise to avoid WMF involvement lest Rights Holders point out policing to judges.

I wanted to put it up for grabs (I don't have time, so I cross-posted to Commons and enwiki), and it is supposedly easy to set up.

Riley_Huntley renamed this task from Music fingerprint coppyright bot to Copyright detection (acoustic fingerprint matching) for audio files.Apr 14 2016, 11:30 PM

Riley_Huntley triaged this task as Medium priority.

Dispenser added projects: Commons, Wikimedia-Site-requests.Apr 15 2016, 12:21 AM

zhuyifei1999 removed a project: Wikimedia-Site-requests.Apr 15 2016, 12:54 AM

How about this as a GSoC task?

On a somewhat related note, I don't think we have a copyright detection bot for images

Aye, we don't have bots checking images for copyright violations.

Poyekhali awarded a token.Apr 15 2016, 11:49 AM

@zhuyifei1999 Most of the work for this task isn't in setup (a few afternoons), but in the ongoing maintenance (updating code/databases, adding new databases, reviewing results). So the overhead of running a GSoC isn't worth it.

Regarding images, several commercial cloud computer-vision APIs are now available. They are rate limited and will likely only talk to @wikimediafoundation.org email addresses. But that's for another task.

@Riley_Huntley: As you prioritized this task, do you plan to work on this task? If yes, please also claim the task by setting yourself as assignee. Thanks!

Aklapper added a project: Possible-Tech-Projects.Apr 16 2016, 4:45 PM

@Aklapper: This was originally proposed as a bot task, which I was interested in attempting. However, now that other options have been proposed (an extension), I am waiting to hear what others think the best option would be. It would therefore be inappropriate for me to claim this task without having the skills to complete it (if it were an extension). If you disagree with the priority, by all means change it.

An extension is unlikely to be installed since it needs to be written by a WMF employee/contractor (some trust thing), would take years for approval, and WMF doesn't clean up messes (They got unpaid volunteers to do that). A bot that hits watchlists with potential speedy deletions is "good enough".

In T132650#2212665, @Dispenser wrote:

An extension is unlikely to be installed since it needs to be written by a WMF employee/contractor

@Dispenser: That statement is incorrect. Steps and criteria for deploying extensions are available. For general meta-level discussions unrelated to this task's topic, please use better suited venues. Thanks.

@Riley_Huntley Are you still interested? MP3 support is coming T162395, so I'd like something to deal with the inevitable flood of music.

Oh gawd.

zhuyifei1999 mentioned this in T162395: Add .mp3 to the list of accepted file types on Wikimedia Commons uploads.Apr 11 2017, 5:17 PM

Framawiki subscribed.Apr 16 2017, 9:33 AM

To clarify my last comment, if we enable mp3 we are going to be swamped with massive amounts of copyvios.

El_Grafo subscribed.Apr 16 2017, 6:09 PM

How could this possibly work for Wikis that permit fair-use? Concerned there will be *many* false alarms.

In T132650#3185630, @Fastily wrote:

How could this possibly work for Wikis that permit fair-use? Concerned there will be *many* false alarms.

I think we talk about non fair use wikis here.

BethNaught subscribed.Apr 17 2017, 11:50 AM

zhuyifei1999 added a parent task: T162395: Add .mp3 to the list of accepted file types on Wikimedia Commons uploads.Apr 30 2017, 3:37 PM

Jeff_G subscribed.May 2 2017, 4:00 AM

TheDJ mentioned this in T120288: Enable MP3 uploads on Wikimedia Commons and TMH playback.May 10 2017, 8:31 AM

Krinkle awarded a token.May 12 2017, 9:39 PM

Krinkle subscribed.

zhuyifei1999 added a parent task: T120453: Copyright violation detection tool for Commons.May 13 2017, 6:07 PM

There are a lot of acoustic fingerprint services, so many that we nuked our list on Wikipedia. ACRCloud has a rundown on their competitors, with some praise for FLOSS AcoustID. A 2012 investigation picked Last.fm (90 million songs).

AcoustID - Open source. The API is free for non-commercial use and limited to 3 requests/sec. The database is CC-BY-SA 3.0 and PD, and is used by MusicBrainz Picard and VLC player. Ubuntu package [[http://packages.ubuntu.com/python-acoustid|python-acoustid]]. Needs the full, clean source for matching. 8.3 or 25.5 million fingerprints.
Gracenote SDK (GNSDK) - Free for non-commercial, limited to 100-1,000 API calls per day or less. 200 million tracks.
Echo Nest Audio Fingerprinting - Open source (GitHub). Spotify bought Echo Nest in Mar-2014, the song identification API was shut down in Jan-2015, email and telephone are dead (May-2017), and Spotify has no upload features. It covered 13 million songs (public) and 30 million songs (proprietary).
- MooMa.sh released 7.8 million fingerprints (102 GB) before discontinuing the Public API
ACRCloud Music Recognition - 14-day trial, commercial (Pricing). I emailed them; they're willing to work with us. 3rd party linking (for verification): Spotify, Apple Music, YouTube. GitHub: Audio recognition file scanner. 40 million fingerprints.
Doreso - developer.doreso.com seems to have been down since the end of 2015.
Axwave - Corporate licensing only?
Rovi Media Recognition - Tivo, so likely commercial only
No APIs: Shazam or SoundHound
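
Several of the services above are rate limited, so whatever bot does the lookups would probably need a client-side throttle. A minimal sketch, assuming the 3 requests/sec figure quoted for AcoustID; the `Throttle` class and its injectable `clock`/`sleep` arguments are hypothetical helpers for illustration, not part of any of these SDKs:

```python
import time


class Throttle:
    """Space calls at least 1/rate seconds apart (hypothetical helper)."""

    def __init__(self, rate_per_sec=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        if delay:
            self.sleep(delay)
        # Reserve the following slot relative to when this request starts.
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return delay
```

Calling `throttle.wait()` before each lookup should keep a single-threaded bot under the limit; a daily quota such as Gracenote's would need a separate counter.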

Since the public database of acoustic fingerprints is small, a useful application of the tech is to identify re-uploads of deleted content.
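
A rough sketch of how the re-upload check could work without any third-party database. It assumes each deleted file's fingerprint was stored as a list of 32-bit integers (Chromaprint's raw format, as I understand it) and compares by bit error rate. The function names, the `deleted_index` dict, and the 0.15 threshold are all made up for illustration, and it ignores time offsets between the two files:

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits over the overlapping prefix of two raw fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 1.0  # nothing to compare, treat as a complete mismatch
    # XOR each pair of 32-bit values and count the set bits.
    diff = sum(bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp_a, fp_b))
    return diff / (32 * n)


def find_reupload(new_fp, deleted_index, max_ber=0.15):
    """Return the title of the closest deleted file under the threshold, else None."""
    best_title, best_ber = None, max_ber
    for title, old_fp in deleted_index.items():
        ber = bit_error_rate(new_fp, old_fp)
        if ber <= best_ber:
            best_title, best_ber = title, ber
    return best_title
```

A linear scan like this is only reasonable while the list of deleted audio files is small; the threshold would need tuning against real re-encodes.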

MZMcBride subscribed.May 22 2017, 1:30 AM

WingedBladesofGodric subscribed.May 25 2017, 11:37 AM

@Fastily I think the main concern here, really, is simply Commons.... for any other wiki, the volume of 'valid' audio or video uploads should be sufficiently low that abuse would be easily recognizable.

Keegan subscribed.May 29 2017, 3:39 AM

In T132650#3297350, @Revent wrote:

@Fastily I think the main concern here, really, is simply Commons.... for any other wiki, the volume of 'valid' audio or video uploads should be sufficiently low that abuse would be easily recognizable.

If that is the case, then I can see this being suitable as a bot task for Commons, as opposed to a check (or extension) coded into the MediaWiki software.

Implementation update. See T132650#3273150 for the survey of technology.

AcoustID (Python lib): Integrated with an IRC bot for testing. Match count is low. There should be no obstacles to bringing it to Tool Labs.
Echoprint with MooMa.sh fingerprints: The 100 GB is non-trivial, but doable on Tool Labs.
ACRCloud: Signed up for the trial and used the sample code and the binary blob to test. Spoke with staff, who are willing to give access given our low usage, in exchange for a mention on the user page and a blog post (I think theirs) on their technology (i.e. how we used it).
Gracenote: Downloaded proprietary SDK, still need to compile it.
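
For anyone who wants to try the AcoustID route, a lookup with the pyacoustid library can be sketched roughly as below. This is not the IRC bot's actual code. It assumes `acoustid.match(apikey, path)` yields `(score, recording_id, title, artist)` tuples, as in the library's README; the `best_match` wrapper, the `match` parameter (there so the function can be tried without the network), and the 0.8 score cutoff are hypothetical:

```python
def best_match(path, api_key, min_score=0.8, match=None):
    """Return (score, recording_id, title, artist) for the best hit, or None."""
    if match is None:
        # pyacoustid; also needs Chromaprint (the fpcalc tool or the library).
        import acoustid
        match = acoustid.match
    # Keep only hits at or above the (assumed) confidence cutoff.
    hits = [hit for hit in match(api_key, path) if hit[0] >= min_score]
    return max(hits, key=lambda hit: hit[0]) if hits else None
```

Something like `best_match("upload.ogg", "YOUR_API_KEY")` would then give the bot a single candidate to report, or `None` if nothing in the database clears the cutoff.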

Krinkle unsubscribed.Jun 18 2017, 6:32 PM

@Dispenser: Can you elaborate on "Match count is low."?

The front-end implementation could be a tool like CopyPatrol (but for audio instead of text). Or, if we wanted something quick and dirty, it could work like CorenSearchBot instead.

ZhouZ subscribed.Jun 21 2017, 10:17 PM

@kaldari I've experimentally added AcoustID to an IRC bot that notifies Commons admins of audio and video files uploaded by newbies. Three files were recognized in the past week of WP0 abuse; I'd say it's <10% effective.

It could be a demographics thing: Moroccan teenagers (WP0) vs audiophiles (MusicBrainz Picard).

zhuyifei1999 mentioned this in T167815: Conduct MP3 patrol discussion.Jul 21 2017, 4:06 PM

Elitre subscribed.Jul 25 2017, 9:14 AM

RP88 subscribed.Nov 9 2017, 12:21 AM

Liuxinyu970226 subscribed.Nov 27 2017, 10:27 AM

Rapid grant for hosting proprietary tools?

Dereckson updated the task description.Nov 29 2017, 7:26 PM

In T132650#3185853, @Steinsplitter wrote:

In T132650#3185630, @Fastily wrote:

How could this possibly work for Wikis that permit fair-use? Concerned there will be *many* false alarms.

I think we talk about non fair use wikis here.

And also only wikis that can be read by anonymous users. Uploading fingerprints to third-party services reveals what files are in use, which we can't do for private wikis.

Dispenser updated the task description.Dec 2 2017, 5:03 AM

Liuxinyu970226 awarded a token.Dec 4 2017, 11:07 AM

Removing the Possible-Tech-Projects tag as we are planning to kill it soon! This project does not seem to fit in the Outreach-Programs-Projects category in its current state, so I am not adding that tag right now!

Aklapper added a project: Technical-Tool-Request.Jul 16 2018, 11:52 AM

Frostly mentioned this in T120453: Copyright violation detection tool for Commons.Jan 14 2023, 10:28 PM
