
Copyright detection (acoustic fingerprint matching) for audio files
Open, Medium, Public

Description

With Wikipedia Zero becoming a piracy host (T129845), we should consider implementing an acoustic fingerprint matching system (AcoustID, Echoprint, Gracenote MusicID, Shazam) to quickly alert admins, like what CorenSearchBot does for text.

See T132650#3273150 for candidates

Event Timeline

What projects is this task supposed to be in?

Not sure which projects this would fit into. It could be written as a bot or an extension. However, it would probably be wise to avoid WMF involvement lest Rights Holders point out policing to judges.

I wanted to put it up for grabs (I don't have time; cross-posted to Commons and enwiki), and it is supposedly easy to set up.

Riley_Huntley renamed this task from Music fingerprint coppyright bot to Copyright detection (acoustic fingerprint matching) for audio files. Apr 14 2016, 11:30 PM
Riley_Huntley triaged this task as Medium priority.

How about this as a GSoC task?

On a somewhat related note, I don't think we have a copyright detection bot for images

Aye, we don't have bots checking images for copyright violations.

@zhuyifei1999 Most of the work for this task isn't in setup (a few afternoons), but in the ongoing maintenance (updating code/databases, adding new databases, reviewing results). So the overhead of running a GSoC isn't worth it.

Regarding images, several commercial cloud computer-vision APIs are now available. They are rate limited and will likely only talk to @wikimediafoundation.org email addresses. But that's for another task.

@Riley_Huntley: As you prioritized this task, do you plan to work on this task? If yes, please also claim the task by setting yourself as assignee. Thanks!

@Aklapper: This was originally proposed as a bot task, which I was interested in attempting. However, after other options were proposed (an extension), I am waiting to hear what others think the best option would be. It would therefore be inappropriate for me to claim this task without having the skills to complete it if it were an extension. If you disagree with the priority, by all means change it.

An extension is unlikely to be installed since it needs to be written by a WMF employee/contractor (some trust thing), would take years for approval, and WMF doesn't clean up messes (They got unpaid volunteers to do that). A bot that hits watchlists with potential speedy deletions is "good enough".

An extension is unlikely to be installed since it needs to be written by a WMF employee/contractor

@Dispenser: That statement is incorrect. Steps and criteria for deploying extensions are available. For general meta-level discussions unrelated to this task's topic, please use better suited venues. Thanks.

@Riley_Huntley Are you still interested? MP3 support is coming (T162395), so I'd like something to deal with the inevitable flood of music.

To clarify my last comment, if we enable mp3 we are going to be swamped with massive amounts of copyvios.

How could this possibly work for Wikis that permit fair-use? Concerned there will be *many* false alarms.

How could this possibly work for Wikis that permit fair-use? Concerned there will be *many* false alarms.

I think we talk about non fair use wikis here.

There are a lot of acoustic fingerprint services, so many that we nuked our list on Wikipedia. ACRCloud has a rundown on their competitors, with some praise for FLOSS AcoustID. A 2012 investigation picked Last.fm (90 million songs).

Since the public database of acoustic fingerprints is small, a useful application of the tech is to identify re-uploads of deleted content.
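The re-upload idea above can be sketched without any third-party service. This is a minimal illustration, assuming raw Chromaprint-style fingerprints (lists of 32-bit integers, such as those `fpcalc -raw` prints) have already been stored for deleted files. The names `bit_error_rate` and `find_reupload` and the 0.15 threshold are hypothetical, and the comparison does no offset alignment, so it would only catch near-identical copies.

```python
from typing import Dict, List, Optional, Tuple


def bit_error_rate(a: List[int], b: List[int]) -> float:
    """Fraction of differing bits over the overlapping part of two raw fingerprints."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0  # nothing to compare, treat as a complete mismatch
    # XOR each pair of 32-bit values and count the set bits.
    diff = sum(bin((x ^ y) & 0xFFFFFFFF).count("1") for x, y in zip(a, b))
    return diff / (32 * n)


def find_reupload(
    new_fp: List[int],
    deleted: Dict[str, List[int]],
    max_ber: float = 0.15,  # illustrative threshold, not a tuned value
) -> Optional[Tuple[str, float]]:
    """Return (title, error rate) of the closest deleted file, or None."""
    best: Optional[Tuple[str, float]] = None
    for title, fp in deleted.items():
        ber = bit_error_rate(new_fp, fp)
        if ber <= max_ber and (best is None or ber < best[1]):
            best = (title, ber)
    return best
```

A bot could run `find_reupload` on each new audio upload and flag any hit for admin review rather than deleting automatically.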

@Fastily I think the main concern here, really, is simply Commons.... for any other wiki, the volume of 'valid' audio or video uploads should be sufficiently low that abuse would be easily recognizable.

@Fastily I think the main concern here, really, is simply Commons.... for any other wiki, the volume of 'valid' audio or video uploads should be sufficiently low that abuse would be easily recognizable.

If that is the case, then I can see this being suitable as a bot task for Commons, as opposed to a check (or extension) coded into the MediaWiki software.

Implementation update. See T132650#3273150 for the survey of technology.

  • AcoustID (Python lib): Integrated with an IRC bot for testing. Match count is low. There should be no obstacles bringing it to Tool Labs.
  • Echoprint with MooMa.sh fingerprints: The 100 GB is non-trivial, but doable on Tool Labs
  • ACRCloud: Signed up for a trial and used the sample code and the binary blob to test. Spoke with staff, who are willing to give access given our low usage in exchange for a mention on the user page and a blog post (I think theirs) about their technology (like how we used it).
  • Gracenote: Downloaded proprietary SDK, still need to compile it.
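For reference, the AcoustID item in the list above could be exercised with something like the sketch below. It assumes the pyacoustid package (which in turn needs Chromaprint or `fpcalc` installed) and its `acoustid.match` helper, which yields `(score, recording_id, title, artist)` tuples. The function names `best_match` and `identify`, the API key argument, and the 0.8 score cutoff are illustrative and not necessarily what the IRC bot uses.

```python
from typing import Iterable, Optional, Tuple

# Shape of one result as yielded by pyacoustid's match(): title/artist may be None.
Match = Tuple[float, str, Optional[str], Optional[str]]


def best_match(results: Iterable[Match], min_score: float = 0.8) -> Optional[Match]:
    """Pick the highest-scoring result at or above min_score, or None."""
    best: Optional[Match] = None
    for result in results:
        score = result[0]
        if score >= min_score and (best is None or score > best[0]):
            best = result
    return best


def identify(path: str, api_key: str, min_score: float = 0.8) -> Optional[Match]:
    """Fingerprint a local file and look it up on AcoustID (needs network access)."""
    import acoustid  # third-party package "pyacoustid", imported lazily

    return best_match(acoustid.match(api_key, path), min_score)
```

Keeping the selection logic in `best_match` separate from the network call makes it possible to test the cutoff without an API key.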

@Dispenser: Can you elaborate on "Match count is low."?

The front-end implementation could be a tool like CopyPatrol (but for audio instead of text). Or, if we wanted something quick and dirty, it could work like CorenSearchBot instead.

@kaldari I've experimentally added AcoustID into an IRC bot notifying Commons admins of audio and video files uploaded by newbies. Three files were recognized in the past week of WP0 abuse, I'd say it's <10% effective.

Could be a demographics thing, Moroccan Teenagers (WP0) vs Audiophiles (MusicBrainz Picard).
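The filtering step described above (only checking audio and video uploads from newbies) might look roughly like this. It is a sketch under assumptions: the `Upload` record, the extension list, the 50-edit cutoff, and the alert format are all hypothetical and not taken from the actual bot.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative list of audio/video extensions; adjust to what the wiki accepts.
AUDIO_VIDEO_EXTENSIONS = (".ogg", ".oga", ".ogv", ".opus", ".flac", ".wav", ".webm", ".mp3")


@dataclass
class Upload:
    title: str                # e.g. "File:Song.ogg"
    uploader: str
    uploader_edit_count: int


def needs_fingerprint_check(upload: Upload, max_edit_count: int = 50) -> bool:
    """True for audio/video files uploaded by accounts with few edits."""
    is_media = upload.title.lower().endswith(AUDIO_VIDEO_EXTENSIONS)
    return is_media and upload.uploader_edit_count <= max_edit_count


def format_alert(upload: Upload, match: Optional[str] = None) -> str:
    """Build a one-line notification suitable for an IRC channel."""
    line = f"[[{upload.title}]] uploaded by {upload.uploader} ({upload.uploader_edit_count} edits)"
    if match:
        line += f" - possible match: {match}"
    return line
```

Since the fingerprint match rate is low, the alert is sent with or without a match, leaving the judgment to the admins who see it.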

How could this possibly work for Wikis that permit fair-use? Concerned there will be *many* false alarms.

I think we talk about non fair use wikis here.

And also only wikis that can be read by anonymous users. Uploading fingerprints to third-party services reveals what files are in use, which we can't do for private wikis.

srishakatux subscribed.

Removing the Possible-Tech-Projects tag as we are planning to kill it soon! This project does not seem to fit in the Outreach-Programs-Projects category in its current state, so I am not adding that tag right now!