
Copyright violation detection tool for Commons
Open, Needs Triage, Public


The "anyone can upload" nature of Commons is, at times, subject to abuse by copyright violators. Some violating files stay up for long periods because people are afraid to tag them, and in other cases the outcome depends on the subjective discretion of admins. It would be helpful to have tools that help detect possible violations (perhaps through reverse-search APIs), as well as a better method for screening files once they are detected. Specifically, images (this task's main focus) and audio files (see subtask T132650) should be targeted.

This was originally proposed in the 2015 and 2022 Community Wishlist Surveys, placing #26 and #4 respectively.

Related Objects

Resolved Prtksxna

Event Timeline

DannyH raised the priority of this task from to Needs Triage.
DannyH updated the task description. (Show Details)
DannyH added a subscriber: DannyH.

Some specific ideas from the wishlist page:

I'm wondering if using Google Image Search at the time of upload would be useful for creating an autodetection method? We would also need to build a whitelist to reduce false positives. --@Doc_James

I think a good first step would be to use perceptual hashing to see if a similar image was previously deleted. I imagine lots of copyvios are uploaded again and again. --@Bawolff
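A minimal sketch of the perceptual-hashing idea, assuming the image has already been decoded and downscaled to a small grayscale grid (e.g. 8 rows by 9 columns). It uses a difference hash (dHash), which is only one of several perceptual hashes; the function names and the distance threshold are illustrative assumptions, not part of any existing Commons tool.

```python
from typing import List, Sequence


def dhash(pixels: Sequence[Sequence[int]]) -> int:
    """Difference hash: one bit per pair of horizontally adjacent pixels.

    `pixels` is a grayscale grid that was already downscaled
    (an 8x9 grid yields a 64-bit hash).
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def matches_deleted(candidate: int, deleted_hashes: List[int],
                    max_distance: int = 5) -> bool:
    """True if the candidate is within `max_distance` bits of any hash
    of a previously deleted file (threshold is an assumed value)."""
    return any(hamming(candidate, h) <= max_distance for h in deleted_hashes)
```

Because the hash only records whether each pixel is brighter than its neighbour, a uniformly brightened or re-compressed copy tends to produce the same or a nearby hash, which is what makes it suitable for spotting re-uploads.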

We could add new filters to [[c:Special:NewFiles]] tool so Commons users can browse for new uploads that are:

  • from new users (we already have newbie-upload tool),
  • from users with a lot of recent deletions,
  • do not have EXIF data,
  • are small,
  • known to Google Image Search,
  • were deleted before
  • do or do not claim {{own}} ("own work")
  • do or do not use [[c:template:Custom license]]

etc. All those factors increase a chance that an image is a Copyvio and it would be nice if we could add and remove those filters in any combination. --@Jarekt
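The "add and remove those filters in any combination" idea could be sketched as composable predicates over upload records. This is only an illustration: the `Upload` fields, the filter names and the thresholds are hypothetical and do not correspond to an existing Special:NewFiles API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Upload:
    # Hypothetical record; field names are assumptions for this sketch.
    title: str
    uploader_edit_count: int
    uploader_recent_deletions: int
    has_exif: bool
    width: int
    height: int
    previously_deleted: bool
    claims_own_work: bool


UploadFilter = Callable[[Upload], bool]


def from_new_user(max_edits: int = 50) -> UploadFilter:
    return lambda u: u.uploader_edit_count <= max_edits


def many_recent_deletions(min_deletions: int = 3) -> UploadFilter:
    return lambda u: u.uploader_recent_deletions >= min_deletions


def missing_exif(u: Upload) -> bool:
    return not u.has_exif


def is_small(max_pixels: int = 640 * 480) -> UploadFilter:
    return lambda u: u.width * u.height <= max_pixels


def claims_own(u: Upload) -> bool:
    return u.claims_own_work


def negate(f: UploadFilter) -> UploadFilter:
    """Supports the 'do or do not' variants of a filter."""
    return lambda u: not f(u)


def apply_filters(uploads: Iterable[Upload],
                  filters: Iterable[UploadFilter]) -> List[Upload]:
    """Keep uploads that satisfy every selected filter."""
    selected = list(filters)
    return [u for u in uploads if all(f(u) for f in selected)]
```

A patroller-facing UI would then just translate ticked checkboxes into a list such as `[from_new_user(), missing_exif, is_small()]` and pass it to `apply_filters`.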

See T31793 for an update, all theoretical right now

IMPORTANT: If you are a community developer interested in working on this task: The Wikimedia Hackathon 2016 (Jerusalem, March 31 - April 3) focuses on #Community-Wishlist-Survey projects. There is some budget for sponsoring volunteer developers. THE DEADLINE TO REQUEST TRAVEL SPONSORSHIP IS TODAY, JANUARY 21. Exceptions can be made for developers focusing on Community Wishlist projects until the end of Sunday 24, but not beyond. If you or someone you know is interested, please REGISTER NOW.
Poyekhali triaged this task as Medium priority. Apr 13 2016, 5:06 AM

Pokefan95 triaged this task as "Normal" priority.

@Pokefan95 Are you planning to code the tool? If not, why did you set the priority?

Steinsplitter raised the priority of this task from Medium to Needs Triage. Apr 16 2016, 10:49 AM
MusikAnimal added a subscriber: MusikAnimal.

I don't think Earwig's Copyvios tool was built to search imagery and other media, and neither was CopyPatrol, hence removing the tag.

I was interested in this and decided to tackle at least some of it. I found the Bing Image Search API, which seemed appropriate. However, after speaking with somebody at Microsoft and looking at the usage documentation, I was informed that the API may only be used to fulfil a search request triggered by user interaction. It cannot be used for this purpose unless there were some sort of button that a patroller would have to press, which may defeat the entire point.

So I'm currently looking at Google's Cloud Vision API. I know using commercial APIs isn't the most appropriate; however, for this purpose they are in many ways our best choice. This is what I'm currently looking at:

We'd honestly need a big player on board for something fully effective, but I will keep working on this and see what works.

I looked at this in mid-December but forgot to write my reports here.

Let's say I want to take a look at non-patrolled file uploads.

Pricing: From the samples I got, the cheapest TinEye package ($200) would only cover about a day's worth of uploads. Trying to reduce the number, for example by excluding admins of other wikis, didn't help much. I decided that it was simply too expensive for too little gain if I had to pay out of my own pocket. WMF, OTOH, could buy two enterprise licenses and be covered for a full year (meaning $20K annually), which I don't think would be too expensive. Maybe we could negotiate and get a discount *shrugs*

The other thing I realized was that very few new uploads actually get patrolled:

mysql:research@s4-analytics-replica.eqiad.wmnet [commonswiki]> select rc_patrolled, count(*) from recentchanges where rc_log_type = 'upload' group by rc_patrolled;
| rc_patrolled | count(*) |
|            0 |   218813 |
|            1 |     1043 |
|            2 |   605773 |
3 rows in set (58.944 sec)

i.e. 218K new uploads in the last month are waiting to be reviewed while only 1K got reviewed (and 600K uploads were automatically patrolled because the uploader is trusted). This is quite concerning to me. If you think NPP on enwiki is on the verge of collapse under the load, IMHO new-uploads patrol has already collapsed. Thus I think a service on Toolforge that gives patrollers and admins an overview and helps them do their work in a central place (and in batches, e.g. mark all uploads by this user as patrolled, or delete all uploads by this user) would definitely help. Later TinEye could come in and help even more (i.e. I think we first need infrastructure for patrolling new uploads, then copyvio detection). I wrote some ideas on how to do this but didn't get to finish it :/ I will try to build something soon *fingers crossed*
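As a rough sketch of the batching idea, unpatrolled uploads could be grouped by uploader so that a patroller can act on all of one user's files at once. The row shape `(uploader, title, rc_patrolled)` and the function name are assumptions for illustration; the `rc_patrolled` values follow the query above (0 = unpatrolled, 1 = manually patrolled, 2 = autopatrolled).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical row shape: (uploader, file title, rc_patrolled)
Row = Tuple[str, str, int]


def batch_by_uploader(rows: Iterable[Row]) -> List[Tuple[str, List[str]]]:
    """Group unpatrolled uploads (rc_patrolled == 0) by uploader.

    Returns (uploader, titles) pairs, largest batch first, so the
    users with the most pending files appear at the top of the queue.
    """
    batches: Dict[str, List[str]] = defaultdict(list)
    for uploader, title, patrolled in rows:
        if patrolled == 0:
            batches[uploader].append(title)
    # Sort by descending batch size, then by name for a stable order.
    return sorted(batches.items(), key=lambda item: (-len(item[1]), item[0]))
```

Each returned batch would map to one "mark all as patrolled" or "nominate all for deletion" action in a patrolling interface.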

I agree 20k is not much money for this.

I was going to say this is not a complete picture, since out of the 195k non-patrolled uploads in December 2022:

select rc_patrolled, count(*) from recentchanges where rc_log_type = 'upload' and rc_timestamp > '20221200000000' AND rc_timestamp < '20230101000000' group by rc_patrolled;
| rc_patrolled | count(*) |
|            0 |   195988 |
|            1 |      934 |
|            2 |   544372 |

A number of them have already been deleted or tagged for deletion (so actually reviewed). But only an additional 11k have already been deleted (or renamed, maybe):

select COUNT(*) from recentchanges LEFT JOIN page ON (page_namespace=rc_namespace AND page_title=rc_title) where rc_log_type = 'upload' and rc_timestamp > '20221200000000' AND rc_timestamp < '20230101000000' AND rc_patrolled=0 AND page_id IS NULL;
| COUNT(*) |
|    10935 |

Existing files have been uploaded by 21k different users. Looking at the number of uploads-per-user:

  • 1 user with 3339
  • 1 user with 2881
  • 1 user with 2370
  • 1 user with 2297
  • 1 user with 2044
  • 1 user with 1867
  • 1 user with 1810
  • 1 user with 1625
  • 1 user with 1554
  • 1 user with 1545
  • 1 user with 1381
  • 1 user with 1365
  • 1 user with 1291
  • 1 user with 1154
  • 1 user with 1051
  • 1 user with 1025
  • 1 user with 1020
  • 1 user with 1002
  • 1 user with 931
  • 1 user with 918
  • 1 user with 913
  • 1 user with 858
  • 1 user with 847
  • 1 user with 807
  • 4 users with 700-800
  • 8 users with 600-700
  • 8 users with 500-600
  • 20 users with 400-500
  • 17 users with 300-400
  • 45 users with 200-299
  • 134 users with 100-199
  • 35 users with 90-99
  • 31 users with 80-89
  • 58 users with 70-79
  • 59 users with 60-69
  • 97 users with 50-59
  • 132 users with 40-49
  • 206 users with 30-39
  • 26 users with 20-29
  • 61 users with 19
  • 59 users with 18
  • 59 users with 17
  • 94 users with 16
  • 89 users with 15
  • 89 users with 14
  • 102 users with 13
  • 133 users with 12
  • 164 users with 11
  • 178 users with 10
  • 207 users with 9
  • 262 users with 8
  • 323 users with 7
  • 439 users with 6
  • 619 users with 5
  • 917 users with 4
  • 1551 users with 3
  • 3272 users with 2
  • 11166 users with 1 upload
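A distribution like the one above could be derived from a plain list of uploader names, one entry per upload. This is a sketch only: the bucket boundaries chosen here (exact counts below 20, tens up to 99, hundreds up to 999, then "1000+") are an assumption and do not exactly reproduce the grouping used in the list.

```python
from collections import Counter
from typing import Dict, Iterable


def bucket_label(n: int) -> str:
    """Map an uploads-per-user count to a histogram bucket label."""
    if n < 20:
        return str(n)
    if n < 100:
        low = n // 10 * 10
        return f"{low}-{low + 9}"
    if n < 1000:
        low = n // 100 * 100
        return f"{low}-{low + 99}"
    return "1000+"


def uploads_per_user_histogram(uploaders: Iterable[str]) -> Dict[str, int]:
    """Count how many users fall into each uploads-per-user bucket.

    `uploaders` holds one uploader name per upload.
    """
    per_user = Counter(uploaders)
    return dict(Counter(bucket_label(n) for n in per_user.values()))
```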

I have now started a very basic tool; I hope it helps. If people start using it, I will add more features, and if anyone wants to help, please do!

See this previous grant proposal for some earlier planning I've done on this topic; it may be relevant to any future implementation. The grant proposal page contains a draft extension-based architecture, but an external Toolforge tool would also help, even if it only presents links to the reverse search services (available APIs are certainly quite expensive, especially at scale).

Other than Google, TinEye, Bing and Yandex, Pixsy may also be useful.

Editing; while this was originally proposed in the 2015 CWS and the text was copied from there, I think some clarity in the description wouldn't hurt. Also noting that this came #4 in the 2022 survey.

English Wikipedia screens articles for copyright violations by using a bot (EranBot) to scan new articles; when it detects a violation, it marks the article using the pagetriagetagcopyvio API. This API is part of the PageTriage extension. Then, I believe, a Toolforge tool called CopyPatrol lists the articles marked as copyvios and provides buttons for the user to take further action.

This tech stack could be copied for Commons. Everything would need to be adjusted, but this way you wouldn't need to start from scratch.