
Prevent Wikipedia Zero exploitation of uploads to share copyrighted media
Closed, Duplicate · Public

Description

This is intended to be a tracking bug for finding ways to prevent uploads of copyrighted files, and of obfuscated chunks of copyrighted files, through (but potentially not only) Wikipedia Zero.

Event Timeline

Vituzzu created this task. Mar 23 2016, 7:18 PM
Restricted Application added subscribers: Matanya, Aklapper. · View Herald Transcript. Mar 23 2016, 7:18 PM

I think a first step could be creating a series of AbuseFilter variables:
*some flag for edits made through Zero
*upload dimensions (AFAIR there isn't one)
*number of pages for document uploads

Restricted Application added a project: Commons. · View Herald TranscriptMar 23 2016, 8:13 PM
Restricted Application added a subscriber: Steinsplitter. · View Herald Transcript
jayvdb added a subscriber: jayvdb. Mar 24 2016, 4:50 AM

I think a first step could be creating a series of AbuseFilter variables:
*some flag for edits made through Zero

This would expose a PII component. We should at least restrict it to uploads, and advise the uploader that the upload will be flagged accordingly.

Why not allow only geocoded uploads to Commons via Zero, with the uploader aware that they are publishing geo info? That would mean the majority of good Zero uploads are permitted, and would make it technically difficult to upload offending material.

I don't think we could ensure that the user was presented with a notice about publishing geo info, though, e.g. if they're using some app that just uses the action=upload API.

This would expose a PII component. We should at least restrict it to uploads, and advise the uploader that the upload will be flagged accordingly.

Currently Tor nodes are marked, though they are not allowed to edit. Anyway, I see your point!

Why not allow only geocoded uploads to Commons via Zero, with the uploader aware that they are publishing geo info? That would mean the majority of good Zero uploads are permitted, and would make it technically difficult to upload offending material.

This could prevent legitimate, useful uploads. It would also not be so hard to circumvent.

IMHO we shouldn't look for a way to *prevent* those uploads; instead, we should be able to flag them with a good level of confidence.

agray added a subscriber: agray. Mar 24 2016, 6:44 PM

Focusing specifically on WP0 uploaders doesn't seem to be the most effective approach here: there'd be nothing stopping a small number of non-WP0 users seeding this content onto Commons for anyone else to retrieve. (Likewise, there's no particular reason the downloaders have to be on WP0.)

As @Vituzzu says, look at the files, not the upload mechanism.

I think a first step could be creating a series of AbuseFilter variables:
*some flag for edits made through Zero
*upload dimensions (AFAIR there isn't one)
*number of pages for document uploads

What exactly is the filter you had in mind? It's pretty trivial to embed files in other files, and have the top-level file be whatever dimensions seem appropriate.

Although it might make sense if all these people aren't all that technically sophisticated and are all using the same program to create embedded files.

Could someone link to some example files? Perhaps there are commonalities that could be identified by looking closer at the file.

One way of identifying such files is to convert the file to some other format and then back, and look at the difference in compression ratio. For common methods of embedding files, files with embedded payloads will shrink considerably. @Dispenser used to have a bot that did this, I believe. (This approach won't work for people who are super sneaky, though, for example by encoding stuff in the low-order bits of the image data, etc.)
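The roundtrip idea can be sketched without any image library: parse the file's legitimate structure, re-serialize it, and compare sizes, since any appended payload simply disappears. A minimal stdlib-only illustration for PNG (the function name is made up for this sketch; real tooling would need a parser per format):

```python
import struct

PNG_SIG = b"\x89PNG\r\n\x1a\n"

def strip_after_iend(data: bytes) -> bytes:
    """Re-emit a PNG's chunk stream up to and including IEND.

    Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    Anything concatenated after IEND (a common smuggling trick) is dropped,
    so a stuffed file shrinks back to its real size on re-serialization.
    """
    if not data.startswith(PNG_SIG):
        raise ValueError("not a PNG")
    out = bytearray(PNG_SIG)
    pos = len(PNG_SIG)
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        out += data[pos:pos + 12 + length]  # length + type + data + CRC
        pos += 12 + length
        if ctype == b"IEND":
            break
    return bytes(out)
```

A large gap between `len(data)` and `len(strip_after_iend(data))` would then be the signal to flag; JPEG, Ogg, WebM, etc. would each need their own equivalent parser.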

Could someone link to some example files? Perhaps there are commonalities that could be identified by looking closer at the file.

https://commons.wikimedia.org/wiki/User:Teles/Angola_Facebook_Case <-- here you can find some interesting stuff; one example is https://commons.wikimedia.org/wiki/Special:Undelete/File:Chris_Brown_X.ogg, which shows another potentially interesting pair of variables: height and width.

One way of identifying such files is to convert the file to some other format and then back, and look at the difference in compression ratio. For common methods of embedding files, files with embedded payloads will shrink considerably. @Dispenser used to have a bot that did this, I believe. (This approach won't work for people who are super sneaky, though, for example by encoding stuff in the low-order bits of the image data, etc.)

Not a bot, but a tool. There are some details at https://commons.wikimedia.org/wiki/User:Dispenser/Absurd_overhead - it uses some non-free software though, so it cannot be run on Tool Labs (but let's not tangent into that issue here!)

Are there any other ways to determine that the file extension doesn't match the file's contents? Headers or something?

(Duplicate of T129845? Or just otherwise related?)
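One simple header check of the kind asked about above is to compare a file's leading "magic" bytes against what its extension claims. A minimal sketch (the mapping below is illustrative and far from exhaustive, and MediaWiki's own MIME detection is much more thorough):

```python
import os

# Illustrative extension -> magic-bytes table (not exhaustive).
MAGIC = {
    ".png":  b"\x89PNG\r\n\x1a\n",
    ".jpg":  b"\xff\xd8\xff",
    ".gif":  b"GIF8",
    ".pdf":  b"%PDF-",
    ".ogg":  b"OggS",
    ".webm": b"\x1a\x45\xdf\xa3",  # EBML header shared by Matroska/WebM
}

def extension_matches_content(filename: str, head: bytes) -> bool:
    """Return False when the leading bytes contradict the extension."""
    expected = MAGIC.get(os.path.splitext(filename)[1].lower())
    if expected is None:
        return True  # extension not in the table: nothing to check
    return head.startswith(expected)
```

For example, `extension_matches_content("song.ogg", head)` is False when `head` starts with a ZIP signature (`PK\x03\x04`) instead of `OggS`. This catches naive renaming, but not a valid carrier file with a payload hidden inside it.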

Gunnex added a subscriber: Gunnex. Mar 26 2016, 2:48 PM
Teles added a subscriber: Teles. Mar 26 2016, 6:39 PM
Dispenser added a comment. Edited Mar 26 2016, 8:48 PM

Background for the absurd-overhead script is in the enwiki and Commons discussions. Most users are simply concatenating files. The correct solution is format validation.
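Since simple concatenation leaves the payload after the carrier format's end-of-stream marker, even a crude trailing-bytes check catches it. A sketch (JPEG's end marker is used as the example; each format needs its own marker or, better, a full parser):

```python
def trailing_bytes(data: bytes, end_marker: bytes) -> int:
    """Count bytes after the last occurrence of a format's end-of-stream
    marker (FF D9 for JPEG). Concatenated payloads show up as a large
    trailing count; a well-formed file has little or none."""
    idx = data.rfind(end_marker)
    if idx == -1:
        return len(data)  # marker missing entirely: malformed file
    return len(data) - (idx + len(end_marker))
```

On a toy input of the form `jpeg_bytes + payload`, the function returns `len(payload)`, which a validation step could compare against a small threshold before accepting the upload.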

Could someone link to some example files? Perhaps there are commonalities that could be identified by looking closer at the file.

I've seen uploads of lots of non-free films and music on Commons. Would it be enough to block film and music uploads through Wikipedia Zero? Or would it be an option to restrict film and music uploads to members of a specific user group? Most files on Commons are pictures, so I don't think that a restriction on film & music uploads would affect a lot of users.

Film and music uploads show up at https://commons.wikimedia.org/wiki/User:OgreBot/Notable_uploads and seem to be tracked, but I'm not sure to what extent other projects are monitored.

In T130761#2149875, @Bawolff wrote on Thu, Mar 24, 10:20 PM:
Could someone link to some example files? Perhaps there are commonalities that could be identified by looking closer at the file.

See my comment at T129845#2153512.

Example: File:Story of the White Coat Upload By MHN.webm (uploaded 11:05 am, 27.03.2016)* and shared on Facebook via https://www.facebook.com/groups/243122426032648/permalink/247537522257805/ (01:10 pm, 27.03.2016).

For recent files, see e.g. uploads by this sockfarm Category:Sockpuppets of Nayon061215, like (example) uploads by Jhpabel2, probably all shared on Facebook.

(* In some cases I overwrite related uploads with a small, nonsense file, to prevent further sharing.)

Vituzzu added a comment. Edited Mar 28 2016, 9:31 AM

See my comment at T129845#2153512.
Example: File:Story of the White Coat Upload By MHN.webm (uploaded 11:05 am, 27.03.2016)* and shared on Facebook via https://www.facebook.com/groups/243122426032648/permalink/247537522257805/ (01:10 pm, 27.03.2016).
For recent files, see e.g. uploads by this sockfarm Category:Sockpuppets of Nayon061215, like (example) uploads by Jhpabel2, probably all shared on Facebook.

(* In some cases I overwrite related uploads with a small, nonsense file, to prevent further sharing.)

An evil option would be adding some malware (I'm kidding).
Seriously, your idea is good since it disrupts the file-sharing system.

Maybe we could restrict video uploads to users with at least 10 edits, but this should be discussed on Commons first.

Anyway, format validation and an extension of AbuseFilter's variable set are definitely needed.

Don't forget that lots of "first-world" corporate/government/education networks are filtered but don't block our sites, so in the future we can expect to see similar abuse flowing from potentially any kind of connection. So this is definitely not a Zero-only problem, and any solution shouldn't rely on the assumption that abusers are using Zero.

He7d3r added a subscriber: He7d3r. Apr 2 2016, 3:09 PM
matmarex closed this task as a duplicate of Restricted Task. Apr 2 2016, 9:47 PM

I've merged this into T129845, as that was the earlier task and that is apparently linked in some discussions already. I'll summarize the discussion that happened here on that task.