Page MenuHomePhabricator

Disable serving unpatrolled new files to Wikipedia Zero users
Closed, DeclinedPublic

Description

Due to sustained Wikipedia Zero abuse T129845 we should disallow WP0 users from receiving files until they can be reviewed for copyright.


Implementation plan

Core functionality:

  • Make File::getContentHeaders() look up patrol status of the page (probably with some kind of hook so that FlaggedRevs wikis with patrolling disabled can plug in review status instead) and add an X-MediaWiki-Patrol-Status header accordingly.
    • It doesn't seem like File::getContentHeaders() is acutally used anywhere; LocalFile::upload() uses MediaHandler::getContentHeaders() instead (and we don't want to depend on that as this should work for files which have no handler). That's probably a bug and should be fixed.
    • On copy/move, seems like SwiftFileBackend::doXXXInternal() takes care of persisting the headers.
    • It seems like all header options go through SwiftFileBackend::sanitizeHdrs() which ends up removing almost everything. So that should be fixed to keep X-MediaWiki- headers.
  • Make Varnish split the cache on zero status and return an error response with short-lived cache if the client is zero-rated and the backend response has X-MediaWiki-Patrol-Status: unpatrolled.
  • In RecentChange::doMarkPatrolled() (eww, but seems the least bad location) remove the Swift header with something like FileBacken::doQuickOperations( [ [ 'op' => describe', 'src' => $file->getPath(), 'headers' => [ 'X-MediaWiki-Patrol-Status' => '' ] ] ] ) (while making sure not to re-trigger T178849) and also call WikiFilePage::doPurge().

Transcodes:

Most of the piracy is video-related and using transcodes would be a trivial workaround so TimedMediaHandler needs to copy the header to those (and hook into the purging of the header to remove them).

FlaggedRevs wikis:

(Some) FlaggedRevs wikis use a separate review mechanism for images alongside or instead of patrolling. (At least huwiki has review for files enabled, patrolling disabled, and is an occasional WP0 piracy target; so while less important, this hole should be plugged too eventually.) To support them, a new hook should be added to File::getContentHeaders() so they can set the patrolled flag; and the purge mechanism should be made available so that FlaggedRevs wikis can call it when a file is reviewed.

Thumbnails:

This is less important as images are mostly used by the pirates as containers for hidden files, and it's unlikely those would survive thumbnailing. (I.e. we probably don't need to bother.)

  • On thumbnailing, File::generateAndSaveThumb() calls FileRepo::quickImport() and does not pass any headers except Content-Disposition, and as far as I can see there is no other magic to copy them over from the original, so we'd need to re-inject the headers there. Probably the easiest way is to fetch the headers of the originals from Swift and copy them (avoids special-casing of foreign DB images etc).
  • Also update Thumbor accordingly.
  • Maybe also make thumb.php add the header? Not sure if that's actually useful - for sites using Swift, like Wikimedia, the thumbnail will always be streamed from Swift I think, so it's enough to have them there. For sites using file storage, exposed via the web server, there is no way to set custom headers. For the default MediaWiki config (everything goes through thumb.php), this would make the headers more consistent, but it's unlikely that anyone using such a low-performance config would need to filter on patrol status. Same goes for img_auth.php, it's unlikely that private wikis would ever need this feature.

Patches are under the wp0 topic. Testing plan:

  • enable swift and varnish roles in vagrant
  • install all the patches; instead of the mediawiki-config patch use something like
foreach ( $wgFileBackends as &$backend ) {
    if ( $backend['name'] === 'swift-backend' ) {
        $backend['headerWhitelist'] = [ '/^x-mediawiki-/' ];
    }
}
  • the testing patch for vagrant uses the browser language as a proxy for Zero status so add Arabic to the list of accepted languages to simulate coming from a Zero domain.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Please also keep in mind that blocking all unpatrolled files for all anons would have a lot of collateral impact: articles with recently-added images will frequently look broken, and the editor won't notice because they are logged in. Whatever measure we decide on, I'd prefer we make it as targeted as we reasonably can, both in terms of WP0 vs other anons, and in terms of video abuse vs legitimate images.

Unless Commons have a lot of new file patrollers who really mark new files as patrolled when they are finished reviewing (which I see is not the case), I wouldn't support BBlack's counter-proposal.

Unless Commons have a lot of new file patrollers who really mark new files as patrolled when they are finished reviewing (which I see is not the case), I wouldn't support Bawolff's counter-proposal.

To clarify, what do you think my counter proposal was? I dont think i support anything broader than the original proposal.

To clarify, I would support sending 403s for original file assets (not thumbnails) that exceed a certain size and associated video transcodes for zero-rated connections until the file is reviewed. Possibly with some sort of timeout where after a week if the file is not reviewed it becomes public anyways in case any get missed.

Unless Commons have a lot of new file patrollers who really mark new files as patrolled when they are finished reviewing (which I see is not the case), I wouldn't support Bawolff's counter-proposal.

To clarify, what do you think my counter proposal was? I dont think i support anything broader than the original proposal.

Ow, I meant BBlack's, I apologize. Your and BBlack's profile pic are similar, that's why I thought it is yours.

To clarify, I would support sending 403s for original file assets (not thumbnails) that exceed a certain size and associated video transcodes for zero-rated connections until the file is reviewed. Possibly with some sort of timeout where after a week if the file is not reviewed it becomes public anyways in case any get missed.

I strongly support your suggestion.

Hello ! I have a large size image (12.3MB) of a high quality and i want to upload it here to show you some information about organized groups on Facebook.
Can you help me please ?
And what should I do ?

@Younes19956 Please stop spamming different tasks. You can upload files on external providers such https://imageshack.us/, and post the link to the file here. I hope your picture will have an interest related to current posts.

@Younes19956 Please stop spamming different tasks. You can upload files on external providers such https://imageshack.us/, and post the link to the file here. I hope your picture will have an interest related to current posts.

i am sorry but this is my work to upload the news interest related to current posts with this picture.

[...]
To clarify, I would support sending 403s for original file assets (not thumbnails) that exceed a certain size and associated video transcodes for zero-rated connections until the file is reviewed. Possibly with some sort of timeout where after a week if the file is not reviewed it becomes public anyways in case any get missed.

Sounds reasonable.

Perhaps a default text/picture like "This file can't be shown for now" can be better than a 403.

However, this carries the caveat that there's no explicit realtime update to the patrol status of a cached file. The first time any user fetches the file through a given cache (regardless of Zero status), the file will be loaded into cache and kept, and the code above simply masks the response to the user when appropriate. Therefore, we'd also need to trigger a cache invalidation on the file when the patrolled status changes, just like we would on content replacement. Otherwise the initial unpatrolled state will persist too long in the cache. We could perhaps hack around that a bit by giving unpatrolled files a shorter TTL cap (e.g. 10 minutes instead of the usual 24h), but invalidate-on-patrol would be much cleaner and saner.

However log in cookies arent set on upload.wikimedia.org - so upload wouldnt know if someone is logged in afaik.

If we truly need actual user based permissions we would probably have to make unpatrolled uploads go through img_auth.php on the commons.wikimedia.org (or send some sort of token in the GET parameter.). however i do think blocking just zero is sufficient for this problem. Nonzero users have easier ways to get movies.

Indeed img_auth.php in the varnish -> swift path is doable but only if we really need as that's a non-trivial amount of resources in terms of req/s and bandwidth.

re: the file header I'm assuming it'd be set for both the original and carried over to thumbnails, and purged from all once the file is patroled

re: the file header I'm assuming it'd be set for both the original and carried over to thumbnails, and purged from all once the file is patroled

The issue is people uploading hollywood movies. In the case of embedding in images, while I imagine there are ways to embed things in images such that they are preserved during the thumbnailing process, it seems unlikely that these people will do that. And for actual movies (.webm) the thumbnail is just a screenshot. So for this use case blocking the thumbs probably aren't needed. We would of course have to block the scaled down transcodes though.

Unless Commons have a lot of new file patrollers who really mark new files as patrolled when they are finished reviewing (which I see is not the case), I wouldn't support BBlack's counter-proposal.

I don't even see the Mark file as patrolled link…

Unless Commons have a lot of new file patrollers who really mark new files as patrolled when they are finished reviewing (which I see is not the case), I wouldn't support BBlack's counter-proposal.

I don't even see the Mark file as patrolled link…

The link can be found at the top of the file history section.

A media block was implemented into Phabricator to combat T168142 WP0 abuse. Since then, we've discovered T174342 that the WP0 configuration is inconsistent with issued IP ranges.

Since this seems to be impossible. Would adding interstitial for Facebook referrers be doable? I imagine that we'd want to discourage people from thinking our content comes from Facebook.

Preventing WP0 users from doing file patrolling seems like an acceptable level of collateral damage (maybe with a wiki whitelist so that wikis which get their traffic predominantly from WP0 can still use local file patrolling - although I don't think there are any such wikis that allow unprivileged local uploads).

T133821: Make CDN purges reliable is already a major PITA and this would make it a lot worse. Combining content purges with a short (say <1h) TTL for the 403 responses should be a reasonable compromise, I think.

So to summarize / spec out the code parts a bit:

Core functionality:

  • Make File::getContentHeaders() look up patrol status of the page (probably with some kind of hook so that FlaggedRevs wikis with patrolling disabled can plug in review status instead) and add an X-MediaWiki-Patrol-Status header accordingly. (Should it also reduce cache TTL? Can cache headers be put into Swift?)
    • It doesn't seem like File::getContentHeaders() is acutally used anywhere; LocalFile::upload() uses MediaHandler::getContentHeaders() instead (and we don't want to depend on that as this should work for files which have no handler). That's probably a bug and should be fixed.
    • On copy/move, seems like SwiftFileBackend::doXXXInternal() takes care of persisting the headers.
    • It seems like all header options go through SwiftFileBackend::sanitizeHdrs() which ends up removing almost everything. So that should probably be fixed to keep X-MediaWiki- headers? (Or the header should be called X-Content-Patrol-Status which is has a whitelisted prefix.)
  • Split the cache on zero status (in Varnish config? via a Vary header? I guess the latter makes more sense as we don't need a cache split for patrolled files, but do we want Vary headers to live in Swift?).
  • Make Varnish return an error response with short-lived cache if the client is zero-rated and the backend response has X-MediaWiki-Patrol-Status: unpatrolled.
  • In RecentChange::doMarkPatrolled() (eww, but seems the least bad location) remove the Swift header with something like FileBacken::doQuickOperations( [ [ 'op' => describe', 'src' => $file->getPath(), 'headers' => [ 'X-MediaWiki-Patrol-Status' => '' ] ] ] ) (while making sure not to re-trigger T178849) and also call WikiFilePage::doPurge().

FlaggedRevs wikis:

(Some) FlaggedRevs wikis use a separate review mechanism for images alongside or instead of patrolling. (At least huwiki has review for files enabled, patrolling disabled, and is an occasional WP0 piracy target; so while less important, this hole should be plugged too eventually.) To support them, a new hook should be added to File::getContentHeaders() so they can set the patrolled flag; and the purge mechanism should be made available so that FlaggedRevs wikis can call it when a file is reviewed.

Thumbnails:

This is less important as images are mostly used by the pirates as containers for hidden files, and it's unlikely those would survive thumbnailing. (I.e. we probably don't need to bother.)

  • On thumbnailing, File::generateAndSaveThumb() calls FileRepo::quickImport() and does not pass any headers except Content-Disposition, and as far as I can see there is no other magic to copy them over from the original, so we'd need to re-inject the headers there. Given that patrol flags are in the RC table, the headers would have to fetched from Swift (also avoids special-casing of foreign DB images); or we could just assume that any file that lived long enough to age out of RC is effectively patrolled.
  • Maybe also make thumb.php add the header? Not sure if that's actually useful - for sites using Swift, like Wikimedia, the thumbnail will always be streamed from Swift I think, so it's enough to have them there. For sites using file storage, exposed via the web server, there is no way to set custom headers. For the default MediaWiki config (everything goes through thumb.php), this would make the headers more consistent, but it's unlikely that anyone using such a low-performance config would need to filter on patrol status. Same goes for img_auth.php, it's unlikely that private wikis would ever need this feature.

@Bawolff @BBlack does that sound about right?

So yes, this sounds sane to me (With the caveat, I haven't looked at the multimedia code in a while). Some comments:

It seems like all header options go through SwiftFileBackend::sanitizeHdrs() which ends up removing almost everything. So that should probably be fixed to keep X-MediaWiki- headers? (Or the header should be called X-Content-Patrol-Status which is has a whitelisted prefix.)

I would lean towards whitelisting X-MediaWiki here.

Split the cache on zero status (in Varnish config? via a Vary header? I guess the latter makes more sense as we don't need a cache split for patrolled files, but do we want Vary headers to live in Swift?).

My initial reaction would be that varnish config would be better (Try and make it that files with the header have split cache but other files don't). But my knowledge of varnish is very limited

Make Varnish return an error response with short-lived cache if the client is zero-rated and the backend response has X-MediaWiki-Patrol-Status: unpatrolled.

403 would be the most obvious thing to do here, but I'm not sure if 403's get cached at all.

Maybe also make thumb.php add the header? Not sure if that's actually useful - for sites using Swift, like Wikimedia, the thumbnail will always be streamed from Swift I think, so it's enough to have them there

Its been a long time since I looked at it, but if I remember right - on the first hit, where there is no thumbnail in swift, varnish passes directly to thumb.php, so you would need the headers in thumb.php too

Although I agree thumbs seem like they could be ignored, as they probably lose most of the hidden secret embedded files.


Also to consider, is the transcodes. If the person uploads a webm/ogv file. Unlike thumbs, these are probably actually useful to pirates

Apparently Swift has a whitelist for allowed headers and the configuration for that is somewhat broken :/

Change 402470 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/vagrant@master] Whitelist X-MediaWiki-Patrol-Status header in Swift

https://gerrit.wikimedia.org/r/402470

Change 402471 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] Whitelist X-MediaWiki-Patrol-Status header in Swift

https://gerrit.wikimedia.org/r/402471

Change 402472 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] Add X-MediaWiki-Patrol-Status to file content headers

https://gerrit.wikimedia.org/r/402472

Change 402474 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] Pass through X-MediaWiki- headers in SwiftFileBackend

https://gerrit.wikimedia.org/r/402474

Change 402475 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] Remove X-MW-Patrol-Status header when file is patrolled/purged

https://gerrit.wikimedia.org/r/402475

Change 402476 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] Add hooks for modifying file content headers

https://gerrit.wikimedia.org/r/402476

Change 402571 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/vagrant@master] [TEST] Reject requests based on patrol status

https://gerrit.wikimedia.org/r/402571

Change 402572 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] Handle X-MediaWiki-Patrol-Status header in Varnish

https://gerrit.wikimedia.org/r/402572

The above patches take care of the basic case (just the original file, assuming core patrolling feature) on MediaWiki, Swift and Varnish. Thumbnails, transcodes and FlaggedRevs TBD, but this should be enough to cover most of the attack surface.

Given that the vast majority of the abuse was with video files, transcode support seems like a must have. As it stands, the workaround is very easy for any original that gets blocked with this new feature: just watch a transcode listed on the file page.

Thumbnails are too much work for what they're worth, imho, unless there is demonstrated abuse leveraging thumbnails. If people are trying to watch videos for free, thumbnails won't help them.

Given that the vast majority of the abuse was with video files, transcode support seems like a must have. As it stands, the workaround is very easy for any original that gets blocked with this new feature: just watch a transcode listed on the file page.

Yeah, there were a bunch of APKs hidden in images early on but now it's pretty much 100% audio/video.

Thumbor is not involved with transcodes, right?

Thumbnails are too much work for what they're worth, imho, unless there is demonstrated abuse leveraging thumbnails. If people are trying to watch videos for free, thumbnails won't help them.

You can use steganography to put arbitrary data in a large image, so I'd rather not leave that hole open unless it turns out particularly hard to plug.

Thumbor isn't involved with transcodes, only thumbnails. Taking care of transcode logic for this should all happen within Mediawiki, afaik.

As for thumbnails, it's not that it's hard to implement, but you'll create a lot of extra purge traffic and cache invalidation for a threat that is currently only theoretical. Consult the Traffic team about this, imho it's not a worthy tradeoff until the loophole is really exploited.

My advice would be to implement it, but not enable it until we really need to. Purge traffic is already a problem at the moment without this, it's something we're trying to avoid increasing if we can.

My advice would be to implement it, but not enable it until we really need to.

Makes sense, I'll do that.

Change 402471 merged by Filippo Giunchedi:
[operations/puppet@production] Whitelist X-MediaWiki-Patrol-Status header in Swift

https://gerrit.wikimedia.org/r/402471

Mentioned in SAL (#wikimedia-operations) [2018-01-16T10:59:44Z] <godog> roll-restart swift object server - T167400

LGoto subscribed.

Hi @Tgr as per grooming, moving this to backlog for now. If you're ready to start work, feel free to move to kanban.

Aklapper lowered the priority of this task from Medium to Low.Nov 7 2018, 8:27 AM
MaxSem subscribed.

No more Zero.

Change 402571 abandoned by Gergő Tisza:
[TEST] Reject requests based on patrol status

https://gerrit.wikimedia.org/r/402571

Change 402470 abandoned by Gergő Tisza:
Whitelist X-MediaWiki-Patrol-Status header in Swift

https://gerrit.wikimedia.org/r/402470

Change 402572 abandoned by Gergő Tisza:
Handle X-MediaWiki-Patrol-Status header in Varnish

Reason:
WP0 has been sunset

https://gerrit.wikimedia.org/r/402572

Change 402472 abandoned by Gergő Tisza:
Add X-MediaWiki-Patrol-Status to file content headers

Reason:
WP0 has been sunset. I think there is still value in a file patrolling feature but I won't have time to work on it; feel free to reopen if you think it's useful.

https://gerrit.wikimedia.org/r/402472

Change 402475 abandoned by Gergő Tisza:
Remove X-MW-Patrol-Status header when file is patrolled/purged

Reason:
WP0 has been sunset. I think there is still value in a file patrolling feature but I won't have time to work on it; feel free to restore if you think it's useful.

https://gerrit.wikimedia.org/r/402475

Change 402474 abandoned by Gergő Tisza:
Pass through X-MediaWiki- headers in SwiftFileBackend

Reason:
WP0 has been sunset. I think there is still value in a file patrolling feature but I won't have time to work on it; feel free to restore if you think it's useful.

https://gerrit.wikimedia.org/r/402474

Change 402476 abandoned by Gergő Tisza:
Add hooks for modifying file content headers

Reason:
WP0 has been sunset. I think there is still value in a file patrolling feature but I won't have time to work on it; feel free to reopen if you think it's useful.

https://gerrit.wikimedia.org/r/402476

Change 482262 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] Revert "Whitelist X-MediaWiki-Patrol-Status header in Swift"

https://gerrit.wikimedia.org/r/482262

Change 482262 merged by Filippo Giunchedi:
[operations/puppet@production] Revert "Whitelist X-MediaWiki-Patrol-Status header in Swift"

https://gerrit.wikimedia.org/r/482262