CL support for Wikipedia Zero piracy problems
Closed, ResolvedPublic

Description

Some background:
https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals/Archive/2017/06#Restrict_Video_Uploading

  1. What is the problem? Wikipedia Zero users are using our projects for piracy. Attempts to mitigate the problem have so far had mixed results, but mostly the problem is still a problem.
  2. What does success of this task look like? How do we know when we are done? Success is either Wikipedia Zero users no longer using the projects for widespread piracy or administrators being happy with their ability to monitor and address such abuse.
  3. Is there any goal, program, project, team related with this request? Please provide links and use the corresponding Phabricator tags when available. This involves several teams and projects. Ops will be implementing some debugging tools to figure out why deleted files aren't getting purged (T109331, Z567). The MediaWiki core team will be looking at the deletion code in MediaWiki and possibly implementing a manual file purge interface for administrators. The Wikipedia Zero team will be working with partners to throttle downloading (if possible).
  4. What is your expected timeline from start to end? Is there a hard deadline? Unknown, probably no hard deadline.

Event Timeline

Elitre subscribed.

Keegan, questions, thoughts, ...?

Making sure I have this right:

What does success of this task look like? How do we know when we are done? Success is either Wikipedia Zero users no longer using the projects for widespread piracy or administrators being happy with their ability to monitor and address such abuse.

So the end goals are:

  1. Reduce or remove the administrative burden on Commons
  2. Reduce or remove WP0 ability to upload copyvios
  3. Reduce or remove access to infringements

It looks to me like goals 1 and 2 have been temporarily accomplished by a filter on Commons, but filters are expensive (among other reasons we'd like to make this more robust and efficient), and goal 3 is a work in progress in dealing with cache issues that have both back-end and front-end solutions.

@kaldari Is this about right for where we are at the moment?

Someone hit English Wikinews the other day as well (https://en.wikinews.org/wiki/Special:Contributions/Edman_Musik3). According to @Dispenser, the files were deleted within ten minutes but still accessed ~100,000 times.

This is clearly spreading beyond Commons, so we'll keep that in mind in the communications plan.

Other major affected wikis in the past: testwiki, test2wiki, hewiki, huwiki

Yes, thank you. I'm now subscribed to T129845 (restricted), which contains all the useful links.

Keegan triaged this task as Medium priority. Jul 28 2017, 7:13 PM

@Keegan: Just wanted to say I'm glad you're on this :)

Someone hit English Wikinews the other day as well (https://en.wikinews.org/wiki/Special:Contributions/Edman_Musik3). According to @Dispenser, the files were deleted within ten minutes but still accessed ~100,000 times.

See here for the detailed numbers from that day (non-public). It's important though that they don't correspond 1:1 to e.g. video views or audio file listens. E.g. they are not filtered for successful requests (like the Mediacounts dataset is), i.e. a large part might have been requests for already unavailable files. What's more (but this is anecdotal so far), it appears that the Mediacounts data itself is (even beyond the already documented corner cases) often too high, in that a single viewing of a longer video by a single user can apparently generate 100 or more requests in some circumstances. Thus the request number is included in these internal query results for detection purposes only, not as an estimate of actual usage.

This is clearly spreading beyond Commons, so we'll keep that in mind in the communications plan.

+1, and the Hungarian and Hebrew Wikipedias have already been a frequent target for about a month or so. At some point it may become worth looping in the SWMT too.

Thank you for all of this, Tilman. Dispenser gave me corrected numbers on Friday that I meant to update here, but I forgot :)

Yes, absolutely will include them and other related global functionaries in communications once there's a plan-of-sorts to discuss.

@kaldari It seems we have goal #4 in T167400: Disable serving unpatrolled new files to Wikipedia Zero users, and it is an important one to the Commons community for ending the piracy issue. The uploads of IP-protected media are not necessarily coming from WP0, but large, organized groups of WP0 users are downloading said material. We can play whack-a-mole all day with uploaders, but it is believed the situation will not go away until we remove access to downloading the files. Is this something we're going to look into?

@Keegan: Good question for @DFoy or maybe @MarkTraceur. Thoughts on whether T167400 is worth looking at more deeply?

@Keegan @kaldari

I agree, and mentioned this as one of the options we would explore in my initial post to the Commons village pump. We're initially focusing on a couple of the other suggestions made there right now, but this is the next to investigate.

@kaldari @DFoy @Tbayer

Had a brief chat with Tilman about all of our various plans at Wikimania. Should we all have a meeting and get on the same page, and maybe have some sort of working timeframe?

Qgil subscribed.

Back to the Community-Relations-Support backlog. Is this support request still active? Is it expected to be continued in Community-Relations-Support (Oct-Dec 2017)?

I'm not sure. @Keegan: Can you confirm whether or not the caching problem is still a problem? Ops says the job queue should no longer be backed up, but there is disagreement about whether or not the job queue congestion was likely to have affected deletion requests.

Investigating.

@kaldari There were nine files uploaded today. Eight have been deleted, and none of those eight needed purging. We'll see what happens with the ninth file. Seems like the purge is working so far.

@kaldari, do you happen to know when the job queue backlog was resolved? Per https://phabricator.wikimedia.org/Z591#12253 (access required), the purge problem was still happening yesterday (October 3).

BTW, for cross-reference, I assume these are among the relevant tasks: T173710: Job queue is increasing non-stop, T133821: Make CDN purges reliable

@Tbayer: It looks like the job queue has been back to reasonable levels since the beginning of September.

10 out of the 37 files deleted in the past 24 hours are still accessible (with the script I ran for @Keegan yesterday)

@Dispenser: Thanks for the update. I've pinged Ops to let them know and see where we can go from here.

Today's list of deleted, but not yet purged files: Z591#12288.

@kaldari - I'm not sure what "reasonable levels" is, but T173710#3646384 was showing commons queue backlogs in the low millions as recently as a week ago, and there still seem to be unresolved questions there about how to address the overall event rate. I've re-run those queries myself just now:

bblack@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group1.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
commonswiki:  refreshLinks: 1628957 queued; 1762 claimed (4 active, 1758 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 1482335 queued; 722 claimed (1 active, 721 abandoned); 0 delayed
[...? I stopped waiting here]

Today's list of deleted, but not yet purged files related to WP0 abuse: Z591#12300.

Today's list of deleted, but not yet purged files related to WP0 abuse: Z591#12330.

@Jdx: Thanks! I've informed Ops that the problem has not been resolved.

@kaldari: @Dispenser says:

Just run https://phabricator.wikimedia.org/P5972 on production or ToolForge, feed the output into the purge script, hourly.

If we could automate that, it would be good.

Could someone elaborate on what 'purge script' we're talking about? Are we talking about eraseArchivedFile.php or something else? Who is running this currently?

Could someone elaborate on what 'purge script' we're talking about? Are we talking about eraseArchivedFile.php or something else?

https://www.mediawiki.org/wiki/Manual:PurgeList.php

Who is running this currently?

MaxSem did this a while ago. Idk if anyone is still doing this :/

So, I noticed again today and figured I should bring it up here: it seems highly fishy that most of the files that end up on the Files to purge lists in the WP0 Reporter's room have parentheses in their titles, either literally or as the %-encoded %28...%29 pair. Has anyone deeply investigated the angle that there's an encoding problem here? E.g. that the actual URL of the file on upload and the URL being PURGEd differ in parentheses-encoding details in some way, or that there's some fault that causes PURGE URL parentheses to be double-encoded, etc.?

@BBlack: It's an interesting theory. As far as I know, no one has investigated the file name encoding through the purging workflow. Would it be possible to look at the purge requests on the CDN side to see what they look like? @aaron, would it be possible to do some logging/debugging on the MediaWiki side as well to see what is being sent?

As a test, I tried uploading two files, one with parentheses and one without, and then deleting them both. I then tried reloading the raw images and thumbnails. In both cases, however, the raw images and thumbnails were also gone and returned 404s as they should. Of course, I was only checking against a single server's caches, so it was a very limited test.

Yes, it's an interesting theory, but please note that the reports in that channel are not listing all new files or deleted files in general, but those likely to be WP0 piracy uploads. And the uploaders of these problematic files have adopted a naming scheme that includes parentheses.

I think the hints about parentheses are pointing in a useful direction, but I think I was thinking about the repercussions incorrectly above. It's not a question of an encoding failure of some kind in the purging pipeline. It's that the fetching pipeline (as in User->Varnish->Swift->(MW|Thumbor|etc?)) accepts multiple possible encodings of a file's URI, without consistent normalization (or rejection) across layers, while the PURGEs are only sent for whatever is considered the canonical encoding of the filename. And I think the pirate uploaders have figured this out and are exploiting it.

For example, the following 4x URLs are all alternate encodings of the same underlying file just by messing with literal vs percent-encoded parentheses on the 4 near the end of the filename:

https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Porto%2C_vista_da_S%C3%A9_do_Porto_%284%29.jpg/800px-Porto%2C_vista_da_S%C3%A9_do_Porto_(4).jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Porto%2C_vista_da_S%C3%A9_do_Porto_%284%29.jpg/800px-Porto%2C_vista_da_S%C3%A9_do_Porto_%284%29.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Porto%2C_vista_da_S%C3%A9_do_Porto_%284%29.jpg/800px-Porto%2C_vista_da_S%C3%A9_do_Porto_%284).jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Porto%2C_vista_da_S%C3%A9_do_Porto_%284%29.jpg/800px-Porto%2C_vista_da_S%C3%A9_do_Porto_(4%29.jpg

Requesting all of these through the public caches returns objects of the same size and ETag, but separate ages and cache hit-counts.

I still have some more digging to go before I know more about what's happening re: normalization at various layers here. I know Varnish isn't normalizing this at all (it considers all these variations to be distinct). Probably the right answer is to make the normalization consistent at all layers rather than trying to purge all possible encoding variants.
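
For illustration only: the actual change here is Varnish VCL, but the normalization idea can be sketched in a few lines of PHP. normalizeUploadPath() below is a hypothetical helper, not anything in the MediaWiki or puppet repos; it simply collapses percent-encoding variants of each path segment into the form rawurlencode() would produce (ignoring edge cases like encoded slashes within a segment), so that all four example URLs above map to the same key.

<?php
// Hypothetical sketch, not production code: collapse percent-encoding
// variants of an upload path into one canonical form, so caching and
// PURGE handling both see the same key.
function normalizeUploadPath( string $path ): string {
    $segments = explode( '/', $path );
    $normalized = array_map(
        static function ( string $segment ): string {
            // Decode any existing %XX escapes, then re-encode with
            // rawurlencode(), which leaves only RFC 3986 unreserved
            // characters (A-Z a-z 0-9 - _ . ~) bare.
            return rawurlencode( rawurldecode( $segment ) );
        },
        $segments
    );
    return implode( '/', $normalized );
}

// All four parentheses variants of the thumbnail above normalize to the
// same path, ending in ...800px-Porto%2C_vista_da_S%C3%A9_do_Porto_%284%29.jpg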

What I really need to dig on this further is an easy way to see a list of recent WP0-abuse-related deletions on various wikis. Am I missing some way to use the deletion log search interfaces?

Z591 should be the best list we have.

If I understand correctly, that's reporting on popular WP0 downloads (which is where I was noticing the parentheses). I was looking for logs of WP0-abuse-related administrative deletions, to compare against that and find the change in URL encoding between the deletion and the WP0 accesses.

Would it be fair to assume that the URL-encoding normalization rules for the upload.wikimedia.org URLs should be the same as the one we use for MediaWiki? Anyone know if there's any reason for it to vary from that?

I would check with @Gilles and @Krinkle as the authorities in this area, but I can't think of any reason that the canonical form would be different than what MediaWiki produces for the equivalent URIs.

To obtain some examples of recent WP0-abuse-related deletions, one could start from Z591#12542 (@Jdx's most recent report on files that had already been deleted but needed purging) and search Special:Log on the corresponding wiki for the file names derived from each URL, arriving at these entries.
Unless I'm overlooking something, there are no encoding discrepancies in these four examples.

Would it be fair to assume that the URL-encoding normalization rules for the upload.wikimedia.org URLs should be the same as the one we use for MediaWiki? Anyone know if there's any reason for it to vary from that?

Do you refer to normalization done in Varnish, or normalization done on the backend (in the form of redirects)? I think there is some special casing for wiki pages (e.g. to handle things like /wiki/Quo_vadis? which would normally resolve to /w/index.php?title=Quo_vadis but the user probably meant /w/index.php?title=Quo_vadis%3F) but I don't think any of that happens in Varnish.

On the cache_text side for the actual wikis, Varnish does do some normalization, but not complete normalization. Varnish basically just decodes a special handful of %-escapes based on what wfUrlencode does, and that code is here: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/normalize_path.inc.vcl.erb;c817459c34aa7ab815da266496864125b470b04a$40 . It's been a known issue for quite a long while that we could/should be doing better on that normalization, but hasn't been a priority because there's not much pragmatic fallout other than slight impact on cache hitrates.
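
To make "decodes a special handful of %-escapes" concrete: the real whitelist lives in the normalize_path.inc.vcl.erb template linked above and tracks wfUrlencode(); the PHP fragment below uses a deliberately tiny, hypothetical subset (just ':' and '/') purely to show the shape of that partial normalization, and is not the actual list.

<?php
// Hypothetical illustration of partial normalization: decode only a small
// whitelist of escapes, leaving everything else (e.g. parentheses) untouched.
function partiallyNormalize( string $path ): string {
    return str_ireplace( [ '%3A', '%2F' ], [ ':', '/' ], $path );
}

echo partiallyNormalize( '/wiki/Foo%3ABar_%284%29.jpg' ), "\n";
// /wiki/Foo:Bar_%284%29.jpg  -- colon decoded, parentheses left encoded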

On the MediaWiki end of that issue, MW apparently has a single canonical encoding representation for each title. Some non-canonical forms will result in 301 redirects, while others are silently accepted but emit no-cache headers to avoid PURGE issues. As an example on the wiki side of things, we can look at a few possible URL encodings for the enwiki article about AT&T:

https://en.wikipedia.org/wiki/AT%26T -> This is the canonical encoding, and returns the normal, cacheable article content containing a rel=canonical link to itself.
https://en.wikipedia.org/wiki/AT&T -> Returns an identical article (no redirect) containing a rel=canonical link to the canonical encoding, has Cache-control: private, must-revalidate, max-age=0 to prevent Varnish caching this encoding.
https://en.wikipedia.org/wiki/%41T%26T -> Returns a 301 redirect to the canonical encoding.

When I manually trigger a PURGE of the non-canonical variant via https://en.wikipedia.org/wiki/AT&T?action=purge, the observed PURGEs that arrive on cache_text are for the canonical encoding AT%26T.
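
As a purely illustrative aside (plain PHP, nothing MediaWiki-specific): one way to see the behavior described above from the outside is to request each encoding variant without following redirects and print the status line plus any Cache-Control header. get_headers() and the stream-context options are standard PHP; the user agent string is just a placeholder.

<?php
// Probe the three AT&T encodings discussed above: canonical (cacheable),
// accepted-but-uncacheable, and 301-redirected.
$variants = [
    'https://en.wikipedia.org/wiki/AT%26T',
    'https://en.wikipedia.org/wiki/AT&T',
    'https://en.wikipedia.org/wiki/%41T%26T',
];
// Don't follow redirects, so the 301 on the third variant stays visible.
$context = stream_context_create( [ 'http' => [
    'follow_location' => 0,
    'user_agent'      => 'encoding-probe/0.1 (placeholder)',
] ] );
foreach ( $variants as $url ) {
    $headers = get_headers( $url, false, $context );
    echo $url, "\n  ", $headers[0], "\n";
    foreach ( $headers as $line ) {
        if ( stripos( $line, 'Cache-Control:' ) === 0 ) {
            echo '  ', $line, "\n";
        }
    }
}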


As part of looking into all of this, I've been working on a patch to improve the cache_text Varnish normalization, and during that I think I've found the One True Canonical Encoding rules for MediaWiki on cache_text. I think every possible character has a definite canonical encoding for MW as either being correctly %-encoded or correctly decoded (or is disallowed or impossible, in which case consistency is all that really matters). I'm working on that here: https://gerrit.wikimedia.org/r/#/c/391216/7/modules/varnish/templates/normalize_path.inc.vcl.erb .


cache_upload Varnishes currently don't do any encoding normalization at all. If we assume the upload.wikimedia.org URLs want the same canonical encoding form as MediaWiki over in cache_text, then the above could also be applied to cache_upload to eliminate all path-encoding ambiguities there, too.

However, observationally, it seems that if there is a canonical encoding of upload URLs, it's not the same as MediaWiki's. Observe https://commons.wikimedia.org/wiki/File:Sweet_William_in_Aspen_(91167).jpg (which is the correct canonical form for that URL/Title for MediaWiki), which emits links to the actual file contents on cache_upload as https://upload.wikimedia.org/wikipedia/commons/4/40/Sweet_William_in_Aspen_%2891167%29.jpg . Observation of PURGE traffic seems to indicate that purges occur in this form (the %28/%29 form) on cache_upload as well. However, as pointed out earlier, that same content is also available in other URL encoding variants like https://upload.wikimedia.org/wikipedia/commons/4/40/Sweet_William_in_Aspen_(91167).jpg , which would not be purged by PURGEing the previous upload URL.


Looking at @Tbayer's links above: WP0 Reporter reported WP0 users hitting https://upload.wikimedia.org/wikinews/en/e/e4/YOUNG_KILLER_MUSIC-yabara(Hindio-News).ogg . The deletion log shows 22:45, 8 November 2017 Pi zero (talk | contribs) deleted page File:YOUNG KILLER MUSIC-yabara(Hindio-News).ogg (Mass deletion of pages added by Aivaldo Angel).

But if you plug this together with the information above, what's happening here is that the WP0 users are downloading from the URL encoded as https://upload.wikimedia.org/wikinews/en/e/e4/YOUNG_KILLER_MUSIC-yabara(Hindio-News).ogg, but the deletion of File:YOUNG KILLER MUSIC-yabara(Hindio-News).ogg on enwikinews probably causes the emission of a PURGE for the differently-encoded URL: https://upload.wikimedia.org/wikinews/en/e/e4/YOUNG_KILLER_MUSIC-yabara%28Hindio-News%29.ogg (which is the form it would've considered canonical for the upload.wm.o link and linked to from the file description page, before deletion).

So, my top questions at this point on all things related are:

  1. Over on the cache_text / MediaWiki side of things, I need a knowledgeable review of the new strict Varnish normalization proposed in https://gerrit.wikimedia.org/r/#/c/391216/7/modules/varnish/templates/normalize_path.inc.vcl.erb . Don't worry about the C implementation details, mostly the question is whether there's any real issue with us doing a strict normalization dividing the characters into the two groups proposed in the comments at the top.
  2. What are the current encoding/normalization rules MW applies when it generates upload.wikimedia.org "canonical" URLs for image links and upload PURGE traffic, and is there a good reason for them differing from the MediaWiki normalization of the same characters? (perhaps Swift doesn't like unencoded slashes, at least?). If there's no good reason for the deviation, we could change the normalization MW applies to upload URLs to match the normalization it does for its own URLs, and apply the above to cache_upload as well. If there is a good reason for deviation, we can document how the upload normalization differs and apply *that* over in cache_upload.

On the MW side of (2) above, it appears the swiftFileBackend code in MW uses PHP's urlencode to transform the filenames into upload URL paths. urlencode documentation claims that it percent-encodes everything but alphanumerics and -_. (so the set it does not encode is almost the official Unreserved Set, but it's missing the tilde). It also encodes spaces as + rather than %20 because it's meant for query strings rather than paths. PHP's rawurlencode would probably have been more appropriate here as it conforms to the RFC and excludes from encoding exactly the Unreserved Set and doesn't do the +-for-spaces thing. However, in practice, we can deal with the ~ issue and spaces have already been made into underscores, so the plusses shouldn't ever actually appear.

Regardless, this explanation seems consistent with observations of the upload.wm.o paths I've seen. We can normalize on similar rules there (but leave spaces as %20 just to be technically-correct, which again won't matter in practice). If at some later date we want to use a prettier normalization we can do that, too, but for now it would be simplest to leave the MediaWiki side alone and just conform everything else to its expectations.
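
To make the urlencode()/rawurlencode() difference above concrete, here is a tiny PHP illustration; the filename is made up for the example, not a real upload.

<?php
// urlencode() vs rawurlencode() on a name with spaces, a tilde and parentheses.
$name = 'Sweet William ~demo~ (91167).jpg';

echo urlencode( $name ), "\n";
// Sweet+William+%7Edemo%7E+%2891167%29.jpg    (space -> '+', '~' -> %7E)

echo rawurlencode( $name ), "\n";
// Sweet%20William%20~demo~%20%2891167%29.jpg  (space -> %20, '~' left bare)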

Change 391216 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Fully normalize upload paths

https://gerrit.wikimedia.org/r/391216

Change 391216 merged by BBlack:
[operations/puppet@production] Fully normalize upload paths

https://gerrit.wikimedia.org/r/391216

The above is in effect on all upload caches since about 17:23 UTC (just before this post), and doesn't seem to be causing any adverse effects.

Assuming the encoding theory is correct, this should stop the problem with delete->PURGE not affecting Zero downloaders.

All the URLs coming from user requests are being coerced into a normalized encoding before other processing (e.g. caching, fetching). That normalized encoding is believed to match the same encoding that would be used by MW for the matching PURGEs (but just in case it isn't, the same unique normalization is also applied to the URLs coming from the PURGE requests themselves before they're processed).

The only urlencode() I see in SwiftFileBackend (or any of filebackend/) is in a length check in resolveContainerPath(). That should use rawurlencode, though it wouldn't affect what URLs are used.

Yeah, you're right, I see that now. So it's all using rawurlencode() in practice, which is better! Since literal spaces aren't allowed in Title or File URLs (underscores), I think the only real impact here is how canonicalization works for the tilde. Do we have examples for File: URLs -> upload that have tildes to verify on?

Never mind, I took a quick look at some of the PURGE logs and found some. The tildes are decoded canonically (as we should expect from rawurlencode() and the RFC), so I'll update the canonicalization to decode tildes as well; it will avoid some pointless URL rewrites in the canonical cases (which should be the most common).

Change 392178 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] normalize_path_encoding: decode tilde for upload

https://gerrit.wikimedia.org/r/392178

Change 392178 merged by BBlack:
[operations/puppet@production] normalize_path_encoding: decode tilde for upload

https://gerrit.wikimedia.org/r/392178

I checked the daily reports for four recent days since November 19 (in each case soon after they came out). All of them covered several admin-deleted files, but none of them needed purging (i.e. none was still accessible) at the time I checked. I believe @Jdx did some checks too with the same result.
So it appears that @BBlack has fixed that issue. \o/
It looks like this has already greatly reduced the impact of this form of abuse (by making admin deletions effective immediately - although there are still cases where it takes a bit too long until that admin deletion occurs, e.g. in this case; running the reports more often than daily could potentially help with that).

@Keegan: Can you circle back with the Commons community and let them know that the purging issue seems to be resolved (as of last month)?

@kaldari I'll get something posted in the next few days, yup.

Posted yesterday: https://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=272175571#Update_on_purging_deleted_files

@kaldari Do we anticipate working on this anymore next quarter, or are we done here (for now, at least)?

@Keegan: I would say we're done here, barring any unforeseen events.

I'm planning to work on T167400: Disable serving unpatrolled new files to Wikipedia Zero users in the next couple weeks. Not sure if that's what you meant by "this".

I assumed Keegan was asking about needing continued CL support.

This.

Thanks, all!

@Tgr ping me in that task once it's completed if you would like me to communicate with the Commons community, more than happy to.