
Tool labs tools should have a method of identifying Zero traffic
Closed, Declined · Public

Description

Because of {T129845}, some tools, such as my video2commons tool, are being abused by the Bangladesh Facebook groups to upload copyright violations en masse. Direct uploads to production can be identified as Zero traffic via T131211 (Surface Wikipedia Zero traffic in AbuseFilter (X-Carrier)), but Tool Labs tools receive no information that would identify such traffic, and are thus unable to prevent the abuse.

Also videoconvert is being abused.

Event Timeline

Restricted Application added projects: Operations, Cloud-Services. Apr 6 2016, 2:47 PM
Restricted Application added a subscriber: Aklapper.

Does Wikipedia Zero include non-wikipedia domains? I would expect tools.wmflabs.org to fall out of scope.

BBlack added subscribers: DFoy, Yurik, dr0ptp4kt.

Well the whitelists are by network range, not by hostname. Still, I wouldn't expect the public IPs for labs-y things to be whitelisted. I think the question-mark here is whether Facebook's zero stuff is using the same limited Zero whitelisting of our networks that the carriers do, or is using some more-broad definition like "all wikimedia networks"...

Gunnex added a subscriber: Gunnex. Apr 12 2016, 7:33 AM

As far as I can analyze the situation (I am not a technician, just giving some feedback from the user front about what I am currently monitoring...), only a few uploads of copyrighted films/videos and music were caught by abuse filter 149 (Wikipedia Zero uploads). I assume that most of the uploaders are on paid mobile plans (flat rates), probably also with better bandwidth to handle uploads of 500 MB to 1 GB (complete films: Superman/Star Wars etc.). I remember the "Angola Facebook Case", where in particular some admins of the related Facebook groups spent mobile credits for "their" Facebook audience, providing music & video files via Commons. Btw, the "Angola Facebook Case" was focused more on music (mostly ogg files). The "Bangladesh Facebook Case" is more focused on films & videos (webm, ogv).

If you go into details in the video2commons filter you will see that:

  1. +/- 98-99 % of the red (deleted) links were uploaded via the "Bangladesh Facebook Case"...
  2. ...which - btw - is increasing in user numbers significantly: two weeks ago around 70 accounts were identified, on 09.04.2016 we had 148 accounts. Today we have over 190 accounts.
  3. most of them are using "video2commons" in combination with "googlevideo.com" as source (typical url: "https://r8---sn-4g57kn7e.googlevideo.com/videoplayback?i(...)"). Info: they are NOT exclusively using "video2commons" as upload tool; they also upload files using standard tools... but those uploads are often caught by the mobile edit filter as well.

So, 1-3 lead to the question: would blocking "googlevideo.com" as a video upload source help? Well, googlevideo.com redirects to the Google Video Search...
Or: is preventing/blocking mobile uploads of video & music files an option?

In other words: we urgently need a solution for this.

It is nice to have some tools for monitoring, but if there is no one who tags the files quickly, the tools become useless. We are talking about several Bangladesh Facebook groups (which share these files instantly) with 10,000+ members. And there are millions of members out there in related FB groups just looking for free file hosts like... Commons.

Btw, this is not only about this specific case. From what I have seen so far (this is my personal impression, but I am involved almost daily in cleaning up these files), the Wikipedia Zero traffic on Commons as a whole (which we have been able to track since 06.04.2016), and especially the uploads, is mostly useless: grabbed from the Internet, out of scope, etc., tying up additional resources of an already understaffed team. Is Wikipedia Zero even justifiable with a bad ratio of (let's say) +/- 90-95 %?

mark added a subscriber: mark. Apr 12 2016, 10:28 AM

Awaiting more info from Zero folks.

It seems this is really a question for @BBlack - is it possible to make all of the wmflabs traffic go through the Varnish layer so that it gets tagged with the Zero headers? If so, then the custom tool can take those headers into account and refuse to work. Or am I misunderstanding this task?
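A minimal sketch of what "take those headers into account and refuse to work" could look like on the tool side. It assumes, hypothetically, that the proxy in front of the tool forwards an X-Carrier header on Zero requests; the function names and the sample header value are illustrative and not taken from video2commons.

```python
from typing import Mapping, Optional


def zero_carrier(headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-Carrier value if the request carries one, else None.

    Header names are compared case-insensitively, as HTTP requires.
    """
    for name, value in headers.items():
        if name.lower() == "x-carrier" and value.strip():
            return value.strip()
    return None


def should_refuse(headers: Mapping[str, str]) -> bool:
    """Hypothetical policy: refuse any request tagged as Zero traffic."""
    return zero_carrier(headers) is not None
```

A tool would call `should_refuse(request_headers)` before starting an upload job. This only works if some trusted layer sets the header and strips any client-supplied copy.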

jayvdb added a subscriber: jayvdb. Apr 19 2016, 3:01 AM

Maybe an alternative is to block specific tools from being accessed by Zero, or to kill the connection if a Zero connection uploads more than a preset limit to these tools.
That simplifies the task a bit, which may allow a quicker fix and may be more appropriate, given that Zero uploads aren't the main source of these uploads.

DFoy added a comment. Apr 19 2016, 4:00 AM

@BBlack @jayvdb My expectation (and current understanding) is that the IP ranges we provide for Zero whitelisting are only for the production servers, and not for any sort of labs or beta system. This was initially decided to avoid the possibility of someone setting up a proxy that would allow users to go through the mobile operator's whitelisting and then have the zero-rated traffic redirected outside WMF via that proxy.

If my understanding is still correct (that only our production servers are allowed to use the agreed-upon IP ranges for whitelisting), I don't know what good it will do to try to detect if people using non-production servers/tools are coming from a Zero partner. In a non-production case, they would not be zero rated, and therefore have no advantage or need to use that network.

Regarding Facebook and Free Basics, they should also be using the same IP ranges to zero rate as our Wikipedia Zero partners use. If you believe this is not the case, let me know and I will check into it with them.

@DFoy - I think all of your assumptions are correct above. This really isn't a Zero-related problem. It's just that the abusive traffic happens to come in over mobile networks, and some mobile networks happen to be Zero networks. Lots of other mobile networks won't be Zero networks, and either way Labs shouldn't be in any Zero whitelists from the carriers' POV.

If I had to guess why mostly-mobile, probably because they're the most common networks for certain classes/geographies of users in general, or it helps them anonymize their traffic against DMCA enforcement using burner SIMs?

Can we start with the broader scope questions here? Commons is an open platform for anonymous content sharing where the sharing happens first and the administrative vetting happens later. That seems like it will always be ripe for abuse and overwhelming admins by design. How would you fix these problems without playing whack-a-mole trying to block out specific source networks? ...

@DFoy , thanks for explaining that the 'zero' customer is being charged for using wmflabs.

The "problem" is that these tools download data from other sites, so the zero customer (accessing a non-zero rated tool on wmflabs) is not incurring a significant charge. They only post a URL, and the wmf tool does the rest of the work.

I see now there is another possible quick fix.
The video2commons tool uses OAuth, so all users of this tool are being sent to production servers. The production servers (which know the user is zero rated) could reject OAuth access to this tool (and others) from zero customers, possibly returning an error message to the video2commons tool which mentions that it is denied as they are zero rated.

I still think we're barking up the wrong tree here trying to identify them by their Zero rating. The Zero rating is irrelevant. They just happen to be on mobile networks that happen to be zero-rated...

Forgive my ignorance of commons uploading tools, but... "video2commons tool uses OAuth" means they have to have a commons wiki login to use the video2commons uploader? Can we not ban accounts? Are they creating a constant stream of new junk accounts? Can we place some limits and delays on how much data a newly-created account can upload?

zhuyifei1999 closed this task as Invalid. Apr 19 2016, 5:10 AM

I guess this is impossible then, given that the traffic to Labs is not Zero.

The production servers (which know the user is zero rated) could reject OAuth access to this tool (and others) from zero customers

Umm, how? Tool Labs tools do not know the IP addresses of users, so it's difficult to check whether the user is on a mobile carrier or not. And with the default https, the only places where the X-Carrier header (the header used in production to identify Zero traffic) can be added are the user's browser (forget it) and the Tool Labs webproxy (which is not doing so). As a result, a tool cannot identify such traffic, and will not send the header to production.
With the OAuth API request originating from Labs, the production Varnish layer cannot add the header to the OAuth API request. So MediaWiki does not know whether the user is on a Zero carrier, a non-Zero mobile carrier, or something else.

... unless MediaWiki stores who's Zero and who's not.

Can we not ban accounts? Are they creating a constant stream of new junk accounts?

Please see https://commons.wikimedia.org/wiki/User:NahidSultan/Bangladesh_Facebook_Case/Accounts

Can we place some limits and delays on how much data a newly-created account can upload?

Such limits are subject to complicated community discussion, as they restrict not only these users but also good-faith users. And I don't think it would be effective: some users in the list sleep for days before starting to upload, and most can only upload one or two files before they get blocked/locked.

I guess this is impossible then, given that the traffic to Labs is not Zero.

The production servers (which know the user is zero rated) could reject OAuth access to this tool (and others) from zero customers

Umm, how? Tool Labs tools do not know the IP addresses of users, so it's difficult to check whether the user is on a mobile carrier or not. And with the default https, the only places where the X-Carrier header (the header used in production to identify Zero traffic) can be added are the user's browser (forget it) and the Tool Labs webproxy (which is not doing so). As a result, a tool cannot identify such traffic, and will not send the header to production.
With the OAuth API request originating from Labs, the production Varnish layer cannot add the header to the OAuth API request. So MediaWiki does not know whether the user is on a Zero carrier, a non-Zero mobile carrier, or something else.
... unless MediaWiki stores who's Zero and who's not.

During the tool's OAuth process, the user must log into MediaWiki.org, and while the user is on MediaWiki.org the user must grant OAuth access to 'video2commons'. It is during that part of the process that MediaWiki.org could block 'zero' customers access to 'video2commons', or any other OAuth based tool, because they will (by default) access MediaWiki.org via the 'zero' rated service.

I still think we're barking up the wrong tree here trying to identify them by their Zero rating. The Zero rating is irrelevant. They just happen to be on mobile networks that happen to be zero-rated...

Before declaring one method irrelevant or invalid, please find another vector which has a higher correlation with the problem *and* a false positive rate the community will accept.

More general solutions would be lovely, if the false positive rate is low. However this is a specific problem, and lots of the potential solutions are being declared invalid because there are other possible solutions.

Forgive my ignorance of commons uploading tools, but... "video2commons tool uses OAuth" means they have to have a commons wiki login to use the video2commons uploader? Can we not ban accounts? Are they creating a constant stream of new junk accounts?

Yes, they are creating a constant stream of junk accounts; if your goal was to upload copyrighted content to Commons, you would too, yes? They are not silly.
See the following link (mentioned in the task description) and other similar lists being used to track the problem.

https://commons.wikimedia.org/w/index.php?title=Special:RecentChanges&limit=500&tagfilter=OAuth+CID%3A+394

Can we place some limits and delays on how much data a newly-created account can upload?

As @zhuyifei1999 has indicated, this isn't likely to be a seriously entertained approach. It is far too likely to hurt more than it helps. But, if you think this is a viable approach, let's get some user analytics to evaluate it ...? The community will look at this proposal more seriously if you can show throttling parameters which will reduce the amount of problematic uploads without seriously reducing the volume of good uploads occurring.

zhuyifei1999 reopened this task as Open. Apr 19 2016, 11:32 AM

During the tool's OAuth process, the user must log into MediaWiki.org, and while the user is on MediaWiki.org the user must grant OAuth access to 'video2commons'. It is during that part of the process that MediaWiki.org could block 'zero' customers access to 'video2commons', or any other OAuth based tool, because they will (by default) access MediaWiki.org via the 'zero' rated service.

Good point. (Though it's actually commons.wikimedia.org granting the access, not mediawiki.org)

On a side note, my tool is not the only one being abused. videoconvert is another example, although to a lesser extent.

jayvdb updated the task description. Apr 19 2016, 11:41 AM

I get that the current abusers happen to be on a Zero network, but since their traffic into these tools isn't actually Zero-rated, that's merely a coincidence. If we dropped Zero-rating for that carrier, you'd probably still be getting abuse from that carrier. If we start blanket denying Zero networks access to tools for uploading content, are we not having a negative effect on the global south market we're trying to get more good-faith users contributing from?

Restricted Application added a subscriber: TerraCodes. Apr 19 2016, 9:47 PM
DFoy added a comment. Apr 19 2016, 11:06 PM

At this point, we have established that we have the technical capability to identify Zero traffic on tool labs tools / video2commons. However, we’ve also learned that this access is not zero-rated, even on Zero partner networks. In addition, this tool does not use client bandwidth for the video transfer, making even paid mobile access cost effective for the abuse we’re experiencing. Because of this, I believe that identifying Zero network users is not going to help us much here (if at all).

The one connection we do know between users and the risk of abusive behavior is the age of the user account. That is, pirates create new accounts as needed and then use them for abuse, although they may wait a short time (several days) before using those new accounts. Therefore, I propose we investigate the option of limiting the access to video2commons to user accounts that have a certain minimum age.

In T131934#2217482, @jayvdb wrote:
As @zhuyifei1999 has indicated, this isn't likely to be a seriously entertained approach. It is far too likely to hurt more than it helps. But, if you think this is a viable approach, let's get some user analytics to evaluate it ...? The community will look at this proposal more seriously if you can show throttling parameters which will reduce the amount of problematic uploads without seriously reducing the volume of good uploads occurring.

I agree here about getting some data together. To proceed, I think we need to know the typical age of the user accounts that are already using video2commons for legitimate purposes. With this information, we may be able to figure out an account age that allows the existing users of video2commons to continue using it, but stops more recently created accounts from gaining access. I’m expecting that this ‘waiting period’ threshold for new accounts would be on the order of months, not days, which could help make this an effective deterrent for abuse. Can someone get data together on this topic to see if this is a viable approach? Also, are there additional data points we could access regarding the user that could help strengthen this evaluation?

With this approach, we only rely on information that we have under our control. Therefore, it completely takes away the ability for the pirates to spoof information that would be used to thwart a network-based access method.
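To make the data request concrete, here is a minimal sketch of how candidate thresholds could be compared once account ages are available. It assumes two lists of account ages in days at upload time, one for legitimate video2commons users and one for abusive accounts. The function name is hypothetical, and no real figures from Commons are used here.

```python
from typing import Iterable


def blocked_fraction(account_ages_days: Iterable[float], min_age_days: float) -> float:
    """Fraction of accounts younger than min_age_days, i.e. the share
    that a minimum-age requirement would lock out."""
    ages = list(account_ages_days)
    if not ages:
        return 0.0
    blocked = sum(1 for age in ages if age < min_age_days)
    return blocked / len(ages)
```

Running this for several thresholds on both lists would show the trade-off directly: a useful threshold blocks a high fraction of the abusive accounts and a low fraction of the legitimate ones.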

If we dropped Zero-rating for that carrier, you'd probably still be getting abuse from that carrier.

Why do you believe that? If they can't download the content free, I don't believe they will upload copyright violations. I'd like to understand how you see they will still benefit from abusing Wikimedia, or why they would continue to do it even if they do not benefit. @Gunnex earlier indicated that some of them are doing it to gain social status in their group, and some might be 'altruistically' doing it to help their friends on a different zero rated carrier...?

Any data you have that can support your argument would be most welcome.

If we start blanket denying Zero networks access to tools for uploading content, are we not having a negative effect on the global south market we're trying to get more good-faith users contributing from?

Nobody is suggesting blanket denying Zero network access, especially not indefinitely. Please do not imply that other people are wanting to hurt 'global south' access. It is disrespectful (or bad-faith, if you prefer). fwiw, I *live* in the 'Global South', and work *full time* (and more, i.e. 200%, at significant loss of all sorts of creature comforts) on Wikimedia projects 100% focused on benefiting the local 'Global South' community.

We are looking for options that are optimal for a specific and complex problem, and polite, respectful discussions are necessary, ideally appreciating the skills and experience each person brings to the table. There are also timeliness aspects to this; disrupting these activities is a social element in discouraging them - if we can show that as a community we will effectively fight back, via technical and social means, it will reduce the number of people attempting it. Once they learn we will effectively resist new approaches as they attempt them, and if they stop (big if, I appreciate that), preventative barriers can be lowered. Just because people suggest draconian approaches doesn't mean they like those approaches. Admins on projects regularly need to apply broad range IP blocks, which adversely impact our global south audience, but typically the intention is to best manage the situation given the available admin resources.

...

  3. most of them are using "video2commons" in combination with "googlevideo.com" as source (typical url: "https://r8---sn-4g57kn7e.googlevideo.com/videoplayback?i(...)"). Info: they are NOT exclusively using "video2commons" as upload tool; they also upload files using standard tools... but those uploads are often caught by the mobile edit filter as well.

...
So, 1-3 lead to the question: would blocking "googlevideo.com" as a video upload source help? Well, googlevideo.com redirects to the Google Video Search...
...

Yea, I noticed that also, and it has continued to be a source of problem uploads since your analysis on the 12th, though slightly less so in the last few days. Blocking that URL certainly looks like a viable approach to reduce the size of the problem. I've initiated a request on meta to block it, as meta has more tools/expert community members to evaluate this type of approach.

https://meta.wikimedia.org/wiki/Talk:Spam_blacklist#redirector.googlevideo.com_or_.2A.googlevideo.com

If that request fails, we can initiate a local request on Commons, where the cost/benefit is more clearly in favour of blocking that domain, and could request that each relevant tool also blocks it.
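For the tool-side block mentioned above, a minimal sketch of a source-URL check, assuming the tool receives the source as a plain URL string. The `BLOCKED_DOMAINS` tuple and the `is_blocked_source` name are illustrative, not part of any existing tool.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; only the domain discussed in this task is listed.
BLOCKED_DOMAINS = ("googlevideo.com",)


def is_blocked_source(url: str) -> bool:
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(
        host == domain or host.endswith("." + domain)
        for domain in BLOCKED_DOMAINS
    )
```

Matching on the parsed hostname rather than on a substring of the URL avoids false hits on lookalike hosts and on URLs that merely mention the domain in their path or query string.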

At this point, we have established that we have the technical capability to identify Zero traffic on tool labs tools / video2commons. However, we’ve also learned that this access is not zero-rated, even on Zero partner networks. In addition, this tool does not use client bandwidth for the video transfer, making even paid mobile access cost effective for the abuse we’re experiencing. Because of this, I believe that identifying Zero network users is not going to help us much here (if at all).

In part, I question this assertion. There is a large cohort of people in developing countries who can only afford to have one provider. If they are using a carrier that registers their activity as a zero-rated IP when they request OAuth on Commons, and are prevented on that basis, most of them are unable to participate in the abusive uploading process. This won't affect abusive uploaders who can afford to have multiple active SIMs, often with multiple smartphones, but that cohort also has less direct benefit from participating in the abusive uploading.

I can provide data for the general case if requested, but it is only general data, not Wikimedia specific, and not specific to this problem. If WMF has data that shows many of the abusive uploaders are switching carriers, then your assertion is spot on.

The aspect of potential IP spoofing is an interesting one, but the cohort that can do that is also much smaller, however they are highly motivated and their motives are much less predictable. If this is a significant element of the abusive uploading case, much more targeted tasks should be created.

The one connection we do know between users and the risk of abusive behavior is the age of the user account. That is, pirates create new accounts as needed and then use them for abuse, although they may wait a short time (several days) before using those new accounts. Therefore, I propose we investigate the option of limiting the access to video2commons to user accounts that have a certain minimum age.

They will quickly adjust to using sleeper accounts, especially if the error message explains why they are unable to use the accounts.
But this approach has merit. Adding a requirement that 'new' users have at least some edit count may help, but a great restriction would be to limit 'new' users of these tools to one upload per day, which gives admins time to evaluate their upload and block the account. That means the abusive accounts need to create lots of sleepers, as they only get one abusive upload each before they are blocked. Beyond the scope of this task, we can also implement additional restrictions on how easy it is to create lots of sleepers - I've not kept up with MediaWiki advances in that area, but I do know it is still quite trivial to create lots of sleepers.
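The edit-count requirement and the one-upload-per-day limit for 'new' users could be combined roughly as follows. This is a sketch under stated assumptions: the 90-day and 10-edit thresholds are placeholders chosen for illustration, and the tool is assumed to know each user's registration date, edit count and last upload time.

```python
from datetime import datetime, timedelta
from typing import Optional

# Placeholder thresholds, not values proposed in this task.
NEW_ACCOUNT_AGE = timedelta(days=90)
MIN_EDITS_FOR_NEW_ACCOUNTS = 10
NEW_ACCOUNT_UPLOAD_INTERVAL = timedelta(days=1)


def may_upload(
    registered: datetime,
    edit_count: int,
    last_upload: Optional[datetime],
    now: datetime,
) -> bool:
    """Decide whether an account may start an upload right now."""
    if now - registered >= NEW_ACCOUNT_AGE:
        return True  # established accounts are not restricted
    if edit_count < MIN_EDITS_FOR_NEW_ACCOUNTS:
        return False  # new account without enough edits
    if last_upload is not None and now - last_upload < NEW_ACCOUNT_UPLOAD_INTERVAL:
        return False  # new accounts get one upload per day
    return True
```

With these rules an abusive sleeper account gets at most one upload through the tool per day, which leaves admins time to review that upload and block the account.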

(...)
I agree here about getting some data together. To proceed, I think we need to know the typical age of the user accounts that are already using video2commons for legitimate purposes. With this information, we may be able to figure out an account age that allows the existing users of video2commons to continue using it, but stops more recently created accounts from gaining access. I’m expecting that this ‘waiting period’ threshold for new accounts would be on the order of months, not days, which could help make this an effective deterrent for abuse. Can someone get data together on this topic to see if this is a viable approach? Also, are there additional data points we could access regarding the user that could help strengthen this evaluation?

For this, http://quarry.wmflabs.org/query/9093 (Newbie Deleted Audio/Video) may possibly help. It shows all deleted videos on Commons (from all upload tools) that were uploaded by users with fewer than 100 edits and an account age under 3 months, with information about the dates of upload, deletion, user registration etc., and columns like (rightmost) "Description". The related uploads started around Mar 1. As far as I can see, most of the users are more or less "instant uploaders".

In the meantime there might be a chance of some kind of turnaround (check T129845#2218072). But the uploads are still coming....

...

  3. most of them are using "video2commons" in combination with "googlevideo.com" as source (typical url: "https://r8---sn-4g57kn7e.googlevideo.com/videoplayback?i(...)"). Info: they are NOT exclusively using "video2commons" as upload tool; they also upload files using standard tools... but those uploads are often caught by the mobile edit filter as well.

...
So, 1-3 lead to the question: would blocking "googlevideo.com" as a video upload source help? Well, googlevideo.com redirects to the Google Video Search...
...

Yea, I noticed that also, and it has continued to be a source of problem uploads since your analysis on the 12th, though slightly less so in the last few days. Blocking that URL certainly looks like a viable approach to reduce the size of the problem. I've initiated a request on meta to block it, as meta has more tools/expert community members to evaluate this type of approach.
https://meta.wikimedia.org/wiki/Talk:Spam_blacklist#redirector.googlevideo.com_or_.2A.googlevideo.com

@Beetstra has done the deed with https://meta.wikimedia.org/w/index.php?title=Talk:Spam_blacklist&diff=0&oldid=15539395 .

DFoy added a comment. Apr 20 2016, 7:17 PM

In T131934#2221686, @jayvdb wrote:
In part, I question this assertion. There is a large cohort of people in developing countries who can only afford to have one provider. If they are using a carrier that registers their activity as a zero-rated IP when they request OAuth on Commons, and are prevented on that basis, most of them are unable to participate in the abusive uploading process. This won't affect abusive uploaders who can afford to have multiple active SIMs, often with multiple smartphones, but that cohort also has less direct benefit from participating in the abusive uploading.
I can provide data for the general case if requested, but it is only general data, not Wikimedia specific, and not specific to this problem. If WMF has data that shows many of the abusive uploaders are switching carriers, then your assertion is spot on.

Fair question about carrier switching. In my experience traveling to less developed countries in Asia and Africa, the least well off are more likely to use multiple carriers than people who are better off. That's because the mobile operators there are primarily based on a pay-as-you-go model, with no monthly bills, contracts, etc. The people who are watching their money the closest switch sim cards practically per task, since one carrier may offer lower cost for calling out vs sending SMS messages, and other promotional pricing. The users are amazingly adept at the process of switching sim cards around, and some phone models offer a multi-sim capacity so you can switch between carriers without opening the case. So to the point of the discussion, there is a very low barrier for most people to quickly switch between networks.

In T131934#2224940, @DFoy wrote:
(...)
The users are amazingly adept at the process of switching sim cards around, and some phone models offer a multi-sim capacity so you can switch between carriers without opening the case. So to the point of the discussion, there is a very low barrier for most people to quickly switch between networks.

(...)

Okay, that explains something I wondered about some time ago (last paragraph): why a related user uploaded a file (not caught by the WO 149 filter) and 1 minute later (now caught by the WO 150 filter) added categories to the file... :-)

chasemp closed this task as Declined. May 31 2016, 3:21 PM
chasemp triaged this task as Normal priority.
chasemp added a subscriber: chasemp.

This is a lot of back story my friends. AFAICT this is declined.