Investigation: Block by combination of hashed identifiable information (e.g. user agent, screen resolution, etc.) in addition to IP range.
Closed, Resolved · Public · 5 Estimated Story Points
Assigned To

Authored By

	• TBolliger
	Feb 24 2018, 12:46 AM

Description

As the blocking consultation reaches a stopping point, let's take a sprint to investigate the top ideas from a technical POV.

Project description

With this project, we would create a browser fingerprint with some specific identifiable pieces of data about the user's computer and store it as a hash. Admins could then set an IP range block that also includes a match for this fingerprint, but would not be able to see the hashed information.

We aim to do this within the current Privacy Policy and with data that is already being gathered/sent.
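A minimal sketch of the fingerprinting step described above, assuming SHA-256 over a canonical JSON encoding of the collected attributes. The function name `fingerprint_hash` and the attribute names and values are hypothetical and only illustrate the idea. They are not an existing MediaWiki API or an agreed list of data points.

```python
import hashlib
import json


def fingerprint_hash(attributes: dict) -> str:
    """Return a SHA-256 hex digest of the attributes in a canonical form."""
    # Sort keys so the same attributes always serialise to the same string.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute names and values, for illustration only.
device = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0",
    "screen_resolution": "1920x1080",
}

# Only this digest would be stored with the block, not the raw attributes.
print(fingerprint_hash(device))
```

Whether a plain hash like this is private enough is one of the open questions listed below.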

Questions to answer

Implementation
- If we are to build this, how would we proceed? (rough implementation plan)
- What is the delta between building this and just building T100070?
Data collection & retention
- What data is currently being collected?
- How long is this data kept?
- Can we hash this data and keep it for longer than 90 days?
Is a hash actually unique enough, given the small ecosystem of browsers?
Part of core or an extension?
How would client-side detection interact with backend?
More...???

Tracking

Example of tracking data that could be hashed:
https://panopticlick.eff.org/

Next steps

AHT to hold a 'Privacy by Design' meeting with WMF Legal to discuss a potential implementation

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• TBolliger	T120734 Epic ⚡️ Improve MediaWiki's blocking tools
		Resolved		dmaza	T188160 Investigation: Block by combination of hashed identifiable information (e.g. user agent, screen resolution, etc.) in addition to IP range.

Event Timeline

• TBolliger created this task. Feb 24 2018, 12:46 AM

Restricted Application added subscribers: MGChecker, Aklapper. Feb 24 2018, 12:46 AM

• TBolliger added subscribers: • SPoore, Jalexander, kaldari and 2 others. Feb 24 2018, 12:47 AM

• TBolliger added a parent task: T120734: Epic ⚡️ Improve MediaWiki's blocking tools. Feb 24 2018, 12:49 AM

• TBolliger triaged this task as Medium priority. Feb 24 2018, 12:56 AM

• TBolliger updated the task description. Feb 26 2018, 11:08 PM

MaxSem subscribed. Feb 27 2018, 7:21 PM

dbarratt updated the task description. Feb 27 2018, 7:25 PM

• TBolliger updated the task description. Feb 27 2018, 7:26 PM

• TBolliger updated the task description. Feb 27 2018, 7:28 PM

• TBolliger updated the task description.

MaxSem updated the task description. Feb 27 2018, 7:31 PM

• TBolliger updated the task description. Feb 27 2018, 7:32 PM

• TBolliger set the point value for this task to 5.

• TBolliger moved this task from Triage/To be Estimated to Cards ready for development on the Anti-Harassment board.

• TBolliger mentioned this in T176162: Reach a decision on which blocking tools AHT will build. Feb 28 2018, 6:25 PM

• TBolliger moved this task from Cards ready for development to AHT Sprint 16 on the Anti-Harassment board. Feb 28 2018, 7:47 PM

• TBolliger edited projects, added Anti-Harassment (AHT Sprint 16); removed Anti-Harassment.

dmaza claimed this task. Mar 6 2018, 9:19 PM

dmaza moved this task from Ready to In progress on the Anti-Harassment (AHT Sprint 16) board.

OK, I'm trying to think about how this will work in logs and I'm at a loss: Let's say Bad Bobby uses Firefox/Win10 inside the range User:68.112.39.0/27 and is blocked both by IP range and device. Good Gary tries to edit within that range with a different device. The edit should go through, correct? And it will be attributed to an IP inside User:68.112.39.0/27. Should it be annotated in any way? I would think not, but would it cause confusion about why the range is blocked but this edit came through?

And what if Gary turns out to be evil too, and the IP range block needs to be changed to include all devices, not just Firefox/Win10? I assume setting another block would do the trick.

Ping @dmaza @dbarratt @kaldari @MaxSem @MusikAnimal for their thoughts.

Good Gary tries to edit within that range with a different device. The edit should go through, correct? And it will be attributed to an IP inside User:68.112.39.0/27. Should it be annotated in any way?

Yes, it should go through. What do you mean by annotated? I don't think we need to keep track of this.

And what if Gary turns out to be evil too, and the IP range block needs to be changed to include all devices, not just Firefox/Win10? I assume setting another block would do the trick.

I would assume so.

I was under the impression that blocking IP/IP range + device would only be set when the IP in question is known to be used by multiple users to minimize collateral damage. I don't know if this is something that admins have access to.
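The Bobby/Gary scenario can be sketched as follows. This is only an illustration under assumptions: `RangeBlock` and `is_blocked` are hypothetical names, the device hash strings are placeholders, and none of this reflects the existing MediaWiki block code.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RangeBlock:
    network: ipaddress.IPv4Network
    # None means the block applies to every device in the range.
    device_hash: Optional[str] = None


def is_blocked(ip: str, device_hash: str, blocks: List[RangeBlock]) -> bool:
    """A block applies if the IP is in the range AND the device matches (if set)."""
    addr = ipaddress.ip_address(ip)
    for block in blocks:
        if addr not in block.network:
            continue
        if block.device_hash is None or block.device_hash == device_hash:
            return True
    return False


blocked_range = ipaddress.ip_network("68.112.39.0/27")
blocks = [RangeBlock(blocked_range, device_hash="hash-of-firefox-win10")]

# Bad Bobby: inside the range with the blocked device.
bobby_blocked = is_blocked("68.112.39.5", "hash-of-firefox-win10", blocks)
# Good Gary: inside the range with a different device, so the edit goes through.
gary_blocked = is_blocked("68.112.39.9", "hash-of-other-device", blocks)

# If Gary turns out to be evil too, a second block without a device hash
# covers every device in the range.
blocks.append(RangeBlock(blocked_range))
gary_blocked_after = is_blocked("68.112.39.9", "hash-of-other-device", blocks)

print(bobby_blocked, gary_blocked, gary_blocked_after)
```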

I was under the impression that blocking IP/IP range + device would only be set when the IP in question is known to be used by multiple users to minimize collateral damage. I don't know if this is something that admins have access to.

That's correct. I don't actually know the end goal of this task, but I'm assuming this feature is just for CheckUsers? Only they can see user agents.

If you're tossing in screen resolution along with the user agent, you're probably safe to block a range based on the edits alone, in most cases... But for instance a mobile range will be problematic. A lot of iPhones have the same screen size and the newest version of the browser. The UA on desktop is also commonly the newest popular browser/OS, so we can't go by that alone either, but screen size I imagine varies quite a bit.

So overall I'd be wary of this sort of smart block automation, but if it is reserved for CheckUsers, there shouldn't be many mistakes as they can verify the absence of collateral damage prior to blocking.

In T188160#4037108, @dmaza wrote:

I was under the impression that blocking IP/IP range + device would only be set when the IP in question is known to be used by multiple users to minimize collateral damage. I don't know if this is something that admins have access to.

Correct. No one should ever see the data.

After some more thinking, I think we will need to annotate the block log of the IP range block, not any edits made within the block range.

In T188160#4037108, @dmaza wrote:

Good Gary tries to edit within that range with a different device. The edit should go through, correct? And it will be attributed to an IP inside User:68.112.39.0/27. Should it be annotated in any way?

Yes, it should go through. What do you mean by annotated? I don't think we need to keep track of this.

Right now it would be incredibly confusing to admins if they see edits from a range that's currently blocked.

edit: I think this could be avoided with other solutions though.

In T188160#4037126, @MusikAnimal wrote:

If you're tossing in screen resolution along with the user agent, you're probably safe to block a range based on the edits alone, in most cases... But for instance a mobile range will be problematic. A lot of iPhones have the same screen size and the newest version of the browser. The UA on desktop is also commonly the newest popular browser/OS, so we can't go by that alone either, but screen size I imagine varies quite a bit.

So overall I'd be wary of this sort of smart block automation, but if it is reserved for CheckUsers, there shouldn't be many mistakes as they can verify the absence of collateral damage prior to blocking.

We're also investigating T188161: Investigation: Anon cookie blocking which will cause fewer false positives (in theory).

In T188160#4037130, @jrbs wrote:

Right now it would be incredibly confusing to admins if they see edits from a range that's currently blocked.

I don't think it would be so confusing. The block would simply include a note of "block only the device that made this edit", just like some blocks contain "anons only".

I find this proposal problematic for other reasons:

- There may be some weird outcomes such as "The students at university Y with an iPhone X cannot edit Wikipedia", although this could be acceptable.
- AFAIK we are not extracting screen resolution. Collecting it may require amending the privacy policy (unless it can be considered covered by (1)).
- It means processing and storing identifying information from everyone in order to block a few vandals. That probably wouldn't be well received.
- Even if the data is hashed, it seems it would be easy to recover the original data.
- The block would act as an oracle that could leak the identifiable information being used. If John was blocked with this option on a range I can connect through, I might try several common profiles and find out, e.g., that he was using an iPhone.

T188161 would be much safer, indeed.
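To illustrate the recoverability concern, here is a rough sketch assuming an unsalted SHA-256 hash over a user agent and a screen resolution. The candidate lists, the `fingerprint_hash` format and the `recover` helper are all hypothetical. The point is only that a small input space can be enumerated.

```python
import hashlib
from itertools import product
from typing import Optional, Tuple


def fingerprint_hash(user_agent: str, resolution: str) -> str:
    """Unsalted hash of two attributes (an assumed format, for illustration)."""
    return hashlib.sha256(f"{user_agent}|{resolution}".encode("utf-8")).hexdigest()


# Hypothetical short lists of common profiles an attacker might try.
COMMON_USER_AGENTS = ["Firefox/Win10", "Chrome/Win10", "Safari/iPhone"]
COMMON_RESOLUTIONS = ["1920x1080", "1366x768", "375x812"]


def recover(target_hash: str) -> Optional[Tuple[str, str]]:
    """Try every common combination until one hashes to the target."""
    for ua, res in product(COMMON_USER_AGENTS, COMMON_RESOLUTIONS):
        if fingerprint_hash(ua, res) == target_hash:
            return ua, res
    return None


stored = fingerprint_hash("Safari/iPhone", "375x812")
print(recover(stored))
```

The same loop works against the block itself when the hash is not visible: instead of comparing hashes, the attacker attempts an edit with each profile from inside the range and observes which one is refused.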

Amorymeltzer subscribed. Mar 9 2018, 2:04 AM

dmaza updated the task description. Mar 9 2018, 8:54 PM

SQL subscribed. Mar 13 2018, 4:43 AM

In T188160#4037184, @Platonides wrote:

I find this proposal problematic for other reasons:

There may be some weird outcomes such as "The students at university Y with an iPhone X cannot edit Wikipedia", although this could be acceptable.

True, these blocks would have to be used responsibly, and collateral damage should be properly evaluated before using them. Still, blocking iPhone X from university Y is narrower than blocking all of university Y, which is what we are trying to accomplish here.

AFAIK we are not extracting screen resolution. That may need amending the privacy policy (unless it can be considered covered by (1)).

We are not, and Legal has to check and approve these changes because we would likely want to collect more data than just screen resolution.

It means processing and storing identifying information from everyone in order to block a few vandals. Probably wouldn't be well-received.

I think it is a necessary evil, but it is not for me to decide. We are just adding a few things to what we already have in CheckUser.

Even if the data is hashed, it seems it would be easy to recover the original data.

The block would act as an oracle that could leak the identifiable information being used. If John was blocked with this option on a range I can connect through, I might try with several common profiles and find out eg. that he was using an iPhone.

Ideally we will use enough parameters to create very unique fingerprints making it harder to "brute force" the hash

• Niharika subscribed. Mar 13 2018, 6:48 PM

Will continue this in other meetings and documentation. A lot of product definition required here, and we need to check with Legal.

Thank you for your work on this investigation Dayllan!

Ideally we will use enough parameters to create very unique fingerprints making it harder to "brute force" the hash

In my opinion, there's a problem with that: the more parameters we use for creating a hash, the easier it is to circumvent.

In T188160#4051229, @MGChecker wrote:

Ideally we will use enough parameters to create very unique fingerprints making it harder to "brute force" the hash

In my opinion, there's a problem with that: the more parameters we use for creating a hash, the easier it is to circumvent.

How so? The hash will be a combination of all parameters. In other words, there is an AND relationship between the params, not an OR relationship.

If you change just one of these many parameters, the block is no longer triggered because the resulting hash is different. The more parameters we hash, the easier it gets to find one that can be changed easily.
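A small sketch of this circumvention argument, assuming a hash over all parameters joined together. The parameter names and values are hypothetical. Because of the AND relationship, changing any single input produces a different digest, so an exact-match block no longer applies.

```python
import hashlib


def fingerprint_hash(params: dict) -> str:
    """Hash all parameters together; every one of them affects the digest."""
    joined = "|".join(f"{key}={params[key]}" for key in sorted(params))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical parameters collected for the blocked device.
blocked_device = {
    "user_agent": "Firefox/Win10",
    "screen_resolution": "1920x1080",
    "timezone": "UTC-5",
    "language": "en-US",
    "color_depth": "24",
}
blocked_hash = fingerprint_hash(blocked_device)

# The blocked user changes just one easily adjustable parameter.
evading_device = dict(blocked_device, timezone="UTC+1")

print(fingerprint_hash(evading_device) == blocked_hash)  # False: block not triggered
```

Each added parameter makes the fingerprint more distinctive, but it also gives the blocked user one more knob to turn.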

Ideally we will use enough parameters to create very unique fingerprints making it harder to "brute force" the hash

Late comment, but I'm doubtful of this. If we do do this, the hashes should probably be kept secret (or at least they should be an HMAC with a secret key).
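A minimal sketch of the HMAC suggestion, using Python's standard `hmac` module. The helper names `keyed_fingerprint` and `matches`, the key value and the fingerprint string are hypothetical. In practice the key would have to live in server-side configuration that is never exposed.

```python
import hashlib
import hmac


def keyed_fingerprint(secret_key: bytes, fingerprint: str) -> str:
    """HMAC-SHA256 of the fingerprint; not reproducible without the key."""
    return hmac.new(secret_key, fingerprint.encode("utf-8"), hashlib.sha256).hexdigest()


def matches(secret_key: bytes, fingerprint: str, stored: str) -> bool:
    """Compare in constant time to avoid leaking information through timing."""
    return hmac.compare_digest(keyed_fingerprint(secret_key, fingerprint), stored)


# Placeholder key and fingerprint, for illustration only.
server_key = b"example-secret-key-not-for-production"
fingerprint = "Firefox/Win10|1920x1080"

stored = keyed_fingerprint(server_key, fingerprint)
print(matches(server_key, fingerprint, stored))

# Someone who obtains `stored` but not the key cannot confirm guesses offline:
# a plain hash of the right guess does not equal the stored value.
plain_guess = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
print(plain_guess == stored)
```

A keyed hash protects stored values against offline guessing. It does not by itself address the oracle concern raised earlier, since the block's behaviour can still be probed from inside the range.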
