
Investigation: Block by combination of hashed identifiable information (e.g. user agent, screen resolution, etc.) in addition to IP range.
Closed, ResolvedPublic5 Estimated Story Points

Description

As the blocking consultation reaches a stopping point, let's take a sprint to investigate the top ideas from a technical POV.


Project description

With this project, we would create a browser fingerprint with some specific identifiable pieces of data about the user's computer and store it as a hash. Admins could then set an IP range block that also includes a match for this fingerprint, but would not be able to see the hashed information.

We aim to do this within the current Privacy Policy and with data that is already being gathered/sent.
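A minimal sketch of how such a hashed fingerprint might be computed, assuming hypothetical field choices (the actual data set would be settled with Legal, and only the digest would ever be stored):

```python
import hashlib

def fingerprint_hash(user_agent: str, screen_resolution: str) -> str:
    # Join the fields with a separator so that ("ab", "c") and
    # ("a", "bc") cannot collide, then hash the combined string.
    # Field names are illustrative; the plaintext is never stored.
    raw = "|".join([user_agent, screen_resolution])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

An admin-set block would then store this digest alongside the IP range; an incoming edit from that range would be re-hashed the same way and compared against the stored value.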


Questions to answer
  • Implementation
    • If we are to build this, how would we proceed? (rough implementation plan)
    • What is the delta between building this and just building T100070?
  • Data collection & retention
    • What data is currently being collected?
    • How long is this data kept?
    • Can we hash this data and keep it for longer than 90 days?
  • Is a hash actually unique enough, given the small ecosystem of browsers?
  • Part of core or an extension?
  • How would client-side detection interact with backend?
  • More...???

Tracking

Example of tracking data that could be hashed:
https://panopticlick.eff.org/


Next steps
  • AHT to hold a 'Privacy by Design' meeting with WMF Legal to discuss a potential implementation

Event Timeline

TBolliger set the point value for this task to 5.

OK, I'm trying to think about how this will work in logs and I'm at a loss: Let's say Bad Bobby uses Firefox/Win10 inside the range User:68.112.39.0/27 and is blocked both by IP range and device. Good Gary tries to edit within that range with a different device. The edit should go through, correct? And it will be attributed to an IP inside User:68.112.39.0/27. Should it be annotated in any way? I would think not, but would it cause confusion why the range is blocked but this edit came through?

And what if Gary turns out to be evil too, and the IP range block needs to be changed to include all devices, not just Firefox/Win10? I assume setting another block would do the trick.

Ping @dmaza @dbarratt @kaldari @MaxSem @MusikAnimal for their thoughts.

Good Gary tries to edit within that range with a different device. The edit should go through, correct? And it will be attributed to an IP inside User:68.112.39.0/27. Should it be annotated in any way?

Yes, it should go through. What do you mean by annotated? I don't think we need to keep track of this.

And what if Gary turns out to be evil too, and the IP range block needs to be changed to include all devices, not just Firefox/Win10? I assume setting another block would do the trick.

I would assume so.

I was under the impression that blocking IP/IP range + device would only be set when the IP in question is known to be used by multiple users to minimize collateral damage. I don't know if this is something that admins have access to.

I was under the impression that blocking IP/IP range + device would only be set when the IP in question is known to be used by multiple users to minimize collateral damage. I don't know if this is something that admins have access to.

That's correct. I don't actually know the end goal of this task, but I'm assuming this feature is just for CheckUsers? Only they can see user agents.

If you're tossing in screen resolution along with the user agent, you're probably safe to block a range based on the edits alone, in most cases... But for instance a mobile range will be problematic. A lot of iPhones have the same screen size and the newest version of the browser. The UA on desktop is also commonly the newest popular browser/OS, so we can't go by that alone either, but screen size I imagine varies quite a bit.

So overall I'd be wary of this sort of smart block automation, but if it is reserved for CheckUsers, there shouldn't be many mistakes, as they can verify the absence of collateral damage prior to blocking.

I was under the impression that blocking IP/IP range + device would only be set when the IP in question is known to be used by multiple users to minimize collateral damage. I don't know if this is something that admins have access to.

Correct. No one should ever see the data.

After some more thinking, I think we will need to annotate the block log of the IP range block, not any edits made within the block range.

Good Gary tries to edit within that range with a different device. The edit should go through, correct? And it will be attributed to an IP inside User:68.112.39.0/27. Should it be annotated in any way?

Yes, it should go through. What do you mean by annotated? I don't think we need to keep track of this.

Right now it would be incredibly confusing to admins if they see edits from a range that's currently blocked.

edit: I think this could be avoided with other solutions though.

If you're tossing in screen resolution along with the user agent, you're probably safe to block a range based on the edits alone, in most cases... But for instance a mobile range will be problematic. A lot of iPhones have the same screen size and the newest version of the browser. The UA on desktop is also commonly the newest popular browser/OS, so we can't go by that alone either, but screen size I imagine varies quite a bit.

So overall I'd be wary of this sort of smart block automation, but if it is reserved for CheckUsers, there shouldn't be many mistakes, as they can verify the absence of collateral damage prior to blocking.

We're also investigating T188161: Investigation: Anon cookie blocking which will cause fewer false positives (in theory).

Right now it would be incredibly confusing to admins if they see edits from a range that's currently blocked.

I don't think it would be so confusing. The block would simply include a note of "block only the device that made this edit", just like some blocks contain "anons only".

I find this proposal problematic for other reasons:

  • There may be some weird outcomes such as "The students at university Y with an iPhone X cannot edit Wikipedia", though this could be acceptable.
  • AFAIK we are not extracting screen resolution. That may require amending the privacy policy (unless it can be considered covered by (1)).
  • It means processing and storing identifying information from everyone in order to block a few vandals. Probably wouldn't be well-received.
  • Even if we hash it, it seems it would be easy to recover the original data.
  • The block would act as an oracle that could leak the identifiable information being used. If John was blocked with this option on a range I can connect through, I might try several common profiles and find out, e.g., that he was using an iPhone.
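The oracle concern can be made concrete: with an unsalted hash over a small space of common device profiles, the original data is recoverable by simple enumeration. A sketch with hypothetical profile values:

```python
import hashlib
from itertools import product

def fp(ua: str, res: str) -> str:
    return hashlib.sha256(f"{ua}|{res}".encode("utf-8")).hexdigest()

# A handful of common profiles covers most real traffic, so an
# attacker who can observe whether the block fires (or who obtains
# the stored digest) only needs to try each combination in turn.
common_uas = ["iPhone Safari", "Chrome / Windows 10", "Firefox / Windows 10"]
common_res = ["375x667", "1920x1080", "1366x768"]

target = fp("iPhone Safari", "375x667")  # the stored hash under attack

recovered = next(
    (ua, res)
    for ua, res in product(common_uas, common_res)
    if fp(ua, res) == target
)
```

Nine hash computations suffice here; even with many more parameters, the space of *plausible* real-world values stays small enough to enumerate.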

T188161 would be much safer, indeed.

I find this proposal problematic for other reasons:

  • There may be some weird outcomes such as "The students at university Y with an iPhone X cannot edit Wikipedia", though this could be acceptable.

True, these blocks would have to be used responsibly, and collateral damage should be properly evaluated before using them. Still, blocking iPhone X users from university Y is more narrowly targeted than blocking all of university Y, which is what we are trying to accomplish here.

  • AFAIK we are not extracting screen resolution. That may require amending the privacy policy (unless it can be considered covered by (1)).

We are not, and Legal has to check/approve these changes because we would likely want to collect more data than just screen resolution.

  • It means processing and storing identifying information from everyone in order to block a few vandals. Probably wouldn't be well-received.

I think it is a necessary evil, but it is not for me to decide. We are just adding a few things to what we already have in CheckUser.

  • Even if we hash it, it seems it would be easy to recover the original data.
  • The block would act as an oracle that could leak the identifiable information being used. If John was blocked with this option on a range I can connect through, I might try several common profiles and find out, e.g., that he was using an iPhone.

Ideally we will use enough parameters to create sufficiently unique fingerprints, making it harder to "brute force" the hash.

TBolliger moved this task from In progress to Done on the Anti-Harassment (AHT Sprint 16) board.

Will continue this in other meetings and documentation. A lot of product definition required here, and we need to check with Legal.

Thank you for your work on this investigation Dayllan!

Ideally we will use enough parameters to create sufficiently unique fingerprints, making it harder to "brute force" the hash.

In my opinion, there's a problem with that: the more parameters we use for creating the hash, the easier it is to circumvent.

Ideally we will use enough parameters to create sufficiently unique fingerprints, making it harder to "brute force" the hash.

In my opinion, there's a problem with that: the more parameters we use for creating the hash, the easier it is to circumvent.

How so? The hash will be a combination of all parameters. In other words, there is an AND relationship between the params, not an OR relationship.

If you change just one of these many parameters, the block isn't triggered anymore, as the resulting hash is different. The more parameters we hash, the easier it gets to find one that can be easily changed.
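That evasion point can be shown concretely: because all parameters are hashed together (an AND relationship), changing any single one produces an unrelated digest. A minimal sketch with hypothetical parameter values:

```python
import hashlib

def fp(*params: str) -> str:
    # Hash the concatenation of all parameters: an AND relationship,
    # so every field must match for the stored digest to match.
    return hashlib.sha256("|".join(params).encode("utf-8")).hexdigest()

# Digest of the blocked device.
blocked = fp("Firefox 58", "Windows 10", "1920x1080", "en-US")

# Changing any one parameter (here the browser language) yields an
# unrelated digest, so the device block no longer matches.
evaded = fp("Firefox 58", "Windows 10", "1920x1080", "en-GB")
```

The trade-off is inherent: each extra parameter makes the fingerprint more distinctive but also adds one more knob a determined vandal can turn.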

Ideally we will use enough parameters to create sufficiently unique fingerprints, making it harder to "brute force" the hash.

Late comment, but I'm doubtful of this. If we do this, the hashes should probably be kept secret (or at least it should be an HMAC with a secret key).
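A keyed hash along those lines can be sketched with Python's standard-library `hmac` module; the key handling here is purely illustrative, since a real deployment would use a persistent server-side configuration secret:

```python
import hashlib
import hmac
import os

# Illustrative only: in practice this would be a persistent secret
# loaded from server configuration, never regenerated per process.
SECRET_KEY = os.urandom(32)

def keyed_fingerprint(user_agent: str, screen_resolution: str) -> str:
    # Without the key, an attacker cannot recompute candidate digests,
    # which closes off the profile-enumeration attack discussed above.
    msg = f"{user_agent}|{screen_resolution}".encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()
```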