
Draft: Masking IP addresses for increased privacy
Open, Needs Triage, Public

Description

This task represents the beginning of a discussion and not a decision of what work anyone will do. There is much more research and community consultation to do. This task exists to think through some of the technical implications even as the requirements will almost certainly change.

Problem

MediaWiki exposes our anonymous users' IP addresses to the public. This isn't a good privacy practice and should be avoided if possible. The IP address can reveal information about a user (like their approximate location) that makes MediaWiki less safe for our users.

Proposed Solution

The IP address of anonymous users should be hashed (with a key) to prevent unprivileged users from learning the IP of that user. Of course, this would be a major shift in the way that MediaWiki operates. It will require updating all of the tools in core, extensions, and external tools to use the hash rather than the IP.

There are many pieces of identification that could be used to associate multiple actions with the same anonymous user: IP address, user agent, session, etc. (or a combination of any of them). To ease the transition away from IP addresses on a technical and social level, the identification should continue to be tied directly to the IP address. This should reduce the burden on developers and functionaries who rely on this behavior. However, this could be changed in the future.

The phases listed below may be separated by hours, days, weeks, months, or even years.

Phase I (Prep)
NOTE: This phase has the most work for DBA.

Thankfully, most of the IP addresses used in MediaWiki have already been moved into the actor table. To facilitate the changes, a new column will be added to the actor table:

+------------+---------------------+------+-----+---------+----------------+
| Field      | Type                | Null | Key | Default | Extra          |
+------------+---------------------+------+-----+---------+----------------+
| actor_mask | varbinary(255)      | NO   | UNI | NULL    |                |
+------------+---------------------+------+-----+---------+----------------+

(The field could have a different name; the author of this task has no preference.)
This field should be exposed to database dumps and the Toolforge replicas.

Phase II (Hashing)

MediaWiki will be updated to generate a hash of the IP address when an IP is inserted into the actor table:

$mask = hash_hmac( 'sha1', '127.0.0.1', 'UNSAFEKEY' );

(The hashing algorithm may change; the author of this task has no preference, but it must be cryptographically secure to prevent reverse engineering.)

A maintenance script will be created to back-fill masks for the existing IP addresses in the actor table.

The key is unsafe because, even if the IP addresses were removed from the database dumps and Toolforge replicas, a user could use existing public database dumps to build an exhaustive list of IP address masks. Therefore, the masks added to the database should not be treated as private.
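To make that risk concrete, here is a small sketch in Python (MediaWiki itself would use PHP's hash_hmac as shown above; the helper name and the sample addresses are illustrative only):

```python
import hashlib
import hmac

def mask_ip(ip: str, key: bytes) -> str:
    """Deterministic HMAC mask: the same IP and key always yield the same hash."""
    return hmac.new(key, ip.encode(), hashlib.sha1).hexdigest()

KEY = b'UNSAFEKEY'

# Determinism is what lets the mask serve as a stable join key in the actor table.
assert mask_ip('127.0.0.1', KEY) == mask_ip('127.0.0.1', KEY)

# But if the key is known (or leaked), the whole IPv4 space can be enumerated
# and any mask inverted by table lookup -- shown here for a single /24.
lookup = {mask_ip(f'203.0.113.{i}', KEY): f'203.0.113.{i}' for i in range(256)}
observed_mask = mask_ip('203.0.113.42', KEY)
print(lookup[observed_mask])  # recovers '203.0.113.42'
```

Scaling the same dictionary to all ~4.3 billion IPv4 addresses is cheap, which is why masks computed with a public or guessable key must not be treated as private.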

Phase III (Reveal Mask)

Anywhere IP addresses are displayed, the mask should (in some way) be displayed or accessible to users. This will allow users to start using masks rather than IP addresses wherever masks can be used.

A Special Page (and API) should be created for users with the block right (or a new right) to get information about the IP address, without actually revealing the IP address.

For instance, if a user were to input a mask like abcdef they might get an answer back like:

Organization: Charter Communications, Inc (CC-3518)
CIDR: a36b5d/15

The CIDR would be a masked CIDR that could be used for blocking, etc. Whenever this masked CIDR is shown to a user, it should be inserted into the actor table so it can be retrieved later.

Also, it should be possible to input two (or more?) masks, like abcdef and 1f1f1f and get information about both of them together without having to reveal the IP:

City: Not Same
State: Same
Country: Same
CIDR: a5b6d7/15

The CIDR would be a range wide enough to include both IPs. Like before, the mask (upon generation) would be inserted into the actor table for later reference.
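The "range wide enough to include both IPs" is the smallest CIDR block covering both addresses, which the server can compute before masking it. A sketch in Python (the function name is illustrative and not part of any proposed API; both inputs must be the same address family):

```python
import ipaddress

def covering_cidr(ip_a: str, ip_b: str) -> ipaddress.IPv4Network:
    """Smallest CIDR block that contains both addresses."""
    addr_a = ipaddress.ip_address(ip_a)
    a, b = int(addr_a), int(ipaddress.ip_address(ip_b))
    max_bits = addr_a.max_prefixlen  # 32 for IPv4, 128 for IPv6

    # Shorten the prefix until both addresses fall in the same block.
    prefix = max_bits
    while (a >> (max_bits - prefix)) != (b >> (max_bits - prefix)):
        prefix -= 1

    # Zero the host bits to get the network address of that block.
    network = (a >> (max_bits - prefix)) << (max_bits - prefix)
    return ipaddress.ip_network((network, prefix))

print(covering_cidr('192.0.2.17', '192.0.2.200'))  # 192.0.2.0/24
```

Only this covering range (suitably masked) would be shown to the admin; the underlying addresses never leave the server.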

This should allow admins to make decisions about vandalism, without having to actually reveal the IP address itself.

Phase IV (Accept Mask)
NOTE: This phase consists of a lot of work for developers across the organization.

Existing systems that do not use the actor table should be updated to use it.

Anywhere IP addresses are used for input or output should accept either an IP address or a mask. Whatever type is given should be the type returned.

Tasks:
T228077: Migrate ipblocks to actor table

Phase V (Default to Mask)

Now that all of the available tools accept a mask rather than an IP address, the display in MediaWiki (and its APIs) should default to the mask rather than the IP address. The IP address should still be retrievable in case it is needed during the transition, but its use should be highly discouraged, and any use should be documented so it can be resolved.

Phase VI (Reject IPs)

Inputs (pages, tools, APIs, etc.) should now reject IP address input as invalid. This will prevent the input from accidentally revealing the IP address behind the mask.

Phase VII (Remove IPs)

The actor.actor_name column should be removed from database dumps and from Toolforge replicas. MediaWiki should refuse to display the contents of the field to anyone (unless they have signed a non-disclosure agreement).

Phase VIII (The Big Switch)
NOTE: This phase will have the most work for functionaries.

Now that the IP addresses are inaccessible and users are using masks rather than IPs, the final phase is to change the hash key to a secret key:

$mask = hash_hmac( 'sha1', '127.0.0.1', 'SUPERSECRETKEYUNIQUETOTHISIP' );

Doing this will break the edit history of an IP address: before the switch your mask could be abcd, and after the switch it could be e4f5. This is necessary to actually protect the IP address from being reverse-engineered from the existing publicly available database dumps.

After this switch, it will be impossible to get the IP address from the mask without CheckUser privileges.

Since the IPs will continue to exist in the database, blocks before the switch (IPs or masks) will continue to be enforced.

It would be best if the key were unique to each IP address to make re-identification more difficult.
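A per-IP key need not mean storing a separate key for every address. One approach (an assumption about how this could be done; the task does not specify the derivation) is to derive each IP's key from a single server-side master secret. Sketched in Python; MediaWiki itself would use PHP's hash_hmac:

```python
import hashlib
import hmac

# Hypothetical server-side secret; never published or included in dumps.
MASTER_SECRET = b'example master secret'

def per_ip_key(ip: str) -> bytes:
    """Derive a distinct key per IP from the single master secret (HKDF-like)."""
    return hmac.new(MASTER_SECRET, b'mask-key:' + ip.encode(), hashlib.sha256).digest()

def mask_ip(ip: str) -> str:
    """Mask an IP with its own derived key."""
    return hmac.new(per_ip_key(ip), ip.encode(), hashlib.sha1).hexdigest()

# Each IP gets an unrelated key, but the server stores only MASTER_SECRET.
assert per_ip_key('203.0.113.1') != per_ip_key('203.0.113.2')
assert mask_ip('203.0.113.1') == mask_ip('203.0.113.1')  # still deterministic
```

Note that to an outside attacker who knows none of the keys, this is indistinguishable from a single-key HMAC; its practical benefit is limiting the damage if one derived key leaks.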

Possible Future Enhancements

After this process is complete, there are many enhancements that could be made to further increase the privacy and safety of our users. The following is a non-exhaustive list of examples.

Switch Hash Key (again)

Technically, the hash key (changed in Phase VIII (The Big Switch)) could be changed as frequently as we want (daily, monthly, yearly, etc.). Doing this makes identification of the IPs more difficult, but also breaks the revision history.

Some sort of rolling scheme could also be introduced. For instance, perhaps the key is unique *per-IP*, and perhaps that key could change if there haven't been edits from that IP within 30 (?) days.

Remove IPs from Database

The actor.actor_name column could be removed from the database completely. Tools (like CheckUser) that currently return an IP address could instead return a mask. However, this will take a lot of work, because tools like the Mask Info special page described in Phase III (Reveal Mask) won't work: there will be no way to run a WHOIS lookup on a mask. Other types of abuse mitigation will need to be implemented before this becomes a possibility.

Base Mask on Session Instead of IP

Once everything has been switched to using masks instead of IPs, and those masks have no inherent meaning, it would be possible to change what the mask represents. For instance, we could change the mask to be encoded like this:

$mask = hash_hmac( 'sha1', 'SESSION_ID', 'SUPERSECRETKEY' );

Where SESSION_ID is the user's session ID. This would create a new mask every time the user's session was generated (i.e. for each new device, browser, etc.). This would, of course, break the social contract of what the mask represents, but it would be technically trivial to implement, as the masks would function identically to the IP masks.

Doing this may open the door to being able to send "private" notifications to anonymous users, or to "take over" edits made by your standard user account while accidentally logged out.


See also:

Event Timeline

dbarratt created this task. Jul 11 2019, 5:44 AM
Restricted Application added subscribers: MGChecker, Aklapper. · View Herald Transcript · Jul 11 2019, 5:44 AM
JJMC89 updated the task description. Jul 11 2019, 6:39 AM
JJMC89 added subscribers: MusikAnimal, JJMC89.

While this would be great for privacy, it would make counter-vandalism efforts more difficult. See @MusikAnimal's comments starting at T133452#4313877.

aezell added a subscriber: aezell. Jul 11 2019, 3:07 PM

This is a good collection of ideas.

However, there's a lot of research to be done with the communities around how we can continue to enable counter-vandalism and support changing data privacy standards.

So, we can discuss the technical merits of this proposal or its constituent parts but there are likely new requirements that will be discovered that may make some aspects of this untenable.

In short, this should be considered a WIP and not a formal proposal for THE solution.

This change could actually reduce the privacy of registered users. If "privileged user" means CheckUser, this will require massively increasing the number of CheckUsers needed to deal with routine day-to-day vandalism. A compromised (or rogue) CheckUser account is a disaster, so the privacy of registered users depends on keeping the number of such accounts as low as possible.

The only way this could be workable, from an anti-vandalism perspective, would be if the "reveal IP of masked anon" user right were separate from the CheckUser and probably even the admin toolkit. It would have to be given out as freely as, say, rollback rights on enwiki. But then it would only be a marginal improvement in anon privacy; sooner or later some LTA would gain such rights, run a bot to unmask every anon in the past year, and post the results on one of the Wikipedia-hate sites. So why bother?

As far as the "Possible Future Enhancements", you might as well rename the place Vandalpedia at the same time. There'd be no way whatsoever of stopping users with highly dynamic IPs (i.e. every mobile phone user in the world) from doing what they please, when they please, forever.

I do appreciate the efforts to increase user privacy. I certainly wish my IP wasn't logged anywhere. But I'm not sure the problem here ("allow true anonymity without being overrun by assholes") has been solved by anyone yet.

aezell renamed this task from MediaWiki should mask IP addresses for anonymous users to WIP: Masking IP addresses for increased privacy. Jul 11 2019, 4:31 PM
aezell updated the task description.
dbarratt renamed this task from WIP: Masking IP addresses for increased privacy to Draft: Masking IP addresses for increased privacy. Jul 15 2019, 5:42 PM
dbarratt updated the task description.
dbarratt updated the task description.
dbarratt updated the task description. Jul 15 2019, 5:50 PM

While this would be great for privacy, it would make counter-vandalism efforts more difficult. See @MusikAnimal's comments starting at T133452#4313877.

What I've described in this task is, I believe, quite different from what is described in T133452. Instead of automatically creating a new account for anonymous users, the user's IP address would still be used, but it would be "masked" from view. There would be a 1:1 relationship between masks and IP addresses.

This change could actually reduce the privacy of registered users. If "privileged user" means CheckUser, this will require massively increasing the number of CheckUsers needed to deal with routine day-to-day vandalism. A compromised (or rogue) CheckUser account is a disaster, so the privacy of registered users depends on keeping the number of such accounts as low as possible.

I'm not sure I understand why the number of CheckUsers would need to be increased. I've added some details in Phase III (Reveal Mask). I think we could show non-CheckUser admins information about the IP without revealing the IP. Would that work? What kinds of information would we need to reveal?

The only way this could be workable, from an anti-vandalism perspective, would be if the "reveal IP of masked anon" user right were separate from the CheckUser and probably even the admin toolkit. It would have to be given out as freely as, say, rollback rights on enwiki. But then it would only be a marginal improvement in anon privacy; sooner or later some LTA would gain such rights, run a bot to unmask every anon in the past year, and post the results on one of the Wikipedia-hate sites. So why bother?

I'm wondering if revealing information about the IP, rather than the IP itself, would be a better solution? What kinds of information do admins gather from IP addresses? Is there perhaps a "safe" way to expose this information to admins who have blocking privileges?

As far as the "Possible Future Enhancements", you might as well rename the place Vandalpedia at the same time. There'd be no way whatosever of stopping users with highly dynamic IPs (I.E. every mobile phone user in the world) from doing what they please, when they please, forever.

I've added some details about that to the description. You're right, there would need to be additional anti-abuse mechanisms put in place before this would work. I added this section, because I wanted to express that masking the IP address is a step towards potentially doing something like T133452. In other words, it's not an "either or" proposition.

From a Wikipedia anti-vandal perspective, I suspect the hardest sell would be not being able to see patterns related to ranges/ip-distance, at a glance.

I think there would be a high risk of re-identification in this scheme. There is a large data set (history pages) of which IPs edit Wikipedia logged out, and at what frequency. IP addresses (especially IPv4) are a relatively small space of values with a very non-flat histogram. This seems to me to be the perfect setup for a frequency-based attack. And that's before even taking into account auxiliary data that will be associated with these masks: editing activity is often timezone-based, all edits have fine-grained timestamps, and the subject matter of edits is often highly correlated with geographic region. At the very least, I think we would need to do a lot of study to be convinced that this scheme is safe.

It's also not clear to me who has access to the additional (but not IP) info for the mask in this proposal. Admins? Everyone? Is it considered sensitive information? Will it be logged and controlled in the same way as CheckUser is? How are the masked CIDR ranges generated? It seems like the masked CIDR ranges would have a lot of potential to reveal additional nearby masks if you happen to know at least one mask (and, at the very least, you already know your own).

From what I understand, the proposal is that the masks stay constant over time (except for one jump when IP addresses are disabled). This also means as time goes on any info that gets leaked reduces the privacy of the system forever.

I think it would be instructive to study the literature on encrypted databases, and especially how such schemes fail. For example, consider section 5 of http://cs.brown.edu/people/seny/pubs/edb.pdf

Where SESSION_ID is the user's session ID. This would create a new mask every time the user's session was generated (i.e. for each new device, browser, etc.). This would, of course, break the social contract of what the mask represents, but it would be technically trivial to implement, as the masks would function identically to the IP masks.

It should of course be explicit that it's not just each new device/browser, but also each time somebody clears cookies. IP addresses are usually used as an identifier because there is friction to getting new ones. It's of course possible with VPNs or whatever, but it's a lot harder than just opening up a new incognito window in your browser.

dbarratt changed the task status from Open to Stalled. Jul 17 2019, 10:00 PM

@dbarratt: As you set the task status to stalled, who exactly / specifically are you waiting for for further input?

@dbarratt: As you set the task status to stalled, who exactly / specifically are you waiting for for further input?

This is still a draft, so anyone at this time. :) I wanted to indicate that this is not ready to be worked on. Is there a better way to indicate that?

Aklapper changed the task status from Stalled to Open. Jul 18 2019, 12:03 AM

You might mix up that a task itself is stalled (I do not see that here) vs that implementing what's proposed/discussed is stalled. Maybe Proposal covers that?

You might mix up that a task itself is stalled (I do not see that here) vs that implementing what's proposed/discussed is stalled. Maybe Proposal covers that?

Ah! Thanks! :)

As discussed in the IP address masking product development meetings last year with @kaldari and @DannyH, I don't think it is appropriate to change actor names to a hashed IP address, as you propose for phase II, when we can't practically change the IP addresses in page content, for example in talk page signatures.

As discussed in the IP address masking product development meetings last year with @kaldari and @DannyH, I don't think it is appropriate to change actor names to a hashed IP address, as you propose for phase II, when we can't practically change the IP addresses in page content, for example in talk page signatures.

I do not believe there is a perfect solution, only less-terrible ones. If we are searching for a perfect solution, I think we will be searching forever.

dbarratt updated the task description. Jul 18 2019, 5:05 AM

I think there would be a high risk of re-identification in this scheme.

I certainly do not believe it's perfect. I've updated the description to include a few extra safety measures we could take (like making the key unique to each IP address, etc.). I think in the future we could move towards safer systems, but anti-vandalism tools would need to be invented to deal with the increase in vandalism.

It's also not clear to me who has access to the additional (but not IP) info for the mask in this proposal. Admins? Everyone?

I think anyone who has the block right would need access to information about the IP. Otherwise, you don't need this data.

Is it considered sensitive information?

I'm not sure if it is or if it isn't. I think that is yet to be determined. I think we should only provide what is absolutely needed and in as safe a way as possible. I realize that the "safest" solution would be no information at all, but that doesn't seem realistic given our current anti-abuse tools. I also don't think it's realistic to have every admin sign an NDA (CheckUser) so finding the proper balance is crucial.

Will it be logged and controlled in the same way as checkuser is?

I think that would be wise.

How are the masked CIDR ranges generated?

That comes from the WHOIS database. Or rather, it's just the base IP address with whatever the /NUM is from WHOIS. As far as I understand it, if you are on a residential IP (as you can see from the Organization) then that is the possible range of your IP address.

It seems like the masked CIDR ranges would have a lot of potential to reveal additional nearby masks if you happen to know at least one mask (and, at the very least, you already know your own).

What do you mean by "nearby"? Do you mean if my IP is 127.0.0.1 I might be able to get the mask for 127.0.0.2? I suppose you could get the mask for the IPs in your own range (if you have access to this tool), but that range would also only reveal to you people who already live really close to you anyway... ?

From what I understand, the proposal is that the masks stay constant over time (except for one jump when IP addresses are disabled). This also means as time goes on any info that gets leaked reduces the privacy of the system forever.

I added some details on this; I don't see a reason why they couldn't change multiple times (but realizing the downside, which is that when they change, the edit history breaks).

I think it would be instructive to study the literature on encrypted databases, and especially how such schemes fail. For example, consider section 5 of http://cs.brown.edu/people/seny/pubs/edb.pdf

This is over my head. :) Is there a resource that is for encrypted database laymen? :)

I don't think it is appropriate to change actor names to a hashed IP address, as you propose for phase II, when we can't practically change the IP addresses in page content, for example in talk page signatures.

To more directly answer your question: after Phase VIII (The Big Switch), the IP address that was associated with those edits (either their own, or someone talking about them) will no longer be associated with them from that point forward. This means we are not able to retroactively redact IP addresses. And how could we? As far as I know, they are all over our database dumps and Toolforge replicas. This proposal is about masking them from Phase VIII (The Big Switch) forward, and providing a path to get to that point safely so they are not revealed after that point.

revi awarded a token. Aug 3 2019, 9:11 AM

It is pointless to discuss these later phases in detail before consensus on how exactly IPs may be encrypted, and whether they should be at all. If the address encryption is expected to pulverize each /64 IPv6 block—leaving the end user without an option to see all edits from the same 2hhh:hhhh:hhhh:hhhh: prefix (with possible differences in the lowest 64 bits)—then the proposal should be dismissed, and the author should be assigned the homework of learning actual IP-editing practices.

Stryn awarded a token. Aug 3 2019, 5:57 PM

It'd be easier to ban anonymous editors from editing altogether. Creating an account is free and very simple. I guess the option is not going to be very popular, though.

Bawolff added a comment (edited). Aug 3 2019, 8:21 PM

I do not believe there is a perfect solution, only less-terrible ones. If we are searching for a perfect solution, I think we will be searching forever.

I don't think things need to be perfect to be useful. However, this scheme comes with a cost (in terms of anti-vandalism). We need to know what level of privacy it provides, and at what cost. If we don't know the benefits and the costs, we can't do a cost/benefit analysis.

I also don't think it's a given that any scheme is "less-terrible" than the status quo. For example, a scheme where it was very easy to reverse the masking might be worse than the status quo, since it could lure (anon) users into a false sense of security and cause them to engage in riskier behaviour, thinking their address is hidden when it's not. To be clear, I'm not saying that this scheme is worse than the status quo, just that it's not a given that it isn't.

I think there would be a high risk of re-identification in this scheme.

I certainly do not believe it's perfect. I've updated the description to include a few extra safety measures we could take (like making the key unique to each IP address, etc.). I think in the future we could move towards safer systems, but anti-vandalism tools would need to be invented to deal with the increase in vandalism.

I'm not sure that making the HMAC key unique to each IP changes much. An attacker without knowledge of the key(s) should not be able to tell the difference if the hash function is secure (i.e. is a PRF).

[..]

How are the masked CIDR ranges generated?

That comes from the WHOIS database. Or rather, it's just the base IP address with whatever the /NUM is from WHOIS. As far as I understand it, if you are on a residential IP (as you can see from the Organization) then that is the possible range of your IP address.

It seems like the masked CIDR ranges would have a lot of potential to reveal additional nearby masks if you happen to know at least one mask (and, at the very least, you already know your own).

What do you mean by "nearby"? Do you mean if my IP is 127.0.0.1 I might be able to get the mask for 127.0.0.2? I suppose you could get the mask for the IPs in your own range (if you have access to this tool), but that range would also only reveal to you people who already live really close to you anyway... ?

By near-by I mean people in the same subnet (roughly)

Sure, but if the hashes stay constant over time you can start to build up a list, especially if you combine information from friends or use proxies placed around the world.

From what I understand, the proposal is that the masks stay constant over time (except for one jump when IP addresses are disabled). This also means as time goes on any info that gets leaked reduces the privacy of the system forever.

I added some details on this; I don't see a reason why they couldn't change multiple times (but realizing the downside, which is that when they change, the edit history breaks).

I think it would be instructive to study the literature on encrypted databases, and especially how such schemes fail. For example, consider section 5 of http://cs.brown.edu/people/seny/pubs/edb.pdf

This is over my head. :) Is there a resource that is for encrypted database laymen? :)

It's a bit of a niche area, probably because (AFAIK) nobody has actually made a usable & secure one. But basically, the challenge is that people want to outsource their DB to the cloud, but don't trust the cloud provider with their data. There are various schemes (many (most?) of which have weaknesses) to encrypt the data in such a way that you can still do normal DB operations (e.g., depending on the scheme: equality comparisons, range comparisons, LIKE queries, SUM(), etc.) without revealing the underlying data. The hope is that you could then use cloud-provided DBs while still keeping your data secret.

Although it's not the same thing, it's kind of similar to what is being proposed, as you want to hide the IP but still preserve certain properties of the data (or at least allow such properties to be computed/revealed).

One of the commonly proposed schemes, if you only need equality comparison, is deterministic encryption (abbreviated as DTE in the paper). If you don't have the private key, deterministic encryption basically has the same properties as the HMAC construction you proposed. The linked paper discusses some attacks on encrypted DBs, using medical information as an example. It mostly concerns itself with systems that allow range queries, but section 5 talks about deterministic encryption. It describes two attacks. The first (frequency) is pretty straightforward; applied to this bug, it would be something along the lines of: take all the IP edits from the last month before switching to the hashed-IP scheme. The IP that edited the most is probably the same as the hashed IP that edited the most in the next month. Then take the second most common IP, and so on.

Whether or not this is effective depends on how anon edits are distributed among different IPs; the less flat the distribution, the better for the attack. It also depends on how constant IP frequency is over time. I have no idea how IP editing on Wikipedia fits these properties, so I'm unsure how bad this attack is in practice. The second attack (Lp) similarly looks at the histogram of values; it has the benefit of being able to combine multiple data sources more easily. See the paper for details.
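The frequency attack described above can be sketched with toy data (all edit counts, IPs, and mask strings here are made up purely for illustration):

```python
from collections import Counter

# Observed IP edit counts before the switch (from public history), and masked
# edit counts after the switch -- hypothetical numbers for illustration.
before = Counter({'198.51.100.7': 90, '203.0.113.5': 40, '192.0.2.9': 12})
after = Counter({'a9f3': 85, '41bc': 38, '7e02': 15})

# Rank both histograms by frequency and align them: the most active mask is
# guessed to be the previously most active IP, and so on down the ranking.
guess = dict(zip(
    (mask for mask, _ in after.most_common()),
    (ip for ip, _ in before.most_common()),
))
print(guess['a9f3'])  # -> '198.51.100.7'
```

The attack's success rate depends entirely on how skewed and stable the real edit-frequency distribution is, which is exactly the open question raised here.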


I should mention, out of curiosity, I tried a naive version of this attack with some data from enwiki. It did not really work on the data set I tried.

Risker added a subscriber: Risker. Aug 4 2019, 4:36 AM
SQL added a subscriber: SQL. Aug 5 2019, 2:16 AM
Alsee added a subscriber: Alsee. Aug 5 2019, 2:03 PM

First, a nitpick: problem behavior is a far larger category than "vandalism". For example, if someone is a partisan political warrior and they persistently put inappropriate negative content into the biographies of politicians of a particular political party, they are not a vandal. However, they do need to be blocked.

It looks like this proposal largely obliterates our ability to deal with IP ranges. I'm not an admin and I generally only deal with vandalism when it falls into my lap during other work, but when I'm concerned about an edit by IP 1.2.3.4 I might view Special:Contributions/1.2.3.* and review all edits from that range. Am I dealing with a single suspicious edit in an otherwise good range? Or do I need to revert every recent edit made from the entire range and request a rangeblock? And as I noted above, it's quite possible that none of the edits being reverted are vandalism. Anyone else looking at one of those edits, without knowing the IP-range-behavior, might not spot why there is a problem-pattern with the collection of edits.

The motivations for masking IP addresses are good, but the consequences appear to be rather ugly.

Titore added a subscriber: Titore. Sat, Aug 24, 10:24 AM