This task represents the beginning of a discussion and not a decision of what work anyone will do. There is much more research and community consultation to do. This task exists to think through some of the technical implications even as the requirements will almost certainly change.
MediaWiki exposes our anonymous users' IP addresses to the public. This isn't a good privacy practice and should be avoided if possible. An IP address can reveal information about a user (like their approximate location), which makes MediaWiki less safe for our users.
The IP address of each anonymous user should be hashed (with a key) so that unprivileged users cannot learn it. Of course, this would be a major shift in the way that MediaWiki operates. It will require updating core, extensions, and external tools to use the hash rather than the IP.
There are many pieces of identification that could be used to attribute multiple actions to the same anonymous user: IP address, user agent, session, etc. (or a combination of any of them). To ease the transition away from IP addresses on a technical and social level, the identification should continue to be tied directly to the IP address. This should reduce the burden on developers and functionaries who rely on this behavior. However, this could be changed in the future.
The phases listed below may be separated by hours, days, weeks, months, or even years.
Phase I (Prep)
Thankfully, most of the IP addresses being used in MediaWiki have already been moved into the actor table. To facilitate the changes, a new column will be added to the actor table:
+------------+----------------+------+-----+---------+-------+
| Field      | Type           | Null | Key | Default | Extra |
+------------+----------------+------+-----+---------+-------+
| actor_mask | varbinary(255) | NO   | UNI | NULL    |       |
+------------+----------------+------+-----+---------+-------+
(The field can have a different name; the author of this task has no preference.)
This field should be exposed to database dumps and the Toolforge replicas.
Phase II (Hashing)
MediaWiki will be updated to generate a hash of the IP address when an IP is inserted into the actor table:
$mask = hash_hmac( 'sha1', '127.0.0.1', 'UNSAFEKEY' );
(The hashing algorithm may change; the author of this task has no preference, but it must be cryptographically secure to prevent reverse-engineering.)
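As a minimal sketch of the hashing step, the helper below wraps hash_hmac() in a function. The name ipToMask, the default of SHA-256, and the choice to hash the packed bytes from inet_pton() (rather than the string form used in the one-liner above) are assumptions made for illustration, not part of this proposal.

```php
<?php
// Hypothetical helper: derive a mask from an IP address with a keyed hash.
function ipToMask( string $ip, string $key, string $algo = 'sha256' ): string {
    // inet_pton() returns false (and warns) for strings that are not
    // valid IPv4 or IPv6 addresses; the warning is suppressed here.
    $packed = @inet_pton( $ip );
    if ( $packed === false ) {
        throw new InvalidArgumentException( "Not a valid IP address: $ip" );
    }
    // Assumption: hash the packed bytes so that equivalent spellings of
    // one address ("::1" and "0:0:0:0:0:0:0:1") yield the same mask.
    return hash_hmac( $algo, $packed, $key );
}
```

Hashing a normalized form matters mostly for IPv6, where one address has many textual spellings; without it, the same user could end up with several masks.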
A maintenance script will be created to back-fill masks for the existing IP addresses in the actor table.
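The core of such a back-fill could look roughly like the sketch below. It works on plain arrays instead of a real database connection, and it assumes that anonymous actors are rows with a NULL actor_user whose actor_name is a valid IP address. The function name backfillMasks and the row layout are hypothetical.

```php
<?php
// Hypothetical back-fill over in-memory rows standing in for the actor table.
function backfillMasks( array $rows, string $key ): array {
    foreach ( $rows as &$row ) {
        // Assumption: anonymous actors have no user ID and an IP as name.
        $isAnon = $row['actor_user'] === null
            && filter_var( $row['actor_name'], FILTER_VALIDATE_IP ) !== false;
        $needsMask = ( $row['actor_mask'] ?? null ) === null;
        if ( $isAnon && $needsMask ) {
            $row['actor_mask'] = hash_hmac( 'sha1', $row['actor_name'], $key );
        }
    }
    unset( $row ); // break the reference left by foreach
    return $rows;
}
```

A real maintenance script would also need to process rows in batches and wait for replication between batches; that is omitted here.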
The key is unsafe because, even if the IP addresses were removed from the database dumps and Toolforge replicas, a user could use existing public database dumps to build an exhaustive list of IP address masks. Therefore, the masks added to the database should not be treated as private.
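To illustrate why these masks cannot be treated as private, here is a sketch of the correlation described above: joining an older dump that still contains IPs with a newer dump that contains only masks. It assumes actor IDs stay stable between dumps; the function name correlateDumps and the array shapes are hypothetical.

```php
<?php
// Hypothetical correlation of two dumps keyed by actor ID.
// $oldDump: actor_id => IP   (published before IPs were removed)
// $newDump: actor_id => mask (published after IPs were removed)
function correlateDumps( array $oldDump, array $newDump ): array {
    $maskToIp = [];
    foreach ( $newDump as $actorId => $mask ) {
        if ( isset( $oldDump[$actorId] ) ) {
            // The same actor appears in both dumps, so the mask is unmasked.
            $maskToIp[$mask] = $oldDump[$actorId];
        }
    }
    return $maskToIp;
}
```

No knowledge of the key is needed for this, which is why the key used in this phase offers no real protection for IPs that already appear in published dumps.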
Phase III (Reveal Mask)
Anywhere IP addresses are being displayed, the mask should (in some way) be displayed (or accessible) to the users. This will allow users to start using masks rather than IP addresses where they can be used.
A Special Page (and API) should be created for users with the block right (or a new right) to get information about the IP address, without actually revealing the IP address.
For instance, if a user were to input a mask like abcdef they might get an answer back like:
Organization: Charter Communications, Inc (CC-3518)
CIDR: a36b5d/15
The CIDR would be a masked CIDR that could be used for blocking, etc. Whenever this mask is shown to a user, the mask should be inserted into the actor table so it can be retrieved later.
Also, it should be possible to input two (or more?) masks, like abcdef and 1f1f1f and get information about both of them together without having to reveal the IP:
City: Not Same
State: Same
Country: Same
CIDR: a5b6d7/15
The CIDR would be a range wide enough to include both IPs. Like before, the mask (upon generation) would be inserted into the actor table for later reference.
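A rough sketch of how a range covering both IPs might be computed and then masked follows. It handles IPv4 only and assumes 64-bit PHP integers. The names coveringCidr and maskedCidr, and the output format of hashing the network address and appending the prefix length, are assumptions for illustration.

```php
<?php
// Hypothetical: smallest IPv4 CIDR block containing both addresses.
function coveringCidr( string $ipA, string $ipB ): string {
    $a = ip2long( $ipA );
    $b = ip2long( $ipB );
    if ( $a === false || $b === false ) {
        throw new InvalidArgumentException( 'Both inputs must be IPv4 addresses' );
    }
    // Every bit position up to the highest differing bit must be a host bit.
    $diff = $a ^ $b;
    $prefix = 32;
    while ( $diff !== 0 ) {
        $diff >>= 1;
        $prefix--;
    }
    $netmask = ( 0xFFFFFFFF << ( 32 - $prefix ) ) & 0xFFFFFFFF;
    return long2ip( $a & $netmask ) . '/' . $prefix;
}

// Hypothetical: hide the network address but keep the prefix length visible.
function maskedCidr( string $cidr, string $key ): string {
    [ $network, $prefix ] = explode( '/', $cidr );
    return hash_hmac( 'sha1', $network, $key ) . '/' . $prefix;
}
```

For example, coveringCidr( '192.0.2.10', '192.0.2.200' ) returns 192.0.2.0/24, and maskedCidr() turns that into a hash followed by /24. Note that the prefix length stays visible, so it still reveals how far apart the two addresses are.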
This should allow admins to make decisions about vandalism, without having to actually reveal the IP address itself.
Phase IV (Accept Mask)
Existing systems that do not use the actor table should be updated to use it.
Anywhere an IP address is used for input or output, either an IP address OR a mask should be accepted. Whichever type is given should be the type that is returned.
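One way to picture the "accept either, return the same type" rule is the sketch below. It assumes masks are 40 lowercase hex characters (the output of a SHA-1 HMAC, as in the Phase II example); the function names and the shape of the $actor array are hypothetical.

```php
<?php
// Hypothetical: decide whether an input string is an IP address or a mask.
function classifyTarget( string $input ): string {
    if ( filter_var( $input, FILTER_VALIDATE_IP ) !== false ) {
        return 'ip';
    }
    // Assumption: masks are 40 lowercase hex characters (SHA-1 HMAC output).
    if ( preg_match( '/^[0-9a-f]{40}$/', $input ) === 1 ) {
        return 'mask';
    }
    throw new InvalidArgumentException( "Neither an IP address nor a mask: $input" );
}

// Hypothetical: answer with the same kind of identifier that was given.
// $actor is assumed to look like [ 'ip' => '...', 'mask' => '...' ].
function identifierFor( string $input, array $actor ): string {
    return $actor[ classifyTarget( $input ) ];
}
```

A string of 40 hex characters is never a valid IP address, so the two input types cannot be confused under this assumption. In Phase VI the 'ip' branch would become a rejection instead.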
Phase V (Default to Mask)
Now that all of the available tools accept a mask rather than an IP address, the display in MediaWiki (and its APIs) should default to the mask, rather than the IP address. The IP address should still be retrievable in case it is needed during the transition, but its use should be highly discouraged, and any use should be documented so that it can be resolved.
Phase VI (Reject IPs)
Inputs (pages, tools, APIs, etc.) should now reject IP address input as invalid. This will prevent the input from accidentally revealing the IP address behind the mask.
Phase VII (Remove IPs)
The actor.actor_name column should be removed from database dumps and from Toolforge replicas. MediaWiki should refuse to display the contents of the field to anyone (unless they have signed a non-disclosure agreement).
Phase VIII (The Big Switch)
Now that the IP addresses are inaccessible and users are using masks rather than IPs, the final phase is to change the hash key to a secret key:
$mask = hash_hmac( 'sha1', '127.0.0.1', 'SUPERSECRETKEYUNIQUETOTHISIP' );
Doing this will break the edit history of an IP address. Before the switch, your mask could be abcd; after the switch, your mask could be e4f5. This is necessary to actually protect the IP address from being reverse-engineered from the existing publicly available database dumps.
After this switch, it will be impossible to get the IP address from the mask, without checkuser privileges.
Since the IPs will continue to exist in the database, blocks before the switch (IPs or masks) will continue to be enforced.
It would be best if the key were unique to each IP address to make re-identification more difficult.
Possible Future Enhancements
After this process is complete, there are many enhancements that could be made to further increase the privacy and safety of our users. The following is a non-exhaustive list of examples.
Switch Hash Key (again)
Technically, the hash key (changed in Phase VIII (The Big Switch)) could be changed as frequently as we want (daily, monthly, yearly, etc.). Doing this makes identification of the IPs more difficult, but also breaks the revision history.
Some sort of rolling scheme could also be introduced. For instance, perhaps the key is unique per IP, and perhaps that key could change if there haven't been edits from that IP within 30 (?) days.
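A rolling per-IP key could be sketched roughly as follows. The class RollingKeyStore, its in-memory storage, and the 30-day default are assumptions for illustration; a real implementation would need persistent, access-restricted storage for the keys.

```php
<?php
// Hypothetical in-memory store of per-IP keys that expire after inactivity.
final class RollingKeyStore {
    /** @var array<string, array{key: string, lastEdit: int}> */
    private array $entries = [];
    private int $ttlSeconds;

    public function __construct( int $ttlSeconds = 2592000 ) { // 30 days
        $this->ttlSeconds = $ttlSeconds;
    }

    // Return the mask for an edit from $ip at Unix time $now.
    public function maskFor( string $ip, int $now ): string {
        $entry = $this->entries[$ip] ?? null;
        if ( $entry === null || $now - $entry['lastEdit'] > $this->ttlSeconds ) {
            // No key yet, or no edits within the window: issue a fresh key.
            $entry = [ 'key' => random_bytes( 32 ), 'lastEdit' => $now ];
        }
        $entry['lastEdit'] = $now;
        $this->entries[$ip] = $entry;
        return hash_hmac( 'sha1', $ip, $entry['key'] );
    }
}
```

With this scheme, an IP that keeps editing keeps its mask, while an IP that goes quiet for longer than the window gets a new, unlinkable mask on its next edit. The trade-off is that the store itself maps IPs to keys, so it has to be protected at least as carefully as the IPs.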
Remove IPs from Database
The actor.actor_name column could be removed from the database completely. Tools (like CheckUser) that currently return an IP address could instead return a mask. However, this will take a lot of work, because tools like the Mask Info special page described in Phase III (Reveal Mask) won't work: there will be no way to run a WHOIS lookup on a mask. Other types of abuse mitigation will need to be implemented before this becomes a possibility.
Base mask on session instead of IP
Once everything has been switched to using masks instead of IPs, and those masks have no inherent meaning, it would be possible to change what the mask represents. For instance, we could change the mask to be encoded like this:
$mask = hash_hmac( 'sha1', 'SESSION_ID', 'SUPERSECRETKEY' );
Here, SESSION_ID is the user's session ID. This would create a new mask every time a session is generated for the user (i.e. for each new device, browser, etc.). This would, of course, break the social contract of what the mask represents, but would be technically trivial to implement, as the masks would function identically to the IP masks.
Doing this may open the door to sending "private" notifications to anonymous users, or to letting a standard user account "take over" edits that were made while accidentally logged out.