MediaWiki exposes the IP addresses of anonymous users to the public. This is not a good privacy practice and should be avoided if possible. An IP address can reveal information about a user (such as their approximate location), which makes MediaWiki less safe for our users.
#### Proposed Solution
The IP address of each anonymous user should be hashed (with a key) to prevent unprivileged users from learning that user's IP. Of course, this would be a major shift in the way that MediaWiki operates. It will require updating all of the tools in core, extensions, and external tools to use the hash rather than the IP.
Many pieces of identification could be used to attribute multiple actions to the same anonymous user: IP address, user agent, session, etc. (or a combination of any of them). To ease the transition away from IP addresses on a technical and social level, the identification should continue to be tied directly to the IP address. This should reduce the burden on developers and functionaries who rely on this behavior. However, this could be changed in the future.
The phases listed below may be separated by hours, days, weeks, months, or even years.
##### Phase I (Prep)
NOTE: This phase has the most work for #dba.
Thankfully, most of the IP addresses being used in MediaWiki have already been moved into the [[ https://www.mediawiki.org/wiki/Manual:Actor_table | actor table ]]. To facilitate the changes, a new column will be added to the `actor` table:
| Field | Type | Null | Key | Default | Extra |
| ----- | ---- | ---- | --- | ------- | ----- |
| actor_mask | varbinary(255) | NO | UNI | NULL | |
(The field can have a different name; the author of this task has no preference.)
This field **should** be exposed to database dumps and the Toolforge replicas.
##### Phase II (Hashing)
MediaWiki will be updated to generate a hash of the IP address when an IP is inserted into the `actor` table:
$mask = hash_hmac( 'sha1', '127.0.0.1', 'UNSAFEKEY' );
(The hashing algorithm may change; the author of this task has no preference, but it **must** be cryptographically secure to prevent reverse-engineering.)
A maintenance script will be created to back-fill masks for the existing IP addresses in the `actor` table.
The key is labeled unsafe because, even if the IP addresses were removed from the database dumps and Toolforge replicas, a user could use existing public database dumps to build an exhaustive list that maps masks back to IP addresses. Therefore, the masks added to the database should **not** be treated as private.
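As a rough illustration of the hashing step, the sketch below wraps the `hash_hmac` call in a helper. The function name `computeActorMask`, the validation, and the normalization through `inet_pton`/`inet_ntop` are assumptions made for this example and are not part of the proposal; MediaWiki's own IP sanitization may produce a different canonical form.

```php
<?php
/**
 * Hypothetical helper: derive a mask for an anonymous actor's IP address.
 * The name and the normalization step are assumptions for illustration.
 */
function computeActorMask(string $ip, string $key, string $algo = 'sha1'): string
{
    // inet_pton() returns false for anything that is not a valid IPv4/IPv6 address.
    $packed = @inet_pton($ip);
    if ($packed === false) {
        throw new InvalidArgumentException("Not a valid IP address: $ip");
    }
    // Round-trip through inet_ntop() so equivalent spellings of one address
    // (for example "::1" and "0:0:0:0:0:0:0:1") produce the same mask.
    $canonical = inet_ntop($packed);

    return hash_hmac($algo, $canonical, $key);
}

$mask = computeActorMask('127.0.0.1', 'UNSAFEKEY');
echo $mask, PHP_EOL; // 40 hexadecimal characters when the algorithm is SHA-1
```

Normalizing before hashing matters mostly for IPv6, where one address has several textual forms; without it, the same address could end up with more than one mask.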
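The back-fill could look roughly like the sketch below. It works on plain arrays shaped like `actor` rows instead of a real database connection, and the function name `backfillActorMasks`, the batch size, and the decision to skip rows that are not IP addresses are all assumptions for this example.

```php
<?php
/**
 * Sketch of the back-fill logic over rows shaped like the `actor` table.
 * Rows are plain arrays here; a real script would read and write the database.
 */
function backfillActorMasks(array $rows, string $key, int $batchSize = 2): array
{
    foreach (array_chunk($rows, $batchSize, true) as $batch) {
        foreach ($batch as $i => $row) {
            $isAnonymous = $row['actor_user'] === null;
            // Not every row without a user ID is an IP (e.g. imported names).
            $isIp = filter_var($row['actor_name'], FILTER_VALIDATE_IP) !== false;
            if ($isAnonymous && $isIp && $row['actor_mask'] === null) {
                $rows[$i]['actor_mask'] = hash_hmac('sha1', $row['actor_name'], $key);
            }
        }
        // A real script would pause here so replicas can catch up (assumption).
    }
    return $rows;
}

$rows = [
    ['actor_id' => 1, 'actor_user' => null, 'actor_name' => '192.0.2.10', 'actor_mask' => null],
    ['actor_id' => 2, 'actor_user' => 42, 'actor_name' => 'ExampleUser', 'actor_mask' => null],
    ['actor_id' => 3, 'actor_user' => null, 'actor_name' => 'imported>Example', 'actor_mask' => null],
    ['actor_id' => 4, 'actor_user' => null, 'actor_name' => '2001:db8::1', 'actor_mask' => null],
];
$filled = backfillActorMasks($rows, 'UNSAFEKEY');
foreach ($filled as $row) {
    echo $row['actor_id'], ': ', $row['actor_mask'] ?? '(no mask)', PHP_EOL;
}
```

How rows for registered users should be populated, given that the proposed column is `NOT NULL`, is left open here.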
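To make that reasoning concrete, the sketch below joins an older dump that still contains IP addresses with a newer dump that only contains masks. The revision IDs, the array shapes, and the function name `correlateDumps` are invented for illustration; the point is only that the mapping can be rebuilt without ever knowing the key.

```php
<?php
/**
 * Rebuild a mask => IP mapping by matching revisions present in both dumps.
 * Both inputs are hypothetical: revision ID => IP, and revision ID => mask.
 */
function correlateDumps(array $oldDump, array $newDump): array
{
    $maskToIp = [];
    foreach ($newDump as $revId => $mask) {
        if (isset($oldDump[$revId])) {
            $maskToIp[$mask] = $oldDump[$revId];
        }
    }
    return $maskToIp;
}

// Old public dump: still lists the IP for each anonymous revision.
$oldDump = [1001 => '192.0.2.10', 1002 => '192.0.2.77', 1003 => '192.0.2.10'];

// New dump: the same revisions, now attributed to masks. The key used here
// is never passed to correlateDumps().
$newDump = [];
foreach ($oldDump as $revId => $ip) {
    $newDump[$revId] = hash_hmac('sha1', $ip, 'some key the reader does not know');
}

$maskToIp = correlateDumps($oldDump, $newDump);
echo count($maskToIp), ' masks resolved', PHP_EOL; // 2 masks resolved
```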
##### Phase III (Reveal Mask)
Wherever IP addresses are displayed, the mask should also be displayed (or otherwise made accessible) to users. This will allow users to start using masks rather than IP addresses wherever masks are accepted.
##### Phase IV (Accept Mask)
NOTE: This phase consists of a lot of work for developers across the organization.
Existing systems that do not use the `actor` table should be updated to use it.
Anything that takes IP addresses as input or produces them as output should accept **either** an IP address OR a mask. Whichever type is given should be the type that is returned.
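One way to read the "same type in, same type out" rule is sketched below. The function names, the shape of the `$edits` array, and the assumption that a mask is exactly 40 lowercase hexadecimal characters (the SHA-1 case) are all hypothetical.

```php
<?php
/** Classify input as 'ip' or 'mask'. Assumes 40-hex-character SHA-1 masks. */
function classifyActorInput(string $input): string
{
    if (filter_var($input, FILTER_VALIDATE_IP) !== false) {
        return 'ip';
    }
    if (preg_match('/\A[0-9a-f]{40}\z/', $input) === 1) {
        return 'mask';
    }
    throw new InvalidArgumentException('Expected an IP address or a mask');
}

/**
 * List edits for an actor, labelling each result with the same kind of
 * identifier the caller supplied. Matching is by exact string comparison,
 * which is a simplification.
 */
function listContributions(string $input, array $edits): array
{
    $type = classifyActorInput($input);
    $result = [];
    foreach ($edits as $edit) {
        if ($edit[$type] === $input) {
            $result[] = ['actor' => $edit[$type], 'title' => $edit['title']];
        }
    }
    return $result;
}

$mask = hash_hmac('sha1', '192.0.2.10', 'UNSAFEKEY');
$edits = [
    ['ip' => '192.0.2.10', 'mask' => $mask, 'title' => 'Sandbox'],
    ['ip' => '192.0.2.99', 'mask' => hash_hmac('sha1', '192.0.2.99', 'UNSAFEKEY'), 'title' => 'Main Page'],
];
print_r(listContributions('192.0.2.10', $edits)); // results labelled with the IP
print_r(listContributions($mask, $edits));         // results labelled with the mask
```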
##### Phase V (Default to Mask)
Now that all of the available tools accept a mask rather than an IP address, the display in MediaWiki (and its APIs) should default to the mask rather than the IP address. The IP address should still be retrievable in case it is needed during the transition, but its use should be highly discouraged, and any use should be documented so that it can be resolved.
##### Phase VI (Reject IPs)
Inputs (pages, tools, APIs, etc.) should now reject IP address input as invalid. This will prevent the input from accidentally revealing the IP address behind the mask.
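A minimal sketch of such a check follows. The function name `requireNonIpInput` and the error message are assumptions; anything that is not an IP address (a mask, or a username in tools that accept one) is passed through unchanged.

```php
<?php
/**
 * Hypothetical input guard for this phase: refuse anything that parses as
 * an IPv4 or IPv6 address and pass every other value through untouched.
 */
function requireNonIpInput(string $input): string
{
    if (filter_var($input, FILTER_VALIDATE_IP) !== false) {
        throw new InvalidArgumentException('IP addresses are no longer accepted; use a mask');
    }
    return $input;
}

try {
    requireNonIpInput('192.0.2.10');
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```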
##### Phase VII (Remove IPs)
The `actor.actor_name` column should be removed from database dumps and from Toolforge replicas. MediaWiki should refuse to display the contents of the field to anyone (unless they have signed a non-disclosure agreement).
##### Phase VIII (The Big Switch)
NOTE: This phase will have the most work for functionaries.
Now that the IP addresses are inaccessible and users are using masks rather than IPs, the final phase is to change the hash key to a secret key:
$mask = hash_hmac( 'sha1', '127.0.0.1', 'SUPERSECRETKEY' );
Doing this will break the edit history of an IP address. Before the switch, your mask could be `abcd`; after the switch, your mask could be `e4f5`. This is necessary to //actually// protect the IP address from being reverse-engineered from the existing publicly available database dumps.
After this switch, it will be impossible to get the IP address from the mask without either 1) production database access or 2) access to the hash key.
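The effect of the switch can be illustrated with a short sketch. Generating the key with `random_bytes` is only a stand-in for however the real secret would be created and stored, which this task does not specify.

```php
<?php
$ip = '127.0.0.1';

// Mask under the publicly known transition key.
$before = hash_hmac('sha1', $ip, 'UNSAFEKEY');

// Stand-in for a secret key (assumption: 32 random bytes, hex-encoded).
$secretKey = bin2hex(random_bytes(32));
$after = hash_hmac('sha1', $ip, $secretKey);

// The same IP now maps to an unrelated mask, so history is split at the switch.
var_dump($before === $after); // bool(false)

// A lookup table built from pre-switch data no longer resolves new masks.
$preSwitchLookup = [$before => $ip];
var_dump(isset($preSwitchLookup[$after])); // bool(false)
```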
#### Possible Future Enhancements
After this process is complete, there are many enhancements that could be made to further increase the privacy and safety of our users. The following is a non-exhaustive list of examples.
##### Remove IPs from Database
The `actor.actor_name` column could be removed from the database completely. Tools (like CheckUser) that currently return an IP address could instead return a mask.
##### Base mask on session instead of IP
Once everything has been switched to using masks instead of IPs, and those masks have no inherent meaning, it would be possible to change what the mask represents. For instance, we could change the mask to be encoded like this:
$mask = hash_hmac( 'sha1', 'SESSION_ID', 'SUPERSECRETKEY' );
Here, `SESSION_ID` is the user's session ID. This would create a new mask every time a session is generated for the user (i.e. for each new device, browser, etc.). This would, of course, break the social contract of what the mask represents, but would be technically trivial to implement, as the masks would function identically to the IP masks.
Doing this may open the door to sending "private" notifications to anonymous users, or to letting a standard user account "take over" edits that were made while accidentally logged out.
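A small sketch of the session-based variant is shown below. The function name `computeSessionMask` and the randomly generated session IDs are placeholders; real session identifiers would come from MediaWiki's session handling.

```php
<?php
/** Hypothetical helper: derive a mask from a session ID instead of an IP. */
function computeSessionMask(string $sessionId, string $key): string
{
    return hash_hmac('sha1', $sessionId, $key);
}

$key = 'SUPERSECRETKEY';

// Two sessions for the same person, e.g. a laptop and a phone (placeholder IDs).
$laptopMask = computeSessionMask(bin2hex(random_bytes(16)), $key);
$phoneMask = computeSessionMask(bin2hex(random_bytes(16)), $key);

var_dump($laptopMask === $phoneMask); // bool(false): one mask per session
var_dump(strlen($laptopMask));        // int(40): same shape as an IP-based mask
```

Because the output has the same shape as an IP-based mask, code that treats masks as opaque strings would not need to change.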