#### Description
The #anti-harassment Tools Team will be building a new extension for the #ip_info feature. The [[ https://meta.wikimedia.org/wiki/IP_Editing:_Privacy_Enhancement_and_Abuse_Mitigation/IP_Info_feature | project page ]] gives a detailed description of the project.
Effectively our extension will provide information //about// an IP address without (1) the need for the user to use an external service themselves and (2) exposing the IP address itself to the user. This provides the user with any details they could have retrieved from knowing the IP address. This information could be displayed in various ways (hover card, special page, etc.). Initially, this information would only be displayed on a select few pages (or as a beta feature to a select group of users). Eventually, this could be displayed to all (logged in?) users wherever an IP address is currently displayed in the interface.
We plan on building an API endpoint T260603 that takes an edit id or log id and returns data about the IP address used for that action. For anonymous actions it would provide a result to all users (who are logged in?). This endpoint //may// provide a result to checkusers for actions performed by logged-in users. Regardless, the data will only be returned for actions performed within the previous year (?) for anonymous actions and 90 days for actions performed by logged-in users.
Based on our investigation in T259726, the data our users are looking for is not accessible from freely licensed datasets. Therefore, we will be looking to purchase a license to a proprietary dataset (or use one we've already purchased).
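The access rules above can be sketched as a single predicate. This is only an illustration, not the planned implementation: the function name, its parameters, and the two constants are hypothetical, and the one-year window, the logged-in requirement for viewers, and checkuser access are all still open questions in this task.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed retention windows, taken from the figures floated above (both unconfirmed).
ANON_RETENTION = timedelta(days=365)
LOGGED_IN_RETENTION = timedelta(days=90)


def may_return_ip_info(
    action_time: datetime,
    actor_is_anonymous: bool,
    viewer_is_logged_in: bool,
    viewer_is_checkuser: bool,
    now: Optional[datetime] = None,
) -> bool:
    """Decide whether a (hypothetical) endpoint should return data for one action."""
    now = now or datetime.now(timezone.utc)
    age = now - action_time
    if actor_is_anonymous:
        # Assumption: only logged-in viewers get results for anonymous actions.
        return viewer_is_logged_in and age <= ANON_RETENTION
    # Logged-in actor: only checkusers, and only if that option is adopted.
    return viewer_is_checkuser and age <= LOGGED_IN_RETENTION
```

Keeping the rule in one place like this would make it easy to adjust once the retention periods and viewer groups are decided.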
There are several ways we could implement this feature. We plan on creating an API endpoint T260603 that will accept a log id or revision id and return the information about the IP address used for that edit. This could even be added to the existing endpoints for revision or logs.
What could be problematic is how this data is retrieved from the proprietary dataset.
There are at least three ways to accomplish this:
# Implement a background job process like #machinevision. The extension would fetch the information from the dataset after a revision or logged action has taken place and store this information in the database. There would also be a job to go through the historical IP addresses and backfill the database. This would have some #privacy concerns as the amount of personally identifiable information (PII) in the database would increase rather than decrease (especially if we store latitude and longitude). T259725#6383339
# IP info is currently calculated as part of a [[ https://github.com/wikimedia/puppet/blob/3529ffc7b55d1e917f17a4175091860e3f81b790/modules/varnish/templates/geoip.inc.vcl.erb | custom varnish function ]] and attached to incoming requests with a cookie (I assume this cookie is tied to the IP address being used?). This is currently being used by #wikimedia-fundraising to target banner display. We could expand the usage of this function and collect the incoming data (when an edit or logged action is performed) within the database. This would still have the PII problems, but would avoid having to run a job on the servers and would use an existing system.
# The information could be retrieved //on-demand// from the proprietary dataset. This is a simple solution and reduces the PII we store in our database, but //could// have performance implications. Proprietary datasets typically offer either a downloadable database (like [[ https://www.maxmind.com/ | MaxMind ]]) or a highly available/cacheable webservice (like [[ https://ipinfo.io/ | IP Info ]]) or sometimes both. When a user requests information about an IP address, the request will go to our API endpoint, which will then look up the data in the proprietary dataset on demand. To handle more requests, we could move our API endpoint out of MediaWiki (which would use a PHP connection) and instead use a separate, custom microservice (with nginx? node.js?) that could handle many more simultaneous requests.
Of the options available, we believe that Option 3 is the riskiest performance-wise, but the least risky from a third-party license perspective. Since we are not yet sure what the restrictions of the license will be, we will proceed with Option 3 until we know for sure that we are able to pursue a different option.
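To make the performance trade-off of Option 3 more concrete, here is a minimal sketch of an on-demand lookup with a short-lived in-memory cache. Everything in it is an assumption for illustration: the class name, the `fetch` callable (a stand-in for whichever vendor database or web service is chosen), and the TTL value. Whether caching results is allowed at all depends on the license terms, which are not yet known.

```python
import time
from typing import Callable, Dict, Optional, Tuple

IPInfo = Dict[str, str]


class OnDemandIPInfo:
    """Looks up IP data on demand and keeps results in memory for a short TTL."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[IPInfo]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # hypothetical stand-in for the proprietary dataset
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Optional[IPInfo]]] = {}

    def lookup(self, ip: str) -> Optional[IPInfo]:
        now = self._clock()
        hit = self._cache.get(ip)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh cache entry, no dataset access needed
        info = self._fetch(ip)  # the potentially slow call
        self._cache[ip] = (now, info)
        return info


if __name__ == "__main__":
    # 192.0.2.1 is a reserved documentation address; the result is made up.
    service = OnDemandIPInfo(lambda ip: {"country": "ZZ"})
    print(service.lookup("192.0.2.1"))
```

Nothing is written to the database here, which matches the goal of storing less PII, although cached entries would still sit in memory until they expire.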
#### Preview environment
> //(Insert one or more links to where the feature can be tested, e.g. on Beta Cluster.)//
>
> Hosting the changes on Beta Cluster is a requirement prior to performance review. Please ensure that the feature can be used directly on the link(s) provided, without any data entry such as having to create an article. Any sample content needed should already be present.
>
> If the changes cannot be hosted on Beta Cluster, explain why and provide links to an alternate public environment instead where the Performance Team can also SSH into. Links to code only is insufficient for a performance review.
The feature will either be available on the beta cluster or on our test environment (T260607) depending on the relative timing of this review and the security review (T260822).
#### Which code to review
> //(Provide links to all proposed changes and/or repositories. It should also describe changes which have not yet been merged or deployed but are planned prior to deployment. E.g. production Puppet, wmf config, or in-flight features expected to complete prior to launch date, etc.).//
At the time of requesting this review, we're at the start of the project and haven't implemented the feature yet.
The extension repository is at [[https://gerrit.wikimedia.org/r/admin/repos/mediawiki%2Fextensions%2FIPInfo|mediawiki/extensions/IPInfo]], but more detailed links will follow before the review takes place.
#### Performance assessment
> Please initiate the performance assessment by answering the below:
>
> - What work has been done to ensure the best possible performance of the feature?
> - What are likely to be the weak areas (e.g. bottlenecks) of the code in terms of performance?
> - Are there potential optimisations that haven't been performed yet?
> - Please list which performance measurements are in place for the feature and/or what you've measured ad-hoc so far. If you are unsure what to measure, ask the Performance Team for advice: [[ mailto:performance-team@wikimedia.org | performance-team@wikimedia.org ]].
TBD