
IP Address ranges (CIDR) are stored as strings and cannot be queried
Open, Needs TriagePublic

Description

This task is a follow up to the discussion on Wikidata.

Problem
There is no way to query an IP address range stored in IPv4 routing prefix or IPv6 routing prefix. The string itself can, of course, be queried, but it is impossible to determine whether a given IP address falls within the stored range.

Example
Given the IP address 198.35.26.5, it is impossible to reason that the IP address belongs to the Wikimedia Foundation.

Why Now?
During the Anti-Harassment offsite, we discussed many potential improvements to CheckUser. One idea was to provide a reverse IP address lookup as a standalone tool on Toolforge, and maybe within the extension itself (see T174553). This would give stewards details about an IP address and help them make sock judgements. Many stewards already use non-free (as in speech) tools for this purpose. The Wikidata community already considers IP ranges valuable data (given the creation of the two aforementioned properties). It might be wise to utilize this data for the community.

Proposed Solution
Based on the discussion, the best approach would be to fix the storage rather than duplicate the data.

Create a new data type that extends from quantity. This new data type would be able to store a single IP address in decimal form, or a range of IP addresses in decimal form. The unit would be set to either http://www.wikidata.org/entity/Q11103 or http://www.wikidata.org/entity/Q2551624 based on what the input is (the URIs should be customizable in config). The UI would display either a single IP address or a CIDR range. The Wikidata-Query-Service would allow querying the field in decimal form, or perhaps a FILTER can be created in the future to convert an IP address to decimal form.
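As a sketch of the conversion the proposed datatype implies (the helper name is illustrative, not part of any Wikibase API), a CIDR range collapses to the same decimal centre ± radius shape a quantity value uses:

```python
import ipaddress

def cidr_to_quantity(cidr: str):
    """Convert a CIDR range to the decimal centre ± radius form a
    quantity-based value would store (illustrative, not a Wikibase API)."""
    net = ipaddress.ip_network(cidr, strict=False)
    lo = int(net.network_address)
    hi = int(net.broadcast_address)  # last address of the block, v4 or v6
    return (lo + hi) / 2, (hi - lo) / 2

# 198.35.26.0/23 covers 198.35.26.0 - 198.35.27.255
print(cidr_to_quantity('198.35.26.0/23'))  # (3324189439.5, 255.5)
```

A single address is just the degenerate case with radius 0, which is why one datatype can cover both inputs.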

Property Proposal
https://www.wikidata.org/wiki/Wikidata:Property_proposal/Organization#IP_address_or_range

Work Arounds

  1. Duplicate the data in a Wikidata qualifier (the original proposal) or index the data in an external database. This is problematic, though, as it requires querying and looping through many records (right now thousands; later hundreds of thousands or even millions) and keeping that data up to date. It also becomes problematic to filter on other properties (as you would with the Wikidata-Query-Service).
  2. Convert the existing fields to a standard quantity field. This solution does solve the problem, but creates a user experience (UX) problem, as editors will need to properly convert an IP range to decimal format. For instance, 10.0.0.1/24 would need to be entered as 167772287.5±127.5.
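Under either decimal encoding, the membership test becomes plain arithmetic on the query side; a minimal sketch (function name hypothetical):

```python
import ipaddress

def in_decimal_range(ip: str, center: float, radius: float) -> bool:
    """Check an address against a range stored as centre ± radius,
    as a numeric FILTER over a quantity value effectively would."""
    return abs(int(ipaddress.ip_address(ip)) - center) <= radius

# 10.0.0.1/24 stored as 167772287.5 ± 127.5
print(in_decimal_range('10.0.0.5', 167772287.5, 127.5))  # True
print(in_decimal_range('10.0.1.5', 167772287.5, 127.5))  # False
```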

Details

Related Gerrit Patches:
mediawiki/extensions/Wikibase : master | Create a new datatype for IP Address

Event Timeline

dbarratt created this task. · Oct 13 2019, 4:35 PM
Restricted Application added a project: Wikidata. · Oct 13 2019, 4:35 PM
Restricted Application added a subscriber: Aklapper.
dbarratt updated the task description. · Oct 13 2019, 4:45 PM
Reedy added a subscriber: Reedy. · Edited · Oct 13 2019, 4:49 PM

Given the IP address 198.35.26.5, it is impossible to reason that the IP address belongs to the Wikimedia Foundation.

Not really, and MW has functionality for telling you that. Wikibase could expose/use that.

I would concur it might need some brute-force-style investigation to find out which CIDR it belongs to (for each IP CIDR stored in Wikidata...). But it's certainly not impossible.

> var_dump( IP::isInRange( '198.35.26.5', '198.35.26.0/23' ) );
bool(true)

> var_dump( IP::isInRanges( '198.35.26.5', [ '198.35.26.0/23', '208.80.152.0/22', '91.198.174.0/24', '198.73.209.0/24', '185.15.56.0/22', '103.102.166.0/24' ] ) );
bool(true)
Reedy added a comment. · Oct 13 2019, 4:56 PM

And FWIW, CheckUser stores IPs as the string and as a hexadecimal representation, which it uses for IP comparison and maths...
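The hex scheme Reedy describes works because fixed-width hexadecimal sorts the same way as the underlying integer, so range checks become plain string comparisons. A rough sketch of the idea (MediaWiki's actual `IPUtils::toHex` output format may differ in detail):

```python
import ipaddress

def to_hex(ip: str) -> str:
    """Fixed-width hex form, comparable as a plain string (a sketch of
    the idea behind CheckUser's hex column, not its exact format)."""
    addr = ipaddress.ip_address(ip)
    width = 8 if addr.version == 4 else 32  # 32 or 128 bits, 4 bits per digit
    return format(int(addr), 'X').zfill(width)

# String comparison now matches numeric comparison:
print(to_hex('198.35.26.0') <= to_hex('198.35.26.5') <= to_hex('198.35.27.255'))  # True
```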

dbarratt added a comment. · Edited · Oct 13 2019, 5:39 PM

Not really, and MW has functionality for telling you that. Wikibase could expose/use that

Apologies, I didn't mean impossible in the literal sense; I meant that it's impossible without implementing either the proposed solution or one of the listed workarounds.

And FWIW, CheckUser stores IPs as the string and as a hexadecimal representation, which it uses for IP comparison and maths...

Yes, so does ipblocks, but Wikibase does not have a way to store ranges in hexadecimal format, so rather than building one, it would be simpler to store them in decimal format.

Reedy added a comment. · Oct 13 2019, 5:46 PM

And FWIW, CheckUser stores IPs as the string and as a hexadecimal representation, which it uses for IP comparison and maths...

Yes, so does ipblocks, but wikibase does not have a way to store ranges in hexadecimal format, so rather than building one, it would be simpler to store them in decimal format.

Again, MW has functionality for doing this conversion, splitting, reformatting, etc. You could store the start and the end of the range, like we do elsewhere for ranges. Wikibase stores most things in JSON blobs, so it's not like we're going to need schema changes, and storing a couple of extra fields for a property that isn't that widely used doesn't seem like a storage consideration we really need to care about.

Just because it's simpler doesn't necessarily mean it's the best solution, either.

dbarratt added a comment. · Edited · Oct 13 2019, 5:48 PM

Again, MW has functionality for doing this conversion, splitting, reformatting, etc. You could store the start and the end of the range, like we do elsewhere for ranges. Wikibase stores most things in JSON blobs, so it's not like we're going to need schema changes, and storing a couple of extra fields for a property that isn't that widely used doesn't seem like a storage consideration we really need to care about.

Right, this is workaround #1. I completely acknowledged this as a potential solution, along with its drawbacks.

This is problematic though, as that database would not be part of Wikimedia production and therefore cannot be accessed from another wiki directly.

I don’t understand what this means. What kind of access are you talking about? Because the Query Service can’t be accessed from wikitext (parser functions or lua) either.

I still think that building a tool for this is the most reasonable approach.

I don’t understand what this means. What kind of access are you talking about? Because the Query Service can’t be accessed from wikitext (parser functions or lua) either.

Well, that's fascinating. I suppose what I meant was making an XHR request to the Query Service from a special page. It's fine to send the user to a tool on Toolforge for this, though, so I suppose my concerns really center on freshness and scalability.

I still think that building a tool for this is the most reasonable approach.

I think it might be best to separate the problems. The UI of this "tool" (whether in an extension or on Toolforge) is the same regardless of the solution. Where the data is stored, and how it is updated and queried, is the bigger problem. I'm not opposed to building an index of the data, but I do have some questions:

  1. How should this data be updated and how frequently?
  2. Should we periodically/continuously query for all of the statements and index them? Or should we listen for recent changes and parse the comment for the properties we're concerned with?
  3. Right now the number of values is in the thousands, but what happens when we are in the tens of thousands or hundreds of thousands (or more), which is certainly possible?

It seems, on the surface, that these are not straightforward problems to solve (unless I'm missing something completely), and these problems have already been solved by the Query Service. If fixing the problem is simply creating a formatter for an existing data type, that seems pretty straightforward to me. Or am I way off?

dbarratt updated the task description. · Oct 14 2019, 1:26 PM
dbarratt updated the task description.

Change 544323 had a related patch set uploaded (by Dbarratt; owner: Dbarratt):
[mediawiki/extensions/Wikibase@master] Create a new datatype for IP Address

https://gerrit.wikimedia.org/r/544323

Verdy_p added a subscriber: Verdy_p. · Edited · Sun, Oct 20, 10:59 PM

The decimal form cannot be correct if it's not qualified with the address type (IPv4 or IPv6).

And the numeric type cannot hold 128 bits of precision for IPv6. (Note: we really need 128 bits to support users who can only use IPv6 via internet relays; many relays do not deliver a full /48 or /64 range to each user, but grant them very small ranges, e.g. a /124 only. For some ISPs (notably some mobile ISPs) there's no other choice, because they can't safely maintain enough IPv4 sessions for long, as they don't have enough IPv4 addresses to share or to use for NAT routing: various mobile ISPs only give a local address like 192.168.0.1 or 10.0.0.1, with all traffic passing through a proxy and then routed from the same publicly routable /48 IPv6 block, where each user session is mapped to a very small range.)
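The precision point is easy to demonstrate: if 128-bit values ever pass through IEEE-754 doubles (53-bit mantissa), adjacent IPv6 addresses become indistinguishable:

```python
# Two adjacent IPv6 addresses at the top of the 128-bit space
a = 2**128 - 2
b = 2**128 - 1
print(a == b)                # False as exact integers
print(float(a) == float(b))  # True: a 64-bit double cannot tell them apart
```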
It would be preferable to use a storage form based on hexadecimal.
Note that IPv4 addresses also have an equivalent canonical IPv6 mapping, so everything could be mapped to IPv6.
So just use a format using fixed-length hexadecimal strings (32-bit or 128-bit) and another byte to specify the CIDR prefix length, possibly with an additional suffix byte for protocol filters.
And some addresses may also be stored as the result of a DNS query, in which case the value should contain the domain name and a timestamp, so we know when the IP address needs to be re-resolved.
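The IPv4-to-IPv6 mapping mentioned above is the standard ::ffff:0:0/96 IPv4-mapped embedding (RFC 4291), so a single 128-bit fixed-width form can cover both families; a sketch:

```python
import ipaddress

def canonical_v6(ip: str) -> ipaddress.IPv6Address:
    """Map any address into IPv6: IPv4 goes into the ::ffff:0:0/96
    IPv4-mapped block (RFC 4291); IPv6 passes through unchanged."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        return ipaddress.IPv6Address((0xffff << 32) + int(addr))
    return addr

print(canonical_v6('198.35.26.5'))
```

The mapping is lossless: the original IPv4 address is recoverable from the low 32 bits of the mapped value.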

We could also filter not just IP addresses but some AS numbers, if we need to defend against malicious third-party networks whose IP addresses change often, or if some AS is out of control or under attack. AS numbers are now past the old 16-bit limit and can already be 32-bit too (so there's no way to distinguish a 32-bit IPv4 address from a 32-bit AS number).

For some filters, we may need to identify not just the source IPv4 or IPv6 but also the peered IPv4 or IPv6, or a tracking identifier when they are reported by the proxy (e.g. for students in a university, or workers in a company or public organisation using strong firewalls and strict protocol filters): if we want to report abuses to the remote proxy admin, they'll want us to specify this identifier, which can be any string (possibly with a date or some additional fields required by them).

So a single "number" cannot fit. The new type should still be based on strings, but with special formatting rules.

Finally, don't assume that a single IPv4 address is used by only a single user at a time, even if there is no proxy. Some ISPs are already using NAT+PAT.

For example, Free.fr assigns the same IPv4 address to 4 different users, not even necessarily in the same region: each user has access to only 16384 port numbers, with the first 1024 reserved, the first and last having special multicast semantics on routers, and a range reserved for dynamically translating port numbers when they don't fit in the assigned range. NAT is then enforced by the ISP and includes translation of port numbers, even though the public IP and the port range these 4 users share are stable for very long periods (which can exceed several years).

For these users this IPv4 solution works, but the ISP also assigns them a dedicated IPv6 range containing 8 subblocks, which can have different routes depending on the quality of service requested by different applications. The public IPv6 address may not always be the same for the same user, depending on the protocols or applications/libraries used, even if they belong to the same /96 range; the default subblock 0 (without QoS) uses only 32 bits, leaving other subblocks to use other IPv6 address mappings, such as ones derived from a 48-bit Ethernet MAC. As well, some subblocks of the public /64 block assigned to the user may in fact be unreachable, and instead routed via VPNs of specific services that re-expose the public traffic on an unrelated IPv6 block, possibly from another ISP.
Users may also use anonymizing proxies provided by their security solutions (including antivirus suites or password managers). Their traffic, even if it is initially routed to the solution provider over an IPv6 channel without any tunnel, may be visible to Wikimedia as originating only from the proxy, which reveals only a dated tracking ID to be used for abuse reports. The tracking ID may be a subfield of the proxy's IPv6 address, but it is not necessarily constant across queries; it can be randomized and allocated by secret algorithms inside a very large IPv6 block (i.e. a very small CIDR range value). These providers normally have a policy for reporting abuses via their network, and they monitor and act on alerts they receive directly or detect from other cooperative networks.

Sorry for chiming in very late. It seems too niche to add an additional data type to Wikibase for this, if the hurdle isn't storing the data but just querying it in a specific way. I believe Lucas' suggestion (writing a SPARQL query to fetch the current data from Wikidata every day or so, to feed a tool that can put it into the right format for figuring out whether a given IP address is in a certain range) is the best way to go.