Page MenuHomePhabricator

Investigation: How to add IP range support to Special:Contributions
Closed, ResolvedPublic5 Estimated Story Points

Description

Epic at T145912

@bd808 had some solid ideas about how we could represent each IP (v4 or v6) as a unique value in an indexed column, which should make simple queries fairly fast. Let's hash out the details and come up with a plan.

Currently the only useful index for this is a non-unique index on (rev_user_text, rev_timestamp) in the revision table.

Let's answer the following questions:

  1. What CIDR ranges would be useful to support searches for (both for IPv4 and IPv6)? This may require asking some administrators.
  2. With the current database schema, what CIDR ranges could we practically support (so that users of Special:Contributions would get a result within a reasonable amount of time)? This may require running some test queries.
  3. Are there potential changes to the database schema that would let us support more of the ranges listed from question #1?
  4. If the answer is yes, would it be feasible for us to make such changes (and worth the effort)?
  5. Could we also list block information for the range, such as individual IPs and/or subranges within in? (You can see a current example at https://en.wikipedia.org/wiki/Special:Contributions/MusikPuppet4)

Event Timeline

We hash IPv4 like this for trusted xff...

https://noc.wikimedia.org/conf/highlight.php?file=trusted-xff.php

We pass it through IP::toHex()

No real solution for IPv6 yet, https://github.com/wikimedia/mediawiki-extensions-TrustedXFF/blob/master/TrustedXFF.body.php#L10-L19

Now we're not using the CDB, we might be able to fix it

DannyH set the point value for this task to 5.
DannyH moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.
kaldari updated the task description. (Show Details)

Here are my basic thoughts:

  • We are only concerned about IPv4/IPv6 addresses when the revision.rev_user = 0. We do not and should not store the IP address of authenticated users (except in checkuser data for which access is tightly controlled and data is frequently purged).
  • When revision.rev_user = 0, we do store the canonical IPv4/IPv6 address as a string in the revision.rev_user_text column (e.g. "127.0.0.1", "2001:db8:85a3::8a2e:370:7334").
  • The text storage is not helpful for range searches. To effectively use the index on the revision.rev_user_text column for range searching we would need a where clause condition like revision.rev_user_text IN (...) that enumerated all of the possible IPv4/IPv6 addresses in the range. This enumeration is too large to be effectively useful for most interesting IPv6 ranges.
  • An IPv4/IPv6 address is actually just a large number.
  • Any IPv4/IPv6 CIDR range is a bounded, contiguous set of large numbers (i.e. addr => X AND addr <= Y).
  • All IPv4 addresses can be mapped unambiguously into the IPv6 address space via the reserved ::ffff:0:0:0/96 range which makes storing mixed IPv4 and IPv6 in a single column possible.
  • A new column (i.e. rev_anon_ip VARBINARY(16) NOT NULL DEFAULT 0) could store the anonymous editor's IPv4/IPv6 address that has been mapped into the IPv6 space and treated as a number.
  • When revision.rev_user <> 0 we would populate the same column with 0 indicating that no address is available.
  • IPv4/IPv6 range searches for anon edits could be done via an index scan of the new rev_anon_ip column using a where clause condition like rev_anon_ip => start_of_ipv6_mapped_range AND rev_anon_ip <= end_of_ipv6_mapped_range.

A DBA needs to sanity check the idea that we can just add such a column to the revision table and that MySQL/MariaDB will actually use an index on a varbinary that is predominately filled with 0 values effectively.

I looked at both of the current solutions we have and compared them to come up with some information.

Range Contribs
  • Only supports IPv4 and CIDR range searching.
  • No limits on the query range as such but the lower. your CIDR range (i.e. the more number of IP addresses in your range), the slower it becomes.
    • A query for CIDR range 22 i.e. 1024 addresses takes about 30 seconds to load initially but the page takes ~5 minutes to complete loading and loads a whopping 8.4 MB of data when only querying data from this year. It loads 5764 entries (List by time).
  • It performs better when querying only a range of 16 addresses starting from 2014 in about 12 seconds with 1756 entries.
  • Querying for a CIDR range of 23, i.e. 512 addresses hits the sweet spot with 30 seconds load time and 5764 entries. Only querying for data from this year.
Contribsrange Gadget
  • Supports both IPv4 and IPv6 but only wildcard searches for both.
  • For IPv4, the gadget seemingly performs better than xtools tool probably because it only supports wildcard prefix searches. For instance a search for 213.205.* (link) i.e. CIDR range 16 (>65k addresses) produces ~8000 records which is 7.11 MB in about 50 seconds. The number of records seem much less than expected so I'm guessing it cuts off the data after a limit. There's a limit param in the URL which doesn't seem to work because it loads the same number of records irrespective of the limit.
  • For IPv6: The gadget performs pretty fast in wildcard searching too (example) but that is again because it simply does wildcard prefix searches (i.e. SELECT ... FROM revision WHERE rev_user_text LIKE '213.205.%').

I think it's safe to assume that building this feature into core/as an extension will be faster than this but it helps to get an idea of what to expect.

Options for proceeding with this task:
  1. We can time-restrict CIDR range searches to maybe only show data from a limited time frame. We'd need to cross-check with admins. We'd also need to convey this explicitly on the Special:Contribs page. This should speed-up our lookups but we still need to enforce some range limits we are doing lookups for.
  2. We can add a column to revision table which will store both the IPv4 and IPv6 addresses in the IPv6 Hex representation. I guess the main question here is about the feasibility and usefulness of adding this column and the trade-offs, if any.
Open questions
  1. Do we want to support Wildcard searches along with CIDR ones?
    1. It seems we should support them since plenty of people are using the contribsrange gadget. Also it seems quite trivial to implement. We can parse the field value and decide whether the user meant to do a wildcard search or CIDR range search.
  1. How big range searches do we really want to do here?
    1. Looking at the Range table for IPv6 , it's evident that the search timing will go up exponentially depending on what CIDR ranges we support. Current rules state that: they can affect up to 65,536 IPv4 addresses or 649,037,107,316,853,453,566,312,041,152,512 (~6.49*1032) IPv6 addresses each which means the lowest we can go with CIDR for IPv4 is /16 and for IPv6 /19 which is pretty big.
    2. Looking at the rangeblocks data since 2005 we can see that people frequently go down to the 16 limit for IPv4 but the lowest I found for IPv6 was 32 which still affects 79,228,162,514,264,337,593,543,950,336 addresses (4,294,967,296 /64 subnets).

We can time-restrict CIDR range searches to maybe only show data from a limited time frame. We'd need to cross-check with admins. We'd also need to convey this explicitly on the Special:Contribs page. This should speed-up our lookups but we still need to enforce some range limits we are doing lookups for.

I'd say the last 30 days (going off of recent changes) is a must, but there is value in searching for any arbitrary range. I've blocked some school IP ranges for a whole year, after evaluating there was 90-something% disruption during that time. That was IPv4, though, and edits were infrequent. With IPv6 and an ISP that constantly refreshes the IP, it probably isn't going to be feasible to review a year's worth of data, if even a few months. So if it helps, you might consider different date range limitations for v4 vs v6.

We can add a column to revision table which will store both the IPv4 and IPv6 addresses in the IPv6 Hex representation. I guess the main question here is about the feasibility and usefulness of adding this column and the trade-offs, if any.

I think we should poke some DBAs about it soon. I worry there might be a quick and resounding "no", and hopefully I'm wrong :) If it is a no, let's aim for recent changes.

Do we want to support Wildcard searches along with CIDR ones?

The only concern I have with wildcards is that admins should not do range blocks based on data they see there. I suppose if they knew how to do range blocks they should know a wildcard isn't an effective way of reviewing edits for that range, so no harm in supporting it I suppose.

How big range searches do we really want to do here?

I think we only need to support the widest blockable range, as defined by $wgBlockCIDRLimit.

Finally, what about showing any blocks that exist within the range that's being queried? This is useful information. At the least, I think we should somehow show which individual IPs are blocked within the results, sort of how it is shown in the CheckUser tool. We don't want admins blocking ranges when the troublesome IPs or subranges are already blocked.

Right, I forgot to mention about those. Querying for blocks should not be hard. I'm not sure how the CheckUser tool shows them. The screenshots on the documentation page don't have any examples.

One more thing I think we should be showing in the UI is the start and end IPs for the range queried for along with the number of IPs covered in that range. It's trivial to calculate it and important so people don't make a mistake.

Right, I forgot to mention about those. Querying for blocks should not be hard. I'm not sure how the CheckUser tool shows them. The screenshots on the documentation page don't have any examples.

It's nothing too special, just says "(blocked)" at the end with a link to the block (Special:BlockList I think): https://commons.wikimedia.org/wiki/File:CheckUser2.png

One more thing I think we should be showing in the UI is the start and end IPs for the range queried for along with the number of IPs covered in that range. It's trivial to calculate it and important so people don't make a mistake.

Yes, definitely a good idea to show many IPs will be affected. That should be enough to make admins think twice about blocking :)

The UI hasn't really been discussed, but I assume it would look something like the gadget, or the "enhanced watchlist", with expand/collapse controls to see contributions for specific IPs.

@jcrespo: In order to make IP range searches performant, we've been discussing adding a new column to either the revision table (ideally) or the recentchanges table (See T147664#2763996). For example, we could add a new column rev_anon_ip VARBINARY(16) NOT NULL DEFAULT 0 that stores a numeric representation of any anonymous user's IPv4 or IPv6 address (and stores 0 for logged in users). Two questions:

  1. Would it be at all sane and feasible to add such a column to the revision table?
  2. Would MySQL/MariaDB actually use an index on a varbinary that is predominately filled with 0 values effectively?

The revision table is already the largest table, I have nothing personally against that change, but I would strongly suggest doing that only if you are going to drop the usage of the user text one everywhere on the code. Otherwise, I would suggest the alternative, which is indexing that value on a virtual column, so only the index is created, not the underlying data.

This is only for enwiki/commons/wikidata, etc. which have 600GB revision tables, and we have performance problems with it. For mediawiki as a product, for small wikis or on recentchanges anywhere, there are no issues.

Would MySQL/MariaDB actually use an index on a varbinary that is predominately filled with 0 values effectively?

If that goes ahead, fill it with NULL, which should take only 1 bit*, not with 0es, if possible.

Let me rephrase my last comment, what I mean is not "do not add columns to revision". What I mean is, revision requires a deep revision (pun intended) for the largest wikis for scalability purposes, and due to that, it is difficult to add more info there on those large wikis.

After further discussion (and considering jcrespo's feedback), we're leaning towards modifying recentchanges over modifying revision. We're also leaning towards creating a new special page via an extension, rather than modifying Special:Contributions. This will allow us to have a dedicated UI just for dealing with IP ranges and will also make it less confusing if it is limited to the last 30 days.

@jcrespo: The idea of an index on a virtual column is interesting. I was under the impression that we weren't using virtual columns in MediaWiki yet since there are some incompatibilities between the MariaDB and MySQL implementations. Plus some popular SQL tools such as mysqldump, phpMyAdmin, SQLyog, etc. don't know how to handle virtual columns. That impression, however, is based on looking into using virtual columns about a year ago, and things might be different now. What's your opinion on it?

we weren't using virtual columns in MediaWiki yet since there are some incompatibilities between the MariaDB and MySQL implementations

The #1 reason would be that we are still stuck with MySQL 5.0 functionalities<ref>https://www.mediawiki.org/wiki/Manual:Installation_requirements#Database_server </ref>, which are almost older than me.

Aside from that, the incompatibilities will never go away, as MariaDB took the decision to fork things in 10.0 and not looking back, and that is bad for us. For example, right now we can monitor replication with MariaDB's GTID implementation, but not with MySQL. But my rants are offtopic here, I think.

Virtual columns in 10.0+ and 5.7+ are comparable, but use different syntax. I would not use them extensively, but if there is something optional or WMF-only (an extension, or a special config) which would provide a huge gain, we can give them a look. In particular 5.7 virtual columns allows to index quite efficiently json data, which is since that version a native type.

After further discussion (and considering jcrespo's feedback), we're leaning towards modifying recentchanges over modifying revision. We're also leaning towards creating a new special page via an extension, rather than modifying Special:Contributions. This will allow us to have a dedicated UI just for dealing with IP ranges and will also make it less confusing if it is limited to the last 30 days.

I would generally advise against modifying core tables for the purposes of extensions; the same way we advise against core hacks for support extensions. No other extensions do it (changes are incorporated into core where appropriate).

Generally, rather than adding more columns to a table, for example, the extension should add their own table, and link it back to the mw core table by the relevant PK

Especially as it could cause an extra testing/maintenance burden

Also, aren't people going to complain that they can't query older than $wgRCMaxAge?

I would generally advise against modifying core tables for the purposes of extensions; the same way we advise against core hacks for support extensions. No other extensions do it (changes are incorporated into core where appropriate).

Generally, rather than adding more columns to a table, for example, the extension should add their own table, and link it back to the mw core table by the relevant PK

Ouch. This makes sense, but I don't have to like it! :) We discussed that even if we do it in recent changes, it might still make sense to put into core. While range blocks are probably most common on larger wikis that are subject to more disruption, I still think a tool like this is essential to basic site administration. IPv6 is only going to become more prominent, rendering single-address IP blocks ineffective, even for smaller wikis.

Also, aren't people going to complain that they can't query older than $wgRCMaxAge?

I am speaking for myself, but most of the time when doing range blocks we are concerned about recent disruption, since longer range blocks aren't as common. But, if we do end up creating our own table for the extension, maybe it could link back to the revision table instead of recent changes so we can query for older contributions.

I am speaking for myself, but most of the time when doing range blocks we are concerned about recent disruption, since longer range blocks aren't as common. But, if we do end up creating our own table for the extension, maybe it could link back to the revision table instead of recent changes so we can query for older contributions.

Makes sense, if the primary reason is for vandalism et al.

Though, we have the likes of Translatewiki, who AFAIK, don't clear their recent changes at all

I wonder if there needs to be some changes to help CheckUser too along these lines. I digress, though... But if you're putting functionality into supporting this, it may have some wider benefits exposing functionality into CheckUser eventually... A task should be probably be filed for that

Ouch. This makes sense, but I don't have to like it! :) We discussed that even if we do it in recent changes, it might still make sense to put into core. While range blocks are probably most common on larger wikis that are subject to more disruption, I still think a tool like this is essential to basic site administration. IPv6 is only going to become more prominent, rendering single-address IP blocks ineffective, even for smaller wikis.

It's mostly about preventing "WMF specific" hacks making it into core... again. Things like WMF specific feature flags etc (that exist for longer than a release cycle of MW). We've spent years moving WMF specific scripts, config, code etc out of core directly, into config, extensions and alike.

It may make perfect sense for the column to be in core, and as such, be populated in core (rather than via some hook call that means it's only done if you've got the extension), which alleviates most of my concerns here

Then the question of course, whether this should just live in core anyway; like you say, IPv6 is only going to become more prevalent, there's no two ways about it.

Many ways to skin a cat, as usual. I do like that it's being talked about in depth first, rather than wasting time developing something and getting a lolnope, that ain't gonna happen :)

Luckily recentchanges has a primary key, so putting the IP data in a separate table isn't a bad idea, especially if it's going to be implemented as an extension. Right now, we only have 1 actual use case for this data, so I don't think there's a rush to build it into core MediaWiki. The less we have to modify core, the faster this will get implemented (probably by an order of magnitude).

I would generally advise against modifying core tables for the purposes of extensions

+1000 to that, we recently had issues with this, and policy was updated, although not sure if it is clear enough: https://www.mediawiki.org/wiki/Development_policy

Also, creating a new table will have almost 0 blockers, while modifying existing tables is a huge burden. I always recommend starting with a separate table, and when the feature is mature, see how to integrate directly on core, if it is agreed.

Thank the PK to the efforts devels everywhere are putting into fixing T17441.

I think the next steps from this investigation are:

  • Add IP range search support to a new special page (not Special:Contributions) via a new extension.
    • Searches happen against recentchanges table. Data from past 30 days only is shown.
    • An additional column or separate table might or might not be needed based on search performance.
    • Show start and end IPs for the range along with number of IPs affected.
    • UI will be similar to Contribsrange gadget (with OOUI maybe).
    • Support both CIDR and wildcard searches (latter is quite easy) for backwards compatibility.
    • Search limit for CIDR with IPv4 is up (down?) to /16 and for IPv6 to /19 beyond which searches won't be supported.

Optional add-on:

  • Add in a range calculator where user can enter start and end IP and find out the appropriate CIDR range.

Re-reading this I'm realizing we never commented with what we discussed, which is to piggyback off of the CheckUser extension. The rationale was:

  • Altering the revision table to meet our needs doesn't seem to be an option.
  • CU is already capable of querying for IPv4 and v6 and is very fast, thanks to the cuc_ip_hex column and some indexing magic.
  • A new extension would have to involve adding hooks to copy revision data, or reference it in some way, much like CU is already doing.
  • 90 days of data (as configured on most wikis) is usually enough for admins to make a decision in range blocking, since longer range blocks are uncommon.
  • CheckUser comes with a range calculator that we can reuse.
  • Most of the code is already written meaning we can this out that much quicker to meet the growing demand for such a tool. We may also get some free maintenance by using an existing and popular extension.
  • It's easy enough to not expose private data.
  • Most (all?) of WMF wikis have the CheckUser extension installed.

There are still some downfalls I've already ran into, such as the lack of being able to efficiently show the diff size, something we'd expect from a tool listing contributions. For that see T155734.