Page MenuHomePhabricator

Create new Special:RangeContributions page to support viewing contributions across an IP range
Open, NormalPublic13 Story Points

Description

There are numerous tools and guides to calculate a range, but very few that let you see contributions from that range. To my knowledge, we currently have XTools which only supports IPv4, and the Contribsrange gadget which does support IPv6 but only via wildcard. E.g. 2601:646:8200:7d30:* as opposed to 2601:646:8200:7d30::/64, lending confusion on whether the IPs you are seeing are within the target range, and how much collateral damage you'd cause by blocking that range. Both tools aren't particularly stable or reliable.

There are some tricks with IPv6, e.g. some ISPs are known to allocate a single customer to a /64 range, so we can usually safely do a rangeblock there, but this is not obvious to many admins. Having a tool to verify your target range is correct, and be able to carefully spot check the edits, would make a world of a difference and open the door for admins who are less comfortable with range blocks. CheckUsers I believe would especially find such a tool beneficial (see also T21796).

Requirements for version 1:

  • The tool will operate at Special:RangeContributions. Instead of typing in a single IP address, you simply type in the range.
  • If the data is limited to a certain number of days, this should be mentioned in the UI.
  • The results will mimic how the current gadget works, by showing a list of the IPs within the range with expand/collapse to show the individual contributions.
  • The interface should also show some metadata about the range: highest and lowest IP address, number of IP addresses included
  • Both IPv4 and IPv6 addresses should be supported.
  • Support both CIDR and wildcard searches (latter is quite easy)
  • Search limit for CIDR with IPv4 is up (down?) to /16 and for IPv6 to /19, beyond which searches won't be supported and an error message is displayed.

Wireframes

Results by time:

Results by IP:

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
kaldari added a comment.EditedDec 7 2016, 9:37 PM

For Results by IP, we are limiting it to the 25 most recent edits per IP. For Results by Date, it will use regular chronological pagination similar to the Special:Contributions page.

DannyH updated the task description. (Show Details)Dec 9 2016, 2:39 AM

@kaldari @MusikAnimal I updated the wireframes to reflect yesterday's meeting.

MusikAnimal moved this task from Ready to In Development on the Community-Tech-Sprint board.

For development and testing we'll need a bunch of edits from random IPs within a queryable range. See P4725 for how I did this. There may be an easier way.

Change 332538 had a related patch set uploaded (by MusikAnimal):
[WIP] Create new Special:RangeContributions page to view contribs of an IP range

https://gerrit.wikimedia.org/r/332538

Huji added a subscriber: Huji.Jan 17 2017, 11:57 PM

Change 332538 abandoned by MusikAnimal:
[WIP] Create new Special:RangeContributions page to view contribs of an IP range

Reason:
rebased and refs are all out of whack now

https://gerrit.wikimedia.org/r/332538

Change 333838 had a related patch set uploaded (by MusikAnimal):
[WIP] Create new Special:RangeContributions page to view contribs of an IP range

https://gerrit.wikimedia.org/r/333838

I don't think bundling this in CheckUser extension/using the cu_changes table for this is appropriate for multiple reasons:

  • Conceptually - it doesn't have to do with looking up private information about users
  • Limited to 90 days - while that's apparently acceptable for the proposed use case, I think you'll quickly have requests for older results, and people might expect that from a page named similarly to Special:Contributions, which does provide the full history
  • Filtering of deleted edits - CU intentionally doesn't delete deleted edits, but we probably want to open this special page up to everyone, so that's non-ideal.
  • Separation of private data - there are vague ideas about how to separate the private CU data to make it harder to access, etc. This would presumably get in the way of it. A theoretical SQLI might be easier if the data to expose is in the same table that's being queried.

So here's what I would propose as an alternative: Create a new database table that stores revision id and the hex form of the IP. (Maybe a second table for deleted edits too?) Then you can just join on the revision table for the rest of the information. It could go in core or a new extension, either seems fine to me.

Oh and I think CU copying in all the revision/logging data was a bad idea from a maintenance perspective. Every change to the normal tables has to be replicated there (e.g. T155734). In T145265#2627085 I suggested moving away from that and performing joins on logging, eventually we should do the same for edits/revision too.

I don't think bundling this in CheckUser extension/using the cu_changes table for this is appropriate for multiple reasons:

  • Conceptually - it doesn't have to do with looking up private information about users
  • Limited to 90 days - while that's apparently acceptable for the proposed use case, I think you'll quickly have requests for older results, and people might expect from a page named similarly to Special:Contributions, which does provide the full history
  • Filtering of deleted edits - CU intentionally doesn't delete deleted edits, but we probably want to open this special page up to everyone, so that's non-ideal.
  • Separation of private data - there are vague ideas about how to separate the private CU data to make it harder to access, etc. This would presumably get in the way of it. A theoretical SQLI might be easier if the data to expose is in the same table that's being queried.

So here's what I would propose as an alternative: Create a new database table that stores revision id and the hex form of the IP. (Maybe a second table for deleted edits too?) Then you can just join on the revision table for the rest of the information. It could go in core or a new extension, either seems fine to me.

First off, we never actually explained on Phab why we opted to include this as part of CU, so I've updated the investigation ticket: https://phabricator.wikimedia.org/T147664#2963973

I agree with you on pretty much everything. Bundling it in with CU does seem silly, but it helps that WMF wikis already have CU installed, and the data we need (for most use cases) is already available. I'm also amazed by how fast doing a range check is with CU, which I assume can be attributed not only to the IP hex column but the fact that all the basic data we need is in one table. Do you think joining with revision will slow it down a lot? Right now I'm also joining with page so I can get the ID of the latest revision. In regards to SQL injection, I'm no expert but it seems this should be easily preventable.

but it helps that WMF wikis already have CU installed

Deploying a new extension shouldn't be that difficult - and in general having more eyes review the code, the better. Or even just putting this feature in core seems reasonable to me.

which I assume can be attributed not only to the IP hex column but the fact that all the basic data we need is in one table. Do you think joining with revision will slow it down a lot

I'm not a mysql performance expert, but I think you'll get similar performance by joining on revision.

jcrespo added a subscriber: jcrespo.EditedJan 24 2017, 2:21 PM

I do not know much about all of this, and I would normally not comment if I lack the knowledge, but I was asked for help specifically at T155734#2962885, and I have to subscribe the general sentiment of @Legoktm, for the few reasons he expressed at T145912#2963977 and that I hinted at T155734#2963266.

I would like to add some assertions that are in general true, and are common misconceptions:

  • At the persistance layer (aka DB) you want to avoid duplicates as much as possible. The problem is not the storage in disk (which is normally not a big deal), but the memory needed to cache them and access effectively. Also, the more duplicate data you have, the more chance there is for inconsistancies. An edit should (ideally) transactionally write to revision, recentchanges and log (when necessary)- making sure that either the 3 of them go through or none of them. CU seems to have a more complex workflow which could result on inconsistent results on error.
  • Joins are not costly. In fact, with our load and size, joins may be faster than duplicate data because if we access extra tables in the most natural way (primary keys), access time is practically constant O(1), while data size is much lower, which normally it means more data can fit into memory. Do not be afraid of the JOIN by default, and only duplicate/denormalize as an optimization, and not the other way around. There are some cases where joins may be costly, but that requires complex queries such as order by columns on 2 separate tables, or other non-common examples.

Altering the revision table to meet our needs doesn't seem to be an option.

why? I mention that the revision may be a bad idea because that should be (in theory) static, how large it is, practical problems, etc., but additional columns could be added with an extra table (specially if it was going to be deployed as an extension), referencing either that or logging, or better, recentchanges (as it has a recentchanges vibe).

I think the problem in general, is that there is very confusing usage of revision/logging vs. recentchanges/checkuser. I think recentchanges should be used as much as possible (because it is shorter than log/revision), and it should be used in general as a "fast revisions", and modified as needed so we avoid joins towards them, but not be afraid to do so if it makes sense. if part of core; "Optional" fields can be created as extra tables referencing recentchanges. If recentchanges is the wrong place, because we want to cover all edits, the same can be done for revision/log.

Not sure if this helps or makes this more confusing.

No this definitely helps @jcrespo, thank you! You've relieved me of my main concern about joining against revision. To reiterate, we'd also be joining against page so we can get the ID of the latest revision. Not sure if that makes it worse, or if performance still won't be a problem?

The big thing we need that we don't have is a column to represent the IP (v4 and v6) as a unique hex value, so that we can do a BETWEEN to efficiently get revisions by a particular range. That can go in a new table with a foreign key to revision. We were not planning on doing the same for logging or other actions. I also am with @Legoktm in that we should maybe have another table for deleted edits, if that's the best way to do it. The tool would seem incomplete if there were no way to check deleted edits. I still need to figure out exactly what happens to the revisions when a page is deleted, versus revision deleted, suppressed, etc., but I assume our second table for deleted edits would act much like the archive table, except with only the revision ID and IP hex columns.

Thankfully this discussion happened before I got too far with development :) At this point I have only one standalone file for Special:RangeContributions. I'm really attracted the idea of working on something that'd go in core, maybe you all could help me in making that decision. I wonder if these new tables would be of use for other things at some point, maybe another extension, or some new functionality in core that we could add further down the road? If that's the case maybe core is the right way to go. The other thing is that I'm led to believe changes to core take considerably longer than an extension. As a general guideline, we hope to get this out by Q1 (July or so), but hopefully much sooner than that. Admins are struggling and badly need a tool like this, which again was a big reason we wanted to utilize CU, since that'd move things along that much quicker. However "doing it right" certainly takes priority :)

I proposed creating a new table to store the hex-ified IP because the vast majority of rows in revision/archive are made by logged in users and won't need this extra column (21% of enwp's 703,826,467 edits are logged out users).

I still need to figure out exactly what happens to the revisions when a page is deleted, versus revision deleted, suppressed, etc.,

If it's deleted, the row moves to the archive table. But if it's revdel/suppressed then the rev_deleted bitfield is set.

If you want to put this feature in core (which is totally reasonable IMO) I think we should just build it in the way users will expect the most - just modify Special:Contributions. I don't think it would be that complex tbh, you'd just check if the input is an IP range, and then use a join on the revision_ip_hex table (or whatever it's named). Same/similar change to Special:DeletedContributions / archive_ip_hex.

Because it's a new table we don't need to wait for the schema change process (it just needs a DBA review), and a maint script can backfill it.

The other thing is that I'm led to believe changes to core take considerably longer than an extension.

How about we prove this wrong? :)

+1 to all of that.

Thanks for the info! I'm feeling better and better about this.

If you want to put this feature in core (which is totally reasonable IMO) I think we should just build it in the way users will expect the most - just modify Special:Contributions. I don't think it would be that complex tbh, you'd just check if the input is an IP range, and then use a join on the revision_ip_hex table (or whatever it's named). Same/similar change to Special:DeletedContributions / archive_ip_hex.

Using the same Special:Contribs seems the most intuitive, but the UI will be different. For ranges, we want an IP range calculator, and the ability to view the results by IP or by date, and eventually more comprehensive information like the ISP and general geolocation data (see T152114). That might be a little too "in your face" for your average editor, and conditionally showing it based on if you've entered a range or single IP would be a little odd, methinks. Special:RangeContributions is more geared towards power users who need a deep analysis, and overall I feel like the UI will be too different to warrant the same Special page. The nice thing is you can still enter in a single IP to RangeContribs and get all the extras. Moreover, I've already had a lot of difficulty combing through SpecialContributions.php to find what I need for the range tool, and I worry about the frustrations I'll have trying to build on top of it.

Going to discuss all of this with my team today, and hopefully get back to coding soon. Many thanks to @Legoktm and @jcrespo for your insight, and inspiration! Let me know if you would like to talk more over hangout.

Hmm, I see. I think we could integrate a button or something on Special:Contribs that opens up the range calculator. If we're concerned that this extra information will be too much for average editors, maybe we should be putting it behind a user preference?

Moreover, I've already had a lot of difficulty combing through SpecialContributions.php to find what I need for the range tool, and I worry about the frustrations I'll have trying to build on top of it.

You mostly just need to modify ContribsPager it looks like, that handles all SQL and formatting. I'd be concerned about a new special page having to duplicate all the stuff that's already in Special:Contribs like feeds, hooks, warnings, inclusion, etc. And remember that we link to Special:Contribs *everywhere*. I don't think we want to modify all of those places (software + templates) to link to a new special page.
Glancing at your current patch, it doesn't handle any of that. I agree that it will be harder and take longer to modify the existing special page, but in the long run I believe it will be worth it.

Huji added a comment.Jan 25 2017, 3:00 AM

The range calculator should be part of core. See T152850

I think it would be fine to put Special:RangeContributions into core, but I think the interface for this may be too different from Special:Contributions for it to make sense to combine them. At the very least, I think it should be prototyped as a separate page first so that we can work out the bugs, have administrators test the interface, see if there are any performance issues, etc. This is going to be a specialized interface for a very small number of editors (basically admins and vandal fighters). Special:Contributions should be streamlined for the other 99.9% of editors who have no interest in ranges.

I agree that this may duplicate some functionality from Special:Contributions, but if the trade-off is between creating a small amount of extra maintenance burden or complicating a commonly used interface for the benefit of a small number of users, I would favor the former. I don't like the idea of adding more preferences, especially because it isn't clear where such a preference would live (there are no existing preferences for Special:Contributions).

So there are two different things being discussed here right - the technical ability to query for contributions across an IP range, and then some kind of enhanced contributions view right? I feel like those should be tackled as separate issues.

Snaevar added a subscriber: Snaevar.Feb 5 2017, 2:16 PM
MusikAnimal added subscribers: bzimport, scfc, Matanya and 8 others.

Good catch! They are the same. Merging into this one that is serving as the epic task. Thanks

Qgil removed a subscriber: Qgil.Feb 20 2017, 2:23 PM
DannyH moved this task from Estimated to Bug backlog on the Community-Tech board.

Change 333838 abandoned by MusikAnimal:
[WIP] Create new Special:RangeContributions page to view contribs of an IP range (v4 and v6)

Reason:
Definitely not happening in CheckUser anymore. See https://gerrit.wikimedia.org/r/#/c/349457 for the first iteration of the new approach!

https://gerrit.wikimedia.org/r/333838

Mz7 added a subscriber: Mz7.Sep 15 2017, 3:49 AM
Wiki13 added a subscriber: Wiki13.Dec 28 2017, 6:51 PM
jeblad added a subscriber: jeblad.Jan 20 2018, 5:17 PM

How would it be communicated to the users that this special page would not work for large ISPs that dynamically assigns IP-addresses? Also there are IPv6 solutions that actively does address obfuscation?

Unless I'm completely mistaken this solves a situation that only exists in some areas with abundant number of IP-addresses?

Isn't this already implemented? I returned last fall after taking a few months off and found I could see contribs from any IPv4 and IPv6 range right in the standard interface. It's awesome!

MusikAnimal removed MusikAnimal as the assignee of this task.Jan 22 2018, 5:57 PM

How would it be communicated to the users that this special page would not work for large ISPs that dynamically assigns IP-addresses? Also there are IPv6 solutions that actively does address obfuscation?
Unless I'm completely mistaken this solves a situation that only exists in some areas with abundant number of IP-addresses?

I'm not sure I follow. The Special page would work for any IP range within the configured CIDR limit. If there are multiple ISPs, I suppose they'd each be listed, with the known range. It's up to the user to ascertain how the ISP assigns IP addresses, whether it's a proxy, etc.

Isn't this already implemented? I returned last fall after taking a few months off and found I could see contribs from any IPv4 and IPv6 range right in the standard interface. It's awesome!

T163562 (basic support) has been implemented :) This ticket is about creating a new Special page for power users, that basically does what the old XTools range contributions tool did (geolocation, WHOIS, sorting by IP as well date, etc.). We haven't seen much interest in this since basic support was added, so frankly I question if/when it will happen. On that note I'll remove me as the assignee.

Good work so far, Leon. Since this task is now about a feature improvement prioritized by our team I am going to remove Community-Tech

How would it be communicated to the users that this special page would not work for large ISPs that dynamically assigns IP-addresses? Also there are IPv6 solutions that actively does address obfuscation?
Unless I'm completely mistaken this solves a situation that only exists in some areas with abundant number of IP-addresses?

I'm not sure I follow. The Special page would work for any IP range within the configured CIDR limit. If there are multiple ISPs, I suppose they'd each be listed, with the known range. It's up to the user to ascertain how the ISP assigns IP addresses, whether it's a proxy, etc.

The lease time for an IP address can be very short, and there are no guarantee the IP address won't be reclaimed. The lease time is typically shorter for more crowded areas. Without knowing the lease time this will still be guesswork, only automated guesswork.

The lease time for an IP address can be very short, and there are no guarantee the IP address won't be reclaimed. The lease time is typically shorter for more crowded areas. Without knowing the lease time this will still be guesswork, only automated guesswork.

True, but that's the same for Special:Contributions. Regardless of whether you're talking about a single IP address or a range, there is no guarantee that it hasn't been reassigned. I think admins that do range blocks know that. Perhaps it would be worth adding more documentation to https://www.mediawiki.org/wiki/Help:Range_blocks though.

Izno added a subscriber: Izno.Sep 13 2018, 3:59 PM