Page MenuHomePhabricator

[SPIKE] Investigate getting number of past block for IP (including IP ranges) [8H]
Closed, ResolvedPublicSpike

Description

From T270318: Block info in the popup [M], getting an accurate count for the number of past blocks for an IP is more complicated than originally expected:

https://gerrit.wikimedia.org/r/723530 introduces the BlockManager::getIPBlock() method, which gets blocks that apply to the given IP address, including system blocks.

  • You can't query the logging table by IP range whereas you can the ip_blocks table. However, the ip_blocks table isn't always guaranteed to be exhaustive as expired blocks could be purged by running the maintenance/purgeExpiredBlocks.php script in MediaWiki Core [and are purged whenever a new block is inserted into the database]
Questions
  1. How could we keep an accurate record of the number of past blocks affecting an IP
Notes
  1. We could use the ip_changes table as inspiration – it's a table created to return very specific information for an IP or a range of IPs.
  2. https://www.mediawiki.org/wiki/Manual:Block_and_unblock
  3. Could/we should create a hook that fires whenever a block is inserted into the database?
  4. https://www.mediawiki.org/wiki/Manual:Ipblocks_table
  5. https://www.mediawiki.org/wiki/Manual:Logging_table

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald TranscriptOct 13 2021, 3:26 PM
Niharika triaged this task as Medium priority.Oct 19 2021, 5:07 PM
ARamirez_WMF renamed this task from [SPIKE] Investigate getting number of past block for IP (including IP ranges) to [SPIKE] Investigate getting number of past block for IP (including IP ranges) [8H].Oct 20 2021, 4:43 PM

We could create a separate table that stores info more specific to this function, this seems to be the easier solution
However I tried to find why we purge ip_blocks but I wasn't able to follow the git history only found this commit(Hard deprecate AuthManager::singleton),
my thinking is if we create a new table to store this, its going to have the same issues that led to the decision of purging ip_blocks.

Can we combine https://gerrit.wikimedia.org/r/c/mediawiki/core/+/730315 with removing purge functionality in insertBlock?

In a case where we choose to go with a new table , we can also use this hook(listed above) to add entries to the that table.
To prevent and sort of bottlenecks or overheads we can write to the DB async(use a queue/job).
we could also have an additional retry job in case an update to the DB fails for some reason.

We could create a separate table that stores info more specific to this function, this seems to be the easier solution

What would this table look like? ip_range_start, ip_range_end, count?

What if a specific IP gets blocked and then the same IP gets blocked as part of a range block? What would the table contain?

However I tried to find why we purge ip_blocks but I wasn't able to follow the git history only found this commit(Hard deprecate AuthManager::singleton),
my thinking is if we create a new table to store this, its going to have the same issues that led to the decision of purging ip_blocks.

I've just checked the size of the ipblock table for enwiki and it's got 1,551,957 rows. The more I think about this, That's 1.5M rows with regular (likely) purging. If we were to store historic data across all wikis I imagine that table would get large quickly.

I like the table idea to pull data and I agree that we want to avoid overloading tables with rows.

For our use, it seems we just want a total of blocks over all wikis. Could we just increment the block counts for an IP whenever the onBlockIp hook is called? Our table could then be just IP and block_counts.

Regarding the IP Range block, and how often an IP is specifically targeted versus how often it's targeted as part of an IP Range block, do we want to display the exact number of blocks, or could we instead show a range? Like Less than 10 blocks, between 50 and 100 blocks, etc. Could this be an alternative solution?

I like the table idea to pull data and I agree that we want to avoid overloading tables with rows.

For our use, it seems we just want a total of blocks over all wikis. Could we just increment the block counts for an IP whenever the onBlockIp hook is called? Our table could then be just IP and block_counts.

Regarding the IP Range block, and how often an IP is specifically targeted versus how often it's targeted as part of an IP Range block, do we want to display the exact number of blocks, or could we instead show a range? Like Less than 10 blocks, between 50 and 100 blocks, etc. Could this be an alternative solution?

Does ip_blocks store unique IP, or can we find a count of unique IP @phuedx ? Then maybe we can see if @AGueyte 's suggestion is feasible?

For enwiki:

+---------+----------------+--------------------------------+----------------------+-----------+
|    n    | n_with_address | n_with_address_and_range_block | n_distinct_addresses | n_expired |
+---------+----------------+--------------------------------+----------------------+-----------+
| 1487154 |        1439883 |                          46864 |              1439852 |         4 |
+---------+----------------+--------------------------------+----------------------+-----------+

Sources:

Edit

Number of blocks that have been inserted: 11,962,908. See https://quarry.wmcloud.org/query/59858

87.57% of all blocks on enwiki have expired and been purged. 90.63% of the remaining blocks will never expire and be purged. Unfortunately, we can't easily know whether those blocks that expired and were purged were for unique IP addresses or not. It might be possible to get this information from the logging table.

As I said during yesterday's AHT Standup meeting, I think it would be a good idea to reach out to a DBA to get a sense of what is acceptable in terms of number of rows. It may well be the case that we could stop purging expired blocks or we could expire blocks after some configurable time (say 90 days).

As I said during yesterday's AHT Standup meeting, I think it would be a good idea to reach out to a DBA to get a sense of what is acceptable in terms of number of rows.

We should ask in the wikimedia-databases channel on IRC.

👋 Hi, I'm database architect in data persistence team and was summoned to help here. I generally like the idea and I don't think we would block new features (unless it would cause major issues). What would be great is to do the design together to avoid issues later in the future. I have some random thoughts I try to give here:

  • Number of rows per se it's not a useful metric, it heavily depends on the table schema. So I would love to see a suggested schema to refine.
  • Currently ipblocks table is just 220MB in enwiki, a tenfold increase wouldn't cause an issue but it can get complicated and cause issues later (by redaction of values in the memory cache, etc.)
  • What do you want to do about global blocks? Are you thinking of duplicating the data or querying the two tables twice? (So you would also need to introduce a new table in centralauth extension)
  • Adding a counter column to ip_blocks sounds nice given that having less tables reduces the backup workload, filtering for cloud and maintenance cost in the code but a- it wouldn't keep the data for expired blocks (unless these data is not needed, I need to know the product requirements better.) b- querying for ranges can get a bit tricky (but not too much) c- updating it can get a bit complicated, would you update the counter for a range block if the new block is just one ip inside that range?
  • Delayed purge (a month/90 days) sounds pretty good, it won't increase the table size too much. If it satisfies product requirements, it's the simplest solution.
  • If you really need to keep all the historical data (forever and since the beginning of time), I suggest rethinking design of ipblocks table and do all of them together.
    • All of tinyint columns can be moved to a dedicated table and called blocktype. It would make the tables conceptually more cohesive.
    • ipb_address possibly need to be turned into int, for CIDR ranges, you can just add another column (maybe actually split user block and ip blocks to different tables?)
  • Keep in mind that checking past blocks for IP ranges is basically impossible. e.g. there might be an overlap in a previous block and the range that is being queried. It's logically too complex.

HTH, let me know if I can help more. Please keep in loop for the work!

@Prtksxna @phuedx and I talked about this earlier today and it sounds like the idea of a delayed purge (90 days or longer, if possible) would meet our needs. Displaying very old blocks is probably not very helpful in most patrolling cases. Keeping records around for 90 days would be an acceptable place to start.

@Tchanders: AIUI you're the member of Anti-Harassment-Team with the most experience with blocks. Would there be any other side effects of delaying the purge (other than displaying X-day-old blocks to patrollers)?

👋 Hi, I'm database architect in data persistence team and was summoned to help here. I generally like the idea and I don't think we would block new features (unless it would cause major issues). What would be great is to do the design together to avoid issues later in the future. I have some random thoughts I try to give here:

  • Number of rows per se it's not a useful metric, it heavily depends on the table schema. So I would love to see a suggested schema to refine.
  • Currently ipblocks table is just 220MB in enwiki, a tenfold increase wouldn't cause an issue but it can get complicated and cause issues later (by redaction of values in the memory cache, etc.)

Thanks, @Ladsgroup. FYI we (Anti-Harassment-Team) are also exploring a table to store the number of contributions from IPs across all wikis (likely in a global ip_changes table) in T292623: [SPIKE] Investigate getting global contribution information for IP Info [8H].

  • What do you want to do about global blocks? Are you thinking of duplicating the data or querying the two tables twice? (So you would also need to introduce a new table in centralauth extension)
  • Adding a counter column to ip_blocks sounds nice given that having less tables reduces the backup workload, filtering for cloud and maintenance cost in the code but a- it wouldn't keep the data for expired blocks (unless these data is not needed, I need to know the product requirements better.) b- querying for ranges can get a bit tricky (but not too much) c- updating it can get a bit complicated, would you update the counter for a range block if the new block is just one ip inside that range?
  • Delayed purge (a month/90 days) sounds pretty good, it won't increase the table size too much. If it satisfies product requirements, it's the simplest solution.
  • If you really need to keep all the historical data (forever and since the beginning of time), I suggest rethinking design of ipblocks table and do all of them together.
    • All of tinyint columns can be moved to a dedicated table and called blocktype. It would make the tables conceptually more cohesive.
  • Keep in mind that checking past blocks for IP ranges is basically impossible. e.g. there might be an overlap in a previous block and the range that is being queried. It's logically too complex.

It feels as if we've landed on delaying the purge – just yesterday, @Prtksxna pointed out that displaying blocks for IPs from years ago wouldn't be valuable to our users as the users of those IP addresses have likely changed.

We'd be happy to add missing information to the task where it's missing or to have a short meeting to explain these two investigations if necessary. Briefly though, we're looking to show users of the IP Info tool the numbers of active and past blocks against an IP address, e.g. see

image.png (682×1 px, 95 KB)

  • ipb_address possibly need to be turned into int, for CIDR ranges, you can just add another column (maybe actually split user block and ip blocks to different tables?)

AIUI the ipblocks.ipb_range_start and .ipb_range_end fields allow querying for IP ranges. That said, maybe it would be best to be consistent with the ip_changes table?

Thanks and let me know if I can help on anything, I'll be commenting on the other ticket now.

AIUI the ipblocks.ipb_range_start and .ipb_range_end fields allow querying for IP ranges. That said, maybe it would be best to be consistent with the ip_changes table?

I haven't landed on a good design for ip and ip range storage yet but both of them are bad. ip_changes store the value as varbinary(35) in a polymorphic column (trying to handle IPv4 and IPv6 at the same time) and it's a common misconception: varbinary is not that much better than binary, while it's smaller in disk storage, it's the same in memory where storage actually matters.

I will spend some time on how we should store IPs in database and come back with a design.

@Tchanders: AIUI you're the member of Anti-Harassment-Team with the most experience with blocks. Would there be any other side effects of delaying the purge (other than displaying X-day-old blocks to patrollers)?

I believe there are many places where rows from the ipblocks table are selected without checking if the blocks are expired. Whether all of these only concern displaying the blocks to other users, I'd have to spend some time looking into that...

In general where actions are being performed (e.g. when attempting to block and IP, or when an anonymous user is attempting to edit) it looks as though the DatabaseBlock::newFrom* methods are largely used, which won't return expired blocks.

But we should go through this with a fine-tooth comb before delaying the purge, and probably find somewhere to communicate that the expiry should always be checked too.

@Tchanders: AIUI you're the member of Anti-Harassment-Team with the most experience with blocks. Would there be any other side effects of delaying the purge (other than displaying X-day-old blocks to patrollers)?

I believe there are many places where rows from the ipblocks table are selected without checking if the blocks are expired. Whether all of these only concern displaying the blocks to other users, I'd have to spend some time looking into that...

In general where actions are being performed (e.g. when attempting to block and IP, or when an anonymous user is attempting to edit) it looks as though the DatabaseBlock::newFrom* methods are largely used, which won't return expired blocks.

But we should go through this with a fine-tooth comb before delaying the purge, and probably find somewhere to communicate that the expiry should always be checked too.

Thanks, @Tchanders. That sounds like work that is (arguably) a separate spike – perhaps something like "[SPIKE] Identify areas affected by delaying purging of blocks by X days"?

AIUI the ipblocks.ipb_range_start and .ipb_range_end fields allow querying for IP ranges. That said, maybe it would be best to be consistent with the ip_changes table?

I haven't landed on a good design for ip and ip range storage yet but both of them are bad. ip_changes store the value as varbinary(35) in a polymorphic column (trying to handle IPv4 and IPv6 at the same time) and it's a common misconception: varbinary is not that much better than binary, while it's smaller in disk storage, it's the same in memory where storage actually matters.

I will spend some time on how we should store IPs in database and come back with a design.

Closing this 🧵:

So TLDR is that if you want to store it for a short period of time, even three months, just go with Hex, it's simpler, ready, etc.

Since the current intention is to delay the purge for 90 days, there's no need to revisit the schema of the ipblocks table.