Page MenuHomePhabricator

Investigation: Using log_search to query for logged actions against IPs in a given range
Open, Needs TriagePublic

Description

Separate from T187579, we want to be able to query logging where the target is an IP range (individual IPs are stored as the log_title, with a log_namespace of 2).

One idea is to use log_search (T187579#3979710). Let's use blocks of IPs as an example. So every time a block is made on an IP, we'd call ManualLogEntry::setRelations() to store information on it. We'd use a new type of ls_field, perhaps called ip_log_title, that indicates the ls_value is a hexadecimal representation of the IP address (which is the log_title for the corresponding row in logging). That way we can use BETWEEN on log_search to get log entries of IPs within a range. We'd need to also filter by log_type, in our example block in order to get the block log of all IPs within a given range.

On enwiki there are some 3.5 million blocks of IPs in logging -- a rough guess judging by SELECT COUNT(*) FROM logging WHERE log_type = 'block' AND log_title rlike '^[1-9]' AND log_page = 0, and again without log_page = 0 (includes IPs that have a userpage and accounts that start with a number).

Do we think log_search could work?

Soliciting feedback from DBAs and other smart people... thank you!

Event Timeline

On enwiki there are some 3.5 million blocks of IPs in logging -- a rough guess judging by SELECT COUNT(*) FROM logging WHERE log_type = 'block' AND log_title rlike '^[1-9]' AND log_page = 0, and again without log_page = 0 (includes IPs that have a userpage and accounts that start with a number).

Which means, roughly speaking (If we assume blocks are distributed evenly across all IPs, which is definitely not true but is an argument towards the "average case") using this method to find a /16 range would involve filesorting about 2000 rows. Doing a /24 would be about 44 rows.

The other thing to note here is that at least for blocks within an IP range, we'll want any blocks of subranges to appear in the results. For that maybe in log_search we'd store the hex value of the first IP in the range, that way your BETWEEN query we'll pick it up and get a reference to the block against the subrange (thx to Bawolff for the fine idea!). There is the caveat however that some big ranges may not fall in searches for small ranges, if things don't overlap correctly. Maybe that's OK, though?

jcrespo subscribed.

I wonder if MySQL is the right storage for these, as there seems to be a need to search instantly for large ranges, and our search platform has dedicated, specialized storage (elastic) for pattern matching.

LSobanski subscribed.

Removing the DBA tag and subscribing myself instead. Once there are specific actions for DBA please re-add us and/or @mention me.