The Anti-Harassment Tools team has developed the partial blocks feature, which allows for article- or namespace-specific blocks (T2674). In order to understand the effectiveness of that tool, we are interested in calculating various statistics on blocks (e.g. number of sitewide blocks created in a given month, number of partial blocks created in a given month, etc). It would be beneficial to have this for both registered and IP users, as blocks can affect either.
The mediawiki_user_history table in the Data Lake appears to have the relevant information for registered users, storing both the start and end timestamps for blocks. This is awesome!
From what I can tell by querying them, the mediawiki_user_history table only contains data on registered users, and the mediawiki_history table does not contain any block information on non-registered users. For example this query:
SELECT * FROM mediawiki_history WHERE snapshot = '2018-10' AND wiki_db = 'nowiki' AND event_user_is_anonymous = TRUE AND size(event_user_blocks_historical) > 0 LIMIT 25;
…returns no rows.
I searched Phabricator but could not find much reference to this problem already, so I'm creating this task to document it and start a discussion. It would be great if the mediawiki history tables in the Data Lake also contain information on IP blocks, but I am unsure if that's even possible (meaning that we'll need to dig data out of the logging table instead.