
Mediawiki history has no data on IP blocks
Open, Normal, Public


The Anti-Harassment Tools team has developed the partial blocks feature, which allows for article- or namespace-specific blocks (T2674). To understand the effectiveness of that tool, we are interested in calculating various statistics on blocks (e.g. number of sitewide blocks created in a given month, number of partial blocks created in a given month, etc.). It would be beneficial to have this for both registered and IP users, as blocks can affect either.

The mediawiki_user_history table in the Data Lake appears to have the relevant information for registered users, storing both the start and end timestamps for blocks. This is awesome!

From what I can tell by querying them, the mediawiki_user_history table only contains data on registered users, and the mediawiki_history table does not contain any block information for non-registered users. For example, this query:

SELECT *
FROM mediawiki_history
WHERE snapshot = '2018-10'
AND wiki_db = 'nowiki'
AND event_user_is_anonymous = TRUE
AND size(event_user_blocks_historical) > 0

…returns no rows.

I searched Phabricator but could not find an existing task describing this problem, so I'm creating this one to document it and start a discussion. It would be great if the mediawiki history tables in the Data Lake also contained information on IP blocks, but I am unsure if that's even possible (meaning that we'll need to dig data out of the logging table instead).
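To illustrate the logging-table fallback mentioned above, here is a minimal sketch of how block events against IP targets could be counted per month once rows have been exported from the logging table. It assumes the standard MediaWiki logging columns log_type, log_action, log_timestamp (a 14-digit YYYYMMDDHHMMSS string) and log_title, with the block target stored in log_title. The function names and the sample rows in the usage note are hypothetical, not taken from any existing job.

```python
import ipaddress
from collections import Counter


def is_single_ip(target):
    """Return True if the block target parses as one IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(target)
        return True
    except ValueError:
        # Usernames and CIDR ranges both end up here.
        return False


def monthly_ip_blocks(rows):
    """Count newly created blocks on single IPs, keyed by month (YYYYMM).

    rows: iterable of (log_type, log_action, log_timestamp, log_title)
    tuples, assumed to mirror the MediaWiki logging table columns.
    """
    counts = Counter()
    for log_type, log_action, log_timestamp, log_title in rows:
        # Only block creations; 'reblock' and 'unblock' are skipped here.
        if log_type == "block" and log_action == "block" and is_single_ip(log_title):
            counts[log_timestamp[:6]] += 1
    return dict(counts)
```

For example, calling `monthly_ip_blocks` on rows such as `("block", "block", "20181005120000", "192.0.2.1")` yields a dictionary keyed by month strings like `"201810"`. Telling sitewide blocks apart from partial ones would additionally require inspecting the log parameters, which this sketch does not attempt.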

Event Timeline

Restricted Application added subscribers: MGChecker, jeblad. Dec 10 2018, 8:25 PM

Hi @nettrom_WMF ,
Indeed, the mediawiki_history table doesn't contain historical blocks for IPs (or groups, for that matter, but that's not relevant here).
The approach taken when rebuilding user history from the logging table was to concentrate on registered users, as IPs are by nature shared between multiple users, change over time, etc.
The use case you describe here makes a lot of sense, and while we currently have too many things on our plate, I'll talk with the team about prioritizing adding blocks for IPs.

fdans triaged this task as Normal priority. Dec 13 2018, 5:56 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

@JAllemandou: Thanks for meeting up with me during All Hands to discuss this, and also for giving me handy tips on working with the Data Lake, I really appreciated that! One thing we discussed was whether there is a need to support blocks on IP ranges as well as on single addresses. I looked into that and found that @TBolliger and I had already discussed it, and we're only interested in blocks of single IPs.
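As a rough sketch of what "single IPs only" could mean in practice, the snippet below classifies a block target as a single address, a CIDR range, or a registered username. It assumes block targets are available as plain strings (for example the log_title of a block log entry) and that ranges are written in CIDR notation; the function name and the three labels are made up for illustration.

```python
import ipaddress


def classify_block_target(target):
    """Classify a block target string.

    Returns "single_ip", "ip_range", or "registered_user".
    """
    # A bare IPv4/IPv6 address is a single-IP block.
    try:
        ipaddress.ip_address(target)
        return "single_ip"
    except ValueError:
        pass

    # Assumption: ranges are written in CIDR notation such as 192.0.2.0/24.
    if "/" in target:
        try:
            # strict=False accepts ranges whose host bits are not zeroed.
            ipaddress.ip_network(target, strict=False)
            return "ip_range"
        except ValueError:
            pass

    # Anything else is treated as a username.
    return "registered_user"
```

With this split, only targets labelled `"single_ip"` would be kept for the statistics described in this task, and range blocks could be dropped or counted separately.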

Thanks @nettrom_WMF for the follow up :)
I'll try to include that in the next bunch of big changes I'm working on for mediawiki-history :)

Actually, I haven't had time to tackle this issue in this round of changes, sorry about that :(
Keeping the task in the backlog of things to do for mediawiki-history.