
Investigation: how many event participants have been affected by IP Blocks
Closed, ResolvedPublic

Description

IP blocking affects a number of communities (https://meta.wikimedia.org/wiki/Talk:No_open_proxies/Unfair_blocking). Experienced editors find it easier to request a workaround for a block, such as the IPBlockExempt right on the wikis they are active on. However, IP blocks have acute effects on in-person events: during an event, it is often a person who is new to the wikis who experiences the block. We want to better understand the metrics and data around the impact of IP blocks on good-faith users attending events.

We have event logging data for when a block notice is shown to a user. @Iflorez has documented several ways to identify the individual users enrolled in Event Registration or in historical events in the Programs and Events Dashboard. In order to determine whether the event organizer tools being built by the Campaign Product team might be a viable route for reducing the impact of IP blocks, we need to better understand the baseline metrics around this problem.

We would minimally like to be able to evaluate the following questions:

  • What percent of event participants enrolled in an event have experienced an IP Block within a specific window of time (a month, quarter, year, etc.) around when they attended an event?
  • What percent of event participants experienced an IP Block during an event they were participating in? (This may require a more complex analysis.)
  • Which geographies have the most affected users?
  • If we can divide the data between in-person and online-first events, what differences do we see in impact?

Notes

Notes from @Iflorez:

A query to pull all names in the CampaignEvents:

user_ids_query = '''
    SELECT DISTINCT ce_participants.cep_user_id
    FROM ce_participants
'''
cep_user_ids = mariadb.run(user_ids_query, 'centralauth')

And a second query to match the CampaignEvents id to global username:

# Get usernames: map the CampaignEvents user ids to global usernames
user_names_participants_query = '''
    SELECT gu_name AS username,
           gu_id AS user_id
    FROM globaluser
    WHERE gu_id IN {cep_user_id_tuple}
'''
# Build the id tuple from the first query's results (assuming cep_user_ids
# is a DataFrame with a cep_user_id column)
query_vars = {'cep_user_id_tuple': tuple(cep_user_ids['cep_user_id'])}
user_names_p = mariadb.run(user_names_participants_query.format(**query_vars), 'centralauth')

You can view cells 9-18 in this GitHub repo for more on pulling editor data for CampaignEvents monthly reporting.

See also this thread

Prior work

Event Timeline

mpopov subscribed.

Status update: Product Analytics is going to own a hypothesis under WE 4.2, likely in Q2 (since we are fully booked for Q1), to the effect of:

If we develop a metric for measuring how many event participants are prevented from participating in events due to IP blocks, we will be able to measure baselines across different dimensions (such as geographic regions and languages) to understand the severity and distribution of the problem.

(To be refined by the hypothesis owner (TBD) with @kostajh as KR owner)

More details in Slack.

kostajh renamed this task from Investigation: how many event participants have been effected by IP Blocks to Investigation: how many event participants have been affected by IP Blocks.Jul 12 2024, 7:58 AM
kostajh added a project: WE4.2 Anti-abuse.
nettrom_WMF subscribed.

Assigning this to me as I'll pick this up and do a quick investigation into this.

I've completed an initial investigation into this, finding that 2% of users who signed up for one or more campaigns encountered at least one blocked edit attempt during those campaigns.

This analysis was purposefully kept lightweight. I pulled data on all users who signed up for a campaign that took place between Feb 1 and Apr 25, 2025, excluding campaigns that were marked deleted and users who had unregistered. Similarly, I pulled data on all blocked edit attempts by logged-in users in the same time period across all wikis. After mapping local user ids to global user ids (because campaign signups use global ids), I joined the two datasets and measured the overlap, limiting it to edit attempts that occurred between the start and end times of a given campaign.
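The join logic described in this paragraph can be sketched as follows. This is a toy in-memory version with made-up user ids and timestamps, not the actual data-lake queries; the real analysis also maps local user ids to global ids before joining, which is omitted here:

```python
# Minimal sketch of the overlap measurement: count users who had at least one
# blocked edit attempt between the start and end of a campaign they signed
# up for. All data below is hypothetical and for illustration only.
from datetime import datetime

# Hypothetical campaign sign-ups: (global_user_id, campaign_start, campaign_end)
signups = [
    (101, datetime(2025, 2, 1), datetime(2025, 2, 28)),
    (102, datetime(2025, 3, 1), datetime(2025, 3, 31)),
    (103, datetime(2025, 4, 1), datetime(2025, 4, 25)),
]

# Hypothetical blocked edit attempts: (global_user_id, attempt_timestamp)
blocked_attempts = [
    (101, datetime(2025, 2, 10)),  # inside user 101's campaign window
    (103, datetime(2025, 3, 15)),  # outside user 103's campaign window
]

def users_blocked_during_campaign(signups, blocked_attempts):
    """Return the set of users with >= 1 blocked edit attempt falling
    within a campaign window they signed up for."""
    affected = set()
    for user_id, start, end in signups:
        for attempt_user, ts in blocked_attempts:
            if attempt_user == user_id and start <= ts <= end:
                affected.add(user_id)
    return affected

affected = users_blocked_during_campaign(signups, blocked_attempts)
total = len({user_id for user_id, _, _ in signups})
print(f"{len(affected)}/{total} participants blocked during a campaign")
```

With these toy inputs only user 101 counts, since user 103's blocked attempt falls outside their campaign window.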

There are roughly 3,500 users who signed up for at least one campaign in this dataset, and roughly 70 users who had at least one blocked edit attempt during a campaign that they signed up for. While our Data Publication Guidelines allow me to report the exact numbers, I chose not to because the rough counts make the math easier.

Thank you for this analysis, @nettrom_WMF! At least from my end, I did not have any guess in advance for what this number could be, so this work helped establish a crucial baseline that we can use moving forward. I will share these findings with the Campaigns team and see if anyone has any questions.

Noting that I also think Feb 1 to April 25 was a good time period to analyze, as that period covered campaign cycles for gender organizers in March and climate organizers in April (so, strong topical interest from people in many regions, including those more likely to be impacted by IP blocks).

Wow! I'm so glad this research was requested. Thank you for these valuable insights @nettrom_WMF (and @ifried for flagging it!).

I was wondering: is there an opportunity to look into whether the 2% of blocked edit attempts during campaigns were geographically concentrated, or if they were more evenly distributed across regions? I’m particularly curious whether this data aligns with community feedback that IP-based restrictions tend to disproportionately affect users in Africa. If there is a noticeable concentration, could this point to a systemic bias in the blocking mechanisms—such as VPN/proxy detection or the lack of whitelisted IP ranges—rather than a random pattern?

Another follow-up will be to compare in-person campaign events with online events.

Thanks for taking this on, @nettrom_WMF!

I was wondering: is there an opportunity to look into whether the 2% of blocked edit attempts during campaigns were geographically concentrated, or if they were more evenly distributed across regions?

Yes, we have some ability to dig into this since the blocked edit attempt data contains the country code of the IP address associated with the edit attempt. One thing to note is that I have not looked at the location of the campaign events themselves, meaning I haven't estimated an expected percentage. This means that I'm unsure whether Africa or Asia is over-represented or not, but maybe you, @ifried, or someone else has insights here?

Note that I can only report percentages here, as the number of users falls below our data publication guidelines. In a similar fashion, I'm also only going to report the top 3.

Continent | Percentage of blocked users
Africa    | 32.9%
Europe    | 24.3%
Asia      | 24.3%

Another follow-up will be to compare in-person campaign events with online events.

I added the type of event to my dataset and then split both the number of sign-ups and number of users who encountered a block so that we can calculate the proportions by type of event. The counts are small in some cases, so I'm again not reporting exact numbers.

Event type | N signed up | Percent
In-person  | <1,000      | 1.5%
Online     | <2,350      | 2.1%
Both       | <400        | 2.4%

Since the dataset isn't large, there's a lot of uncertainty about the "both" category. For the other two, it seems reasonably clear that in-person events are somewhat less likely to have a user encountering a blocked edit attempt.
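The per-type proportions above boil down to a simple division of blocked users by sign-ups. Here is a sketch with made-up counts chosen to land near the reported percentages (the real counts are withheld per the data publication guidelines):

```python
# Hypothetical per-event-type counts; the real numbers are not published.
signed_up = {"in-person": 1000, "online": 2300, "both": 375}
blocked = {"in-person": 15, "online": 48, "both": 9}

# Percent of sign-ups for each event type that encountered a block,
# rounded to one decimal place.
percent_blocked = {
    event_type: round(100 * blocked[event_type] / signed_up[event_type], 1)
    for event_type in signed_up
}
print(percent_blocked)  # {'in-person': 1.5, 'online': 2.1, 'both': 2.4}
```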

Lastly, I combined the two questions and looked at event type and which continents show up. Again I'll look at the top 3 continents, and only report this for in-person and online events. As in the first table, the percentages here are out of all users encountering a block while signed up for an event of a particular type.

Event type | Continent | Percentage
Online     | Africa    | 33.3%
Online     | Asia      | 23.5%
Online     | Europe    | 21.6%
In-person  | Africa    | 40.0%
In-person  | Asia      | 26.7%
In-person  | Europe    | 26.7%

This makes me wonder whether in-person participants in Africa are particularly affected by blocks, and that this can easily be masked if we look at higher-level statistics. As mentioned, we don't have an expected percentage because I didn't try to locate events, but that might be a useful next step to consider.

Thanks very much! I'm also wondering if it's possible to extract information about the blocks that were involved, and perhaps also the block authors, block reasons, block duration.

I was wondering: is there an opportunity to look into whether the 2% of blocked edit attempts during campaigns were geographically concentrated, or if they were more evenly distributed across regions?

Yes, we have some ability to dig into this since the blocked edit attempt data contains the country code of the IP address associated with the edit attempt. One thing to note is that I have not looked at the location of the campaign events themselves, meaning I haven't estimated an expected percentage. This means that I'm unsure whether Africa or Asia is over-represented or not, but maybe you, @ifried, or someone else has insights here?

Hi @nettrom_WMF and others,

Yes, I would say that Africa, in particular, is probably over-represented in the data. Many of our first users were organizers in Sub-Saharan Africa, and they have continued to use Event Registration over the years. I don't know the statistical breakdown in terms of percentages, but I can definitely say that, yes, we do have a large share of users based in Africa. We also used to include the country & region in our data collection (see 2024 base data), which showed a large percentage of the events in the Sub-Saharan African region. However, we no longer collect this data, so my current read is based on observation rather than hard data.

As for Asia, I don't know! I know that some user groups in Asia (Wikimedians of Kerala, Iranian Wikimedians) have been using our tools, and we have recently seen some growth in users in the ESEAP region. However, I don't know if it is any larger than, say, our users in other regions.

Noting that this work is continuing as a hypothesis: WE4.2.21 Metric for measuring how event participants are impacted by IP blocks (Asana board, Foundation internal only)

@nettrom_WMF -- just got back from sabbatical --- this is wonderful and finally gets us a source of truth for some of the data.

One of the unintended effects of IP blocks is that they prevent account creation -- do we have any logs for that flow (i.e. someone goes from one of these event pages, tries to create an account (and thus has a return target page in the return url), but is prevented from doing so)? Do we log the return url anywhere? Could we probe whether the return urls are in the Event namespace?

One of the unintended effects of IP blocks is that they prevent account creation -- do we have any logs for that flow (i.e. someone goes from one of these event pages, tries to create an account (and thus has a return target page in the return url), but is prevented from doing so)? Do we log the return url anywhere? Could we probe whether the return urls are in the Event namespace?

We do not have this readily available, unfortunately. If this turns out to be something we want to prioritize figuring out, then I think adding information about the return url (e.g. page namespace) to the accountcreation/block schema would be the way to go.
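If the return url were added to the schema, a probe for the Event namespace could look something like the sketch below: parse the returnto parameter and check its namespace prefix. The parameter name and the "Event:" prefix are assumptions for illustration, not confirmed schema fields:

```python
# Hypothetical check: does an account-creation URL's returnto parameter
# point at a page in the Event namespace? Parameter name and namespace
# prefix are assumptions, not confirmed fields in any existing schema.
from urllib.parse import urlparse, parse_qs

def returnto_is_event_page(url):
    """True if the URL's returnto parameter targets an Event: page."""
    query = parse_qs(urlparse(url).query)
    target = query.get("returnto", [""])[0]
    return target.startswith("Event:")

url = ("https://meta.wikimedia.org/w/index.php"
       "?title=Special:CreateAccount&returnto=Event:Example_Editathon")
print(returnto_is_event_page(url))  # True
```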

I'm also wondering if it's possible to extract information about the blocks that were involved, and perhaps also the block authors, block reasons, block duration.

It's been a bit of a rabbit hole to dig into (and led to T396425 being filed, as you know). For now, I'll stick with reporting on the readily available data that's in the blocked edit attempt schema.

I've looked at the following metadata related to the blocks that campaign participants encountered:

  • Scope: local or global blocks.
  • Type: autoblock, user, IP range, or individual IPs.
  • Expiry: here I'll report the proportion of encountered blocks that are infinite blocks, and for those that have an expiry timestamp set give some observations about the time between the blocked edit attempt and expiry (which I'll refer to as "duration").

Block scope:
Most campaign participants encountered local blocks: they make up 86.1% of all blocked participants (conversely, global blocks make up 13.9%). [Edited on 2025-07-01 after double-checking the math to correct the proportions]

Global blocks:
I'll cover these first, because all the global blocks are IP range blocks, and none of them have an infinite expiry. The durations of those blocks fall into two groups: one with relatively short durations of less than 6 months, and one with longer durations of about 1.5 to 2.5 years.

Local blocks:
For local blocks, the distribution of block type (based on number of participants blocked) is as follows:

Block type | Proportion
Autoblock  | 3.1%
User       | 18.5%
IP         | 32.3%
IP range   | 46.2%

I'll ignore the autoblocks for the rest of the report.

Local user blocks:
When it comes to the user blocks, there weren't too many, so I investigated why they were blocked. If I were to categorize them, I'd label each with one of three categories: vandalism, poor quality/disruptive edits, and miscellaneous (e.g. username policy, personal attacks).

When it comes to block duration, I'll calculate proportions based on the number of blocked edit attempts, because that doesn't require me to figure out how to handle participants who encounter various types of blocks and durations. The majority (57.0%) of user blocks have an infinite expiry. Those that do have an expiry are always relatively short (less than two weeks), and fall into three groups: 1) a couple of days, 2) one week, 3) 10–14 days.
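The duration grouping used here can be sketched as a simple bucketing of the time between the blocked edit attempt and the block's expiry. The cutoffs below are illustrative, mirroring the groups described for local user blocks, not the exact boundaries used in the analysis:

```python
# Illustrative bucketing of blocked-attempt-to-expiry time ("duration").
# Cutoff values are assumptions chosen to match the groups described in
# the text, not the analysis's actual boundaries.
from datetime import datetime, timedelta

def duration_group(attempt_ts, expiry_ts):
    """Classify a block by time from the blocked attempt to its expiry."""
    if expiry_ts is None:  # no expiry timestamp means an infinite block
        return "infinite"
    d = expiry_ts - attempt_ts
    if d <= timedelta(days=3):
        return "a couple of days"
    if d <= timedelta(days=7):
        return "one week"
    return "10-14 days" if d <= timedelta(days=14) else "longer"

attempt = datetime(2025, 3, 1)
print(duration_group(attempt, None))                         # infinite
print(duration_group(attempt, attempt + timedelta(days=2)))  # a couple of days
print(duration_group(attempt, attempt + timedelta(days=12))) # 10-14 days
```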

Local IP range blocks:
While more participants are blocked by IP range blocks than by individual IPs, I'll cover this category first because it's easier to talk about. Of these blocks, 17.3% have an infinite expiry. For the vast majority that do have an expiry set, the duration appears to fall into three groups: less than 1 month, less than 6 months, and from 1 to 6 years. This is quite different from the global range blocks, where we saw that none of them are infinite and their durations were also typically shorter.

Local IP blocks:
I'm covering these blocks last because this is where we run into complications. While these blocks are labelled as local, it is also where multiple block ids show up in the dataset (and why T396425 got filed). I haven't calculated to what extent global blocks also show up in that list of block ids, but I did estimate it across all blocked edit attempts in the first week of June, where it's 82.5% (in other words: if an edit attempt gets blocked because of a local IP block, it's very likely it's also triggering at least one global block).

Block expiry for these local IP blocks is different from range blocks. None of the blocks had an infinite expiry. The duration is again grouped, this time into two groups: the first is less than 6 months, the second is from around 1 year to 8 years.

Open question:
I find these very long block expiry settings quite puzzling. Why would someone set a block for that many years into the future, instead of simply setting an infinite block? Or setting it much shorter (e.g. global range blocks don't appear to be as long)?

Engineering question:
Since local IP blocks are the only place where multiple blocks show up, is there something in the way MediaWiki handles blocks that results in other types of blocks not behaving that way?

Why would someone set a block for that many years into the future, instead of simply setting an infinite block?

Generally, IPs should not be indefinitely blocked since they can be reassigned over time, shared, or have other collateral. (See en:Wikipedia:Blocking IP addresses § Block lengths)

Or setting it much shorter (e.g. global range blocks don't appear to be as long)?

Global blocks have a much broader impact than local blocks, so their duration tends to be shorter.

Generally, I'd expect to see block durations on an IP (or range) increase each time a new one is applied to the same IP (range).

@JJMC89 : Thank you for chiming in with information about this! Those were great points, particularly the fact that a block might just be a continuation of a previous block and therefore naturally have longer duration than previously.

I'm closing this task as resolved as we've been able to get our estimate together with a good amount of contextual information, and partly because we'll be focusing on other things this quarter. The notebooks for this analysis can be found on Gitlab: https://gitlab.wikimedia.org/nettrom/T366222-blocked-campaign-participants

I filed T398475 from @Astinson's comment about campaign participants being blocked from creating accounts (in T366222#10875457) so that's documented and can be prioritized accordingly by the involved teams.