Page MenuHomePhabricator

Analyze blocked edit attempts
Closed, ResolvedPublic

Description

Recently, Editing team deployed instrumentations to track blocked edit attempts. (T310390). With the instrumentation, we'd like to explore and answer below questions :

  • To what extent do blocks play a role in preventing potentially productive editors from making an edit?
  • What countries do blocked edit attempts most frequently occur?
  • What are the differences in frequency of blocked edit attempts by wiki, editing interface, and platform?
  • What types of IP blocks (local/global, short-/long-term) are more frequently encountered?
  • How many distinct users are stopped from editing by a block?
Deliveries

Summary of initial findings: https://docs.google.com/document/d/1UmFw8-B4gad8ylWNffYl98szOZcw4L0h7LUHol_xw6w/edit?usp=sharing
Date sheet of all wikis and countries: link

Event Timeline

jwang triaged this task as Medium priority.Nov 15 2022, 3:09 AM
jwang edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
jwang moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

The initial analysis of blocked edit attempts rate by wiki shows some mismatch. Post the info here for discussion and investigation.
The edit attempts block rate is defined as:
the number blocked edit attempts / estimated total edit attempts = the number blocked edit attempts / (the number of sampled edit attempts / sample rate)

But the data shows schemas captured more blocked edit attempts than the estimated total edit attempts on anonymous users. Take metawiki for example, on most of days between 2022-10-13 and 2022-11-08, the block rate is more than 100%. Especially, the blocked edit attempts between 2022-10-31 and 2022-11-04 are very high.

wiki_dbevent_dateblocked_edit_attemptssampled_edit_attemptsEstimated edit attempts (=sampled_edit_attempts/0.0625)Estimated block rate (=blocked_edit_attempts/estimated edit attempts)
metawiki2022-10-13725944370881.024
metawiki2022-10-14725026943041.684
metawiki2022-10-15763132952641.450
metawiki2022-10-167494661105760.709
metawiki2022-10-178240668106880.771
metawiki2022-10-181117958793921.190
metawiki2022-10-1910838835133600.811
metawiki2022-10-201160851782721.403
metawiki2022-10-211037739463041.646
metawiki2022-10-221052635156161.874
metawiki2022-10-2312037636101761.183
metawiki2022-10-241055049579201.332
metawiki2022-10-25109591005160800.682
metawiki2022-10-26153361792286720.535
metawiki2022-10-27987826842882.304
metawiki2022-10-281022336057601.775
metawiki2022-10-291120718429443.807
metawiki2022-10-3086624145232037.338
metawiki2022-10-314132962153440120.144
metawiki2022-11-01384845241385699.804
metawiki2022-11-023455451752800123.409
metawiki2022-11-03270835228364874.242
metawiki2022-11-042963981592544116.509
metawiki2022-11-051054716426244.019
metawiki2022-11-061100131450242.190
metawiki2022-11-07943119431043.038
metawiki2022-11-081158717928644.046

It leads us to question:

  1. Is blocked edit attempt also captured in event.editattemptstep schema?
  2. Are the sample rates below correct for anonymous users? The number blocked edit attempts is extracted from event.mediawiki_editattempt_block schema. The number of sampled edit attempts is extracted from event.editattemptstep schema. Currently, the info about sample rates we have are:

The sample rate of event.mediawiki_editattempt_block schema is 100%.
The sample rate of event.editattemptstep schema on metawiki is 6.25% according to https://meta.wikimedia.org/wiki/Schema_talk:VisualEditorFeatureUse#Sampling .

  1. What else could cause the mismatch?

Query to extract data

WITH blocked_attempts AS (
SELECT `database` AS wiki_db, 
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS event_date,
count(1) AS blocked_edit_attempts
FROM event.mediawiki_editattempt_block
WHERE year=2022 
-- event is avaiable since 2022-10-12. 2022-10-13 is the first day with complete data
AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) BETWEEN '2022-10-13' AND '2022-11-08'
AND `database`='metawiki'
-- anonymous users
AND performer.user_id == 0
GROUP BY `database` ,
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))
),

edit_attempts AS (
SELECT wiki AS wiki_db, 
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS event_date,
count(1) AS sampled_edit_attempts
FROM event.editattemptstep
WHERE year=2022 
-- Blocked edit attempts events have complete data since 2022-10-13, 
-- select the same timing window for analysis
AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) BETWEEN '2022-10-13' AND '2022-11-08'
-- init when user clicks edit
AND event.action='init' 
-- anonymous users
AND event.user_id == 0
-- The default sample rate is 6.25%. The events at default sample rate are identified by event.is_oversample=false. 
-- Some wikis has non-default sample rate. The additional sampled events are identified by event.is_oversample=true.
AND NOT event.is_oversample 
AND wiki='metawiki'
GROUP BY wiki,
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))
)
SELECT t1.wiki_db,  t1.event_date, t1.blocked_edit_attempts,  t2.sampled_edit_attempts
FROM blocked_attempts AS t1
INNER JOIN edit_attempts AS t2 ON t1.wiki_db=t2.wiki_db 
AND t1.event_date=t2.event_date
ORDER BY t1.wiki_db,  event_date, t1.blocked_edit_attempts DESC
LIMIT 1000000

Hi @DavidL, @MNeisler, as we discussed, whether the blocked edit attempt is also captured in event.editattemptstep schema is different by interface. Therefore, I am looking into the breakdown by interface. However, I noticed that the string values of interface in mediawiki_editattempt_block schema and editattemptstep schema are different. Can you help to match them?

The string value of interface in two schemas are listed below.

Value of interface field in mediawiki_editattempt_block schema

mobilefrontend
discussiontools
wikieditor
visualeditor

Value of editor_interface field in editattemptstep schema

other
wikitext-2017
visualeditor
wikitext

Sorry, we've wound up with some inconsistent naming here. Specifically, interface in the editattemptsblocked schema is much closer to integration in the editattemptstep schema.

That said, this is the actual mapping you'll need to distinguish them:

editattemptsblockededitattemptstepeditattemptstepeditattemptstep
interfaceintegrationeditor_interfaceplatform
wikieditorpagewikitextdesktop
visualeditorpagevisualeditor or wikitext-2017desktop
mobilefrontendpagewikitext or visualeditor or wikitext-2017phone
discussiontoolsdiscussiontoolsvisualeditor or wikitext-2017desktop or phone

@DLynch, thank you for the info. I added one more column to document our discussion about how init events are stored in schema.

editattemptsblockededitattemptstepeditattemptstepeditattemptstepWas blocked attempts are logged in EditAttemptStep as event.action = 'init'? Assume it's sampled
interfaceintegrationeditor_interfaceplatform
wikieditorpagewikitextdesktopNo
visualeditorpagevisualeditor or wikitext-2017desktopYes
mobilefrontendpagewikitext or visualeditor or wikitext-2017phoneYes, but with lower incidences of events, because it’s one of the only interfaces that actually tells you before you try to edit that you’re blocked https://phabricator.wikimedia.org/T303995#7990254
discussiontoolsdiscussiontoolsvisualeditor or wikitext-2017desktop or phoneYes

Here are some initial findings from the analysis of blocked editing attempts. The initial analysis focused on answering below questions:

  • To what extent do blocks play a role in preventing potentially productive editors from making an edit?
  • What countries do blocked edit attempts most frequently occur?
  • What are the differences in frequency of blocked edit attempts by wiki, editing interface?

Note
The blocked edit attempts happened mostly on anonymous users. The block rate on anonymous users is high. Want to check with Editing team whether it's expected.
Code location:stat1005: /home/jiawang/share/T322682_blocked_edit_attempts_R_v01.ipynb
Date sheet of all wikis and countries: link

Please see the number of blocked edit attempts and block rate by interface below.

User typeinterfaceblocked edit attemptsestimated_edit_attemptsestimated edit attempts include blockedblock rate
Logged-indiscussiontools25051473751473751.70%
Logged-inmobilefrontend10167142784714278470.71%
Logged-invisualeditor12205143703414370340.85%
Logged-inwikieditor28732613606661647980.47%
Anonymousdiscussiontools21412972189721822.02%
Anonymousmobilefrontend1703000135796241357962412.54%
Anonymousvisualeditor5036494526907452690711.13%
Anonymouswikieditor19261335313789335064026838.04%

*Data timeframe: 2022-10-21 ~ 2022-11-08

Next steps

  • Discuss the initial findings and deep dive.
  • Analyze following questions
    • What are the differences in frequency of blocked edit attempts by platform?
    • What types of IP blocks (local/global, short-/long-term) are more frequently encountered?
    • How many distinct users are stopped from editing by a block?

@DLynch, After the adjustment to count for 1) different sampling rates across wikis , 2) wikieditor edit attempts not logged in EditAttemptStep, the block rate is still high on anonymous users as mentioned in T322682#8482611. On some of wikis, it is close to 99%. (link). @MNeisler and I feel the result is suspicious and want to double check with you. Is the sample rate before 2022-11-08 correct for anonymous users (mentioned in T322682#8429742)? What other possible causes? Thanks!

The blocked edit attempts happened mostly on anonymous users. The block rate on anonymous users is high. Want to check with Editing team whether it's expected.
Code location:stat1005: /home/jiawang/share/T322682_blocked_edit_attempts_R_v01.ipynb
Date sheet of all wikis and countries: link

Please see the number of blocked edit attempts and block rate by interface below.

User typeinterfaceblocked edit attemptsestimated_edit_attemptsestimated edit attempts include blockedblock rate
Logged-indiscussiontools25051473751473751.70%
Logged-inmobilefrontend10167142784714278470.71%
Logged-invisualeditor12205143703414370340.85%
Logged-inwikieditor28732613606661647980.47%
Anonymousdiscussiontools21412972189721822.02%
Anonymousmobilefrontend1703000135796241357962412.54%
Anonymousvisualeditor5036494526907452690711.13%
Anonymouswikieditor19261335313789335064026838.04%

*Data timeframe: 2022-10-21 ~ 2022-11-08

I'd expect things to be heavily skewed towards blocks on loggedout wikieditor, because that's where all the automated crawlers are going to be hitting. e.g. a naïve spider crawling every link on wikipedia is going to hit the edit-link on every single page, and if its IP is blocked as many are.... Similarly, those spam scripts that just go around and try to submit every form they can find with links, etc.

Granted, that's a gut feeling rather than something I can back up with data.

Here are some initial findings from the analysis of blocked editing attempts. The initial analysis focused on answering below questions:

  • To what extent do blocks play a role in preventing potentially productive editors from making an edit?
  • What countries do blocked edit attempts most frequently occur?
  • What are the differences in frequency of blocked edit attempts by wiki, editing interface?

Note
The blocked edit attempts happened mostly on anonymous users. The block rate on anonymous users is high. Want to check with Editing team whether it's expected.
Code location:stat1005: /home/jiawang/share/T322682_blocked_edit_attempts_R_v01.ipynb
Date sheet of all wikis and countries: link

Please see the number of blocked edit attempts and block rate by interface below.

User typeinterfaceblocked edit attemptsestimated_edit_attemptsestimated edit attempts include blockedblock rate
Logged-indiscussiontools25051473751473751.70%
Logged-inmobilefrontend10167142784714278470.71%
Logged-invisualeditor12205143703414370340.85%
Logged-inwikieditor28732613606661647980.47%
Anonymousdiscussiontools21412972189721822.02%
Anonymousmobilefrontend1703000135796241357962412.54%
Anonymousvisualeditor5036494526907452690711.13%
Anonymouswikieditor19261335313789335064026838.04%

*Data timeframe: 2022-10-21 ~ 2022-11-08

@jwang are these numbers for all types of blocks, or for IP blocks specifically?

Next steps

  • What types of IP blocks (local/global, short-/long-term) are more frequently encountered?

I am curious if there are any updates on this question.

@kostajh, Please see my answers below

are these numbers for all types of blocks, or for IP blocks specifically?

For all types of blocks

What types of IP blocks (local/global, short-/long-term) are more frequently encountered?

Based on the Aug 2023 monthly data,

  • The majority of blocked edit attempts are due to IP and IP range blocks.
  • The majority of blocked edit attempts are due to the blocks lasting longer longer than 6 months.

Here is the Aug 2023 monthly summary for blocked edit attempts by block types and block scope across all wikis.

block_typeblock_scopeblocked_edit_attempts
rangeglobal30
iplocal11765118
rangelocal4975573
userlocal49104
autoblocklocal6350

The Aug 2023 monthly summary for blocked edit attempts by length of the block across all wikis.

block days_to_expireblocked_edit_attempts
1-3 years9711153
181-365 days2010735
>3 years1665215
31-90 days1453900
91-180 days1141136
7 days596681
8-30 days217355

query

SELECT block_type, block_scope, count(1) AS blocked_edit_attempts
FROM event.mediawiki_editattempt_block
WHERE year=2023 and month=8
GROUP BY block_type, block_scope
ORDER BY block_scope, blocked_edit_attempts DESC
LIMIT 1000

SELECT
CASE 
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=7 THEN '7 days'
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=30 THEN '8-30 days'
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=90 THEN '31-90 days'
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=180 THEN '91-180 days'
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=365 THEN '181-365 days'
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=1095 THEN '1-3 years'
  ELSE '>3 years'
  END AS days_to_expire
, COUNT(1) AS blocked_edit_attempts
FROM event.mediawiki_editattempt_block
WHERE year=2023 and month=8
GROUP BY 
CASE 
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=7 THEN '7 days'
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=30 THEN '8-30 days'
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=90 THEN '31-90 days'
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=180 THEN '91-180 days'
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=365 THEN '181-365 days'
  WHEN DATEDIFF(to_date(block_expiry), to_date('2023-08-31'))<=1095 THEN '1-3 years'
  ELSE '>3 years'
  END
ORDER BY blocked_edit_attempts DESC
LIMIT 1000

I'd expect things to be heavily skewed towards blocks on loggedout wikieditor, because that's where all the automated crawlers are going to be hitting. e.g. a naïve spider crawling every link on wikipedia is going to hit the edit-link on every single page, and if its IP is blocked as many are.... Similarly, those spam scripts that just go around and try to submit every form they can find with links, etc.

Granted, that's a gut feeling rather than something I can back up with data.

More or less confirmed with the notebook, created for T347463: Add rel=nofollow to edit links.

@jwang would you be able to re-run your edit attempt and account creation analyses, but filtering out edit attempt blocks with browser_family containing bot|Bot|spider|Spider? That would hopefully give us a clearer picture about humans being impacted by blocks, as there is such a high volume of blocked bot edit attempts in the data.

jwang updated the task description. (Show Details)
Aklapper subscribed.

@jwang: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

Dreamy_Jazz assigned this task to jwang.
Dreamy_Jazz subscribed.

Looks like the analysis was done