Page MenuHomePhabricator

Investigate revertrisk threshold generation for enwiki
Closed, ResolvedPublic

Description

We would like to run the same analysis ran in T392148: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to to extract the threshold corresponding to a false positive rate of less than 15% for English Wikipedia.
As part of this investigation we want to figure out if the current way of calculating thresholds (jupyter notebook or converted python script) is a scalable enough way for larger wikis.
In the case where we find that the current way doesn't scale we should identify potential solutions that would allow us to run this analysis on all wikis.

Event Timeline

Initial try to run threshold-analysis.py for enwiki. It seems that the query that runs retrieve zero data for enwiki, although is the exact same query which ran on T392148: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to ticket.

Query and Results:

1QUERY =
2WITH revert_risks AS (
3 SELECT
4 rr.wiki_db AS wiki_db,
5 rr.rev_id AS revision_id,
6 rev_revert_risk AS revert_risk_score,
7 rr.rev_timestamp AS edit_timestamp,
8 mwh.event_user_text,
9 mwh.event_user_revision_count AS user_edit_count,
10 -- was it a newly created account. defined as user that created account 90 days before edit
11 CASE WHEN rru.user_id IS NOT NULL
12 AND unix_timestamp(rr.rev_timestamp) - unix_timestamp(rru.user_registration_timestamp) < 7776000
13 THEN TRUE
14 ELSE FALSE
15 END AS is_new_account,
16 -- was the user anonymous
17 CASE
18 WHEN event_user_is_anonymous THEN TRUE
19 ELSE FALSE
20 END AS is_anon,
21 CASE
22 WHEN SIZE(event_user_is_bot_by_historical) = 0 THEN FALSE
23 ELSE TRUE
24 END AS user_is_bot,
25 -- edit reverts another edit
26 revision_is_identity_revert AS is_revert,
27 rr.page_title,
28 page_namespace_is_content AS namespace_is_content,
29 -- was this edit reverted
30 CASE
31 WHEN rr.rev_is_identity_reverted THEN TRUE
32 ELSE FALSE
33 END AS is_reverted,
34 -- time to revert
35 rr.rev_seconds_to_identity_revert AS time_to_revert,
36 -- was the user extended confirmed
37 CASE
38 WHEN ARRAY_CONTAINS(event_user_groups, 'extendedconfirmed') THEN TRUE
39 ELSE FALSE
40 END AS is_extendedconfirmed,
41 CASE
42 WHEN bc.user_text IS NOT NULL THEN TRUE
43 ELSE FALSE
44 END AS is_blocked,
45 CASE
46 WHEN amr.amr_rev_parent_id IS NOT NULL THEN TRUE
47 ELSE FALSE
48 END AS was_reverted_by_automoderator
49 FROM
50 risk_observatory.revert_risk_predictions rr
51 JOIN
52 wmf.mediawiki_history mwh
53 ON rr.wiki_db = mwh.wiki_db
54 AND rr.rev_id = mwh.revision_id
55 LEFT JOIN wmf_product.automoderator_monitoring_snapshot_daily amr
56 ON rr.rev_id = amr.amr_rev_parent_id
57 -- was reverted by automoderator edits
58 AND is_amr_revert
59 AND amr.wiki_db = 'enwiki'
60 AND amr.snapshot_date = '2025-02-15'
61 LEFT JOIN event.mediawiki_user_blocks_change bc
62 ON rr.user_name = bc.user_text
63 AND rr.wiki_db = bc.`database`
64 AND bc.`database` = 'enwiki'
65 -- user blocked after edit
66 -- AND bc.meta.dt > rr.rev_timestamp
67 -- join to users dataset to identify new users
68 LEFT JOIN mneisler.canonical_revertrisk_users_2024 rru
69 ON mwh.event_user_id = rru.user_id
70 AND mwh.wiki_db = rru.wiki_db
71 WHERE
72 mwh.snapshot = '2024-11'
73 -- limit to indonesian Wikipedia
74 AND rr.wiki_db == 'enwiki'
75 -- limit to only revisions assigned revert risk score
76 AND rev_revert_risk IS NOT NULL
77 -- exclude adminstrators
78 AND
79 (
80 event_user_groups IS NULL
81 OR NOT ARRAY_CONTAINS(mwh.event_user_groups_historical, 'sysop')
82 )
83 AND event_timestamp > "2024-01-01 00:00:00"
84 AND event_timestamp < "2024-11-05 00:00:00"
85 ),
86
87 excl_self_reverts AS (
88 SELECT
89 rd.*
90 FROM
91 revert_risks rd
92 JOIN
93 wmf.mediawiki_history mwh
94 ON rd.revision_id = mwh.revision_first_identity_reverting_revision_id
95 AND rd.wiki_db = mwh.wiki_db
96 WHERE
97 snapshot = '2024-11'
98 AND rd.is_revert
99 -- exclude self reverts
100 AND NOT rd.event_user_text = mwh.event_user_text
101 )
102
103SELECT
104 *
105FROM
106 revert_risks
107WHERE
108 NOT revert_risks.is_revert
109UNION ALL
110SELECT
111 *
112FROM
113 excl_self_reverts ;
114
115============ - enwiki - ============
116 - Raw data shape: (0, 17)
117 - Duplicate rows found and removed: 0
118 - Clean data shape: (0, 17)
119 - Unique revision_ids: 0 | Data Shape: 0 | Same? : -> True
120 - Removing edits that are reverts from df | New Shape: (0, 17)
121 - Is any revert_risk_score NA? : False
122 - Is any user_edit_count NA? : False
123 - Is any time_to_revert NA? : False
124Traceback (most recent call last):
125 File "/srv/home/gkyziridis/revert_risk_threshold_for_all_wikies.py", line 392, in <module>
126 rc = ROC(actual=actual, probs=predicted, plot=True, name=wiki_name)
127 File "/srv/home/gkyziridis/revert_risk_threshold_for_all_wikies.py", line 113, in ROC
128 fpr, tpr, thresholds = roc_curve(actual, probs)
129 File "/home/gkyziridis/.conda/envs/2025-05-08T10.39.42_gkyziridis/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 216, in wrapper
130 return func(*args, **kwargs)
131 File "/home/gkyziridis/.conda/envs/2025-05-08T10.39.42_gkyziridis/lib/python3.10/site-packages/sklearn/metrics/_ranking.py", line 1150, in roc_curve
132 fps, tps, thresholds = _binary_clf_curve(
133 File "/home/gkyziridis/.conda/envs/2025-05-08T10.39.42_gkyziridis/lib/python3.10/site-packages/sklearn/metrics/_ranking.py", line 835, in _binary_clf_curve
134 pos_label = _check_pos_label_consistency(pos_label, y_true)
135 File "/home/gkyziridis/.conda/envs/2025-05-08T10.39.42_gkyziridis/lib/python3.10/site-packages/sklearn/utils/validation.py", line 2641, in _check_pos_label_consistency
136 raise ValueError(
137ValueError: y_true takes value in {} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.

After experimentation I observed that this query returns zero data for any wiki, even the ones that I was running for in the initial ticket.

I think we should find people to review this query and consult us to where the data exist at the moment.

Update

I achieved to run the threshold analysis for enwiki successfully.
The obstacle I faced was that I could not run the query for a bigger than one month window. English wiki has the highest volume of data which causes the machine to crash when querying data for more than 1 month.
The threshold analysis for the wikis in T392148: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to ran on a 12 months window between 2024-01-01 to 2025-01-01. \
This experiment ran on the latest data on batches (one month per batch), so the threshold analysis ran multiple times (once per batch).
You can find the results (per month) in the following paste:

1============ - enwiki - ============
2 - Date Window: 2024-12-01 00:00:00 - 2025-01-05 00:00:00
3 - Raw data shape: (3866111, 17)
4 - Duplicate rows found and removed: 180911
5 - Clean data shape: (3685200, 17)
6 - Unique revision_ids: 3685200 | Data Shape: 3685200 | Same? : -> True
7 - Removing edits that are reverts from df | New Shape: (3522442, 17)
8 - Is any revert_risk_score NA? : False
9 - Is any user_edit_count NA? : False
10 - Is any time_to_revert NA? : False
11 - ROC_enwiki.png saved!
12 - Optimal threshold for 15.0% FPR is: 0.8609287142753601
13 - confusion_matrix_enwiki.png saved!
14 - False Positive Rate is: 0.14999987694942868
15 - CONFUSION MATRIX -
16
17Predicted not reverted reverted
18Actual
19not reverted 2763092 487604
20reverted 84670 187076
21
22
23Optimal Thresholds calculated
24{'enwiki': 0.8609287}
25
26============ - enwiki - ============
27 - Date Window: 2025-01-01 00:00:00 - 2025-02-05 00:00:00
28 - Raw data shape: (3954832, 17)
29 - Duplicate rows found and removed: 216729
30 - Clean data shape: (3738103, 17)
31 - Unique revision_ids: 3738103 | Data Shape: 3738103 | Same? : -> True
32 - Removing edits that are reverts from df | New Shape: (3559861, 17)
33 - Is any revert_risk_score NA? : False
34 - Is any user_edit_count NA? : False
35 - Is any time_to_revert NA? : False
36 - ROC_enwiki.png saved!
37 - Optimal threshold for 15.0% FPR is: 0.8726013898849487
38 - confusion_matrix_enwiki.png saved!
39 - False Positive Rate is: 0.15000013720110358
40 - CONFUSION MATRIX -
41
42Predicted not reverted reverted
43Actual
44not reverted 2787878 491979
45reverted 85382 194622
46
47
48Optimal Thresholds calculated
49{'enwiki': 0.8726014}
50
51============ - enwiki - ============
52 - Date Window: 2025-02-01 00:00:00 - 2025-03-05 00:00:00
53 - Raw data shape: (3465504, 17)
54 - Duplicate rows found and removed: 180513
55 - Clean data shape: (3284991, 17)
56 - Unique revision_ids: 3284991 | Data Shape: 3284991 | Same? : -> True
57 - Removing edits that are reverts from df | New Shape: (3113856, 17)
58 - Is any revert_risk_score NA? : False
59 - Is any user_edit_count NA? : False
60 - Is any time_to_revert NA? : False
61 - ROC_enwiki.png saved!
62 - Optimal threshold for 15.0% FPR is: 0.8680040240287781
63 - confusion_matrix_enwiki.png saved!
64 - False Positive Rate is: 0.15000008746932014
65 - CONFUSION MATRIX -
66
67Predicted not reverted reverted
68Actual
69not reverted 2429423 428722
70reverted 78568 177143
71
72
73Optimal Thresholds calculated
74{'enwiki': 0.868004}
75
76============ - enwiki - ============
77 - Date Window: 2025-03-01 00:00:00 - 2025-04-05 00:00:00
78 - Raw data shape: (3821024, 17)
79 - Duplicate rows found and removed: 203813
80 - Clean data shape: (3617211, 17)
81 - Unique revision_ids: 3617211 | Data Shape: 3617211 | Same? : -> True
82 - Removing edits that are reverts from df | New Shape: (3439901, 17)
83 - Is any revert_risk_score NA? : False
84 - Is any user_edit_count NA? : False
85 - Is any time_to_revert NA? : False
86 - ROC_enwiki.png saved!
87 - Optimal threshold for 15.0% FPR is: 0.7739927768707275
88 - confusion_matrix_enwiki.png saved!
89 - False Positive Rate is: 0.1499996065068035
90 - CONFUSION MATRIX -
91
92Predicted not reverted reverted
93Actual
94not reverted 2700175 476500
95reverted 75224 188002
96
97
98Optimal Thresholds calculated
99{'enwiki': 0.7739928}
100
101============ - enwiki - ============
102 - Date Window: 2025-04-01 00:00:00 - 2025-05-05 00:00:00
103 - Raw data shape: (3616972, 17)
104 - Duplicate rows found and removed: 210338
105 - Clean data shape: (3406634, 17)
106 - Unique revision_ids: 3406634 | Data Shape: 3406634 | Same? : -> True
107 - Removing edits that are reverts from df | New Shape: (3247157, 17)
108 - Is any revert_risk_score NA? : False
109 - Is any user_edit_count NA? : False
110 - Is any time_to_revert NA? : False
111 - ROC_enwiki.png saved!
112 - Optimal threshold for 15.0% FPR is: 0.779203474521637
113 - confusion_matrix_enwiki.png saved!
114 - False Positive Rate is: 0.15000018345178204
115 - CONFUSION MATRIX -
116
117Predicted not reverted reverted
118Actual
119not reverted 2548353 449710
120reverted 73117 175977
121
122
123Optimal Thresholds calculated
124{'enwiki': 0.7792035}
125
126
127Average Threshold: {'enwiki': 0.83092}

Since the variance is pretty small in the threshold distribution (one threshold per month), we can drive into the assumption that we can use the average threshold as the final one:
Avg Optimal Threshold for 15% FPR: {'enwiki': 0.83092}

Change #1177446 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/mediawiki-config@master] ores-extension: add threshold for revertrisk in enwiki

https://gerrit.wikimedia.org/r/1177446

Change #1177446 merged by jenkins-bot:

[operations/mediawiki-config@master] ores-extension: add threshold for revertrisk in enwiki

https://gerrit.wikimedia.org/r/1177446

Mentioned in SAL (#wikimedia-operations) [2025-08-14T07:07:24Z] <gkyziridis@deploy1003> Started scap sync-world: Backport for [[gerrit:1177446|ores-extension: add threshold for revertrisk in enwiki (T400590)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-14T07:09:46Z] <gkyziridis@deploy1003> gkyziridis, isaranto: Backport for [[gerrit:1177446|ores-extension: add threshold for revertrisk in enwiki (T400590)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-14T07:19:31Z] <gkyziridis@deploy1003> Finished scap sync-world: Backport for [[gerrit:1177446|ores-extension: add threshold for revertrisk in enwiki (T400590)]] (duration: 12m 07s)