
Data collection for the Knowledge Integrity Risk Composite Index
Closed, ResolvedPublic

Description

Collection of data for prototyping metrics of the knowledge integrity risks proposed in T316946

Event Timeline

Weekly updates:

  • We have decided to schedule calls with:
    • @diego and collaborators to explore how to integrate data on reverts and data on article quality.
    • Global Data & Insights team to exchange views in the context of knowledge integrity metrics.

Hi @MunizaA! @diego and I have been discussing adding data on vandalism and reverts to the risk observatory.

For the moment we have longitudinal graphs (monthly granularity) on the reverts ratio and the IP edits reverts ratio (link to graphs). Data is stored in the riskobservatory.monthly_wiki_stats Hive table (link to table) with the following columns:

  • month VARCHAR
  • wiki_db VARCHAR
  • edit_count BIGINT
  • editors_count BIGINT
  • special_editor_edits_ratio FLOAT
  • bot_edits_ratio FLOAT
  • anonymous_edits_ratio FLOAT
  • minor_edits_ratio FLOAT
  • reverts_ratio FLOAT
  • page_namespace_is_content_ratio FLOAT
  • user_seconds_since_previous_revision_avg FLOAT
  • reverts_anonymous_edits_ratio FLOAT
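For illustration, the monthly ratios in this schema could be derived from a revision-level table roughly as follows. This is a hypothetical pandas sketch: the input column names (is_reverted, is_anonymous, is_bot) are assumptions for the example, not the actual source schema.

```python
import pandas as pd

# Hypothetical revision-level input; the real pipeline reads from Hive,
# and these flag columns are illustrative assumptions.
revisions = pd.DataFrame({
    "month":        ["2022-01"] * 4 + ["2022-02"] * 2,
    "wiki_db":      ["enwiki"] * 6,
    "is_reverted":  [False, True, False, True, False, False],
    "is_anonymous": [True, False, True, False, False, True],
    "is_bot":       [False, False, False, False, True, False],
})

def monthly_wiki_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per (month, wiki_db), mirroring a few of the columns
    of riskobservatory.monthly_wiki_stats."""
    return (
        df.groupby(["month", "wiki_db"])
          .agg(
              edit_count=("is_reverted", "size"),
              reverts_ratio=("is_reverted", "mean"),
              bot_edits_ratio=("is_bot", "mean"),
              anonymous_edits_ratio=("is_anonymous", "mean"),
          )
          .reset_index()
    )

stats = monthly_wiki_stats(revisions)
print(stats)
```

The mean of a boolean column directly gives each ratio, which keeps the aggregation a single named-aggregation call per field.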

Based on your work at T314384, we would love to incorporate new fields like:

  • vandalism_count
  • vandalism_ratio
  • vandalism_reverts_ratio
  • seconds_to_revert_vandalism_avg

Your feedback would be highly appreciated, so thanks in advance for your interest, and we are happy to brainstorm together on this :)
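A hedged sketch of how these proposed fields might be derived, assuming a revision table that carries a vandalism flag and revert timestamps. All input column names here are hypothetical (and, as noted later in the thread, the model actually outputs a revert probability rather than a vandalism label):

```python
import pandas as pd

# Invented sample; is_vandalism, is_reverted, rev_ts and revert_ts are
# hypothetical columns used only to illustrate the field definitions.
revs = pd.DataFrame({
    "month": ["2022-01"] * 5,
    "wiki_db": ["enwiki"] * 5,
    "is_vandalism": [True, False, True, False, False],
    "is_reverted":  [True, False, True, True, False],
    "rev_ts": pd.to_datetime(["2022-01-01 10:00", "2022-01-02 11:00",
                              "2022-01-03 12:00", "2022-01-04 13:00",
                              "2022-01-05 14:00"]),
    "revert_ts": pd.to_datetime(["2022-01-01 10:05", None,
                                 "2022-01-03 12:20", "2022-01-04 13:10",
                                 None]),
})

revs["seconds_to_revert"] = (revs.revert_ts - revs.rev_ts).dt.total_seconds()
vandal = revs[revs.is_vandalism]

fields = {
    "vandalism_count": int(len(vandal)),
    "vandalism_ratio": len(vandal) / len(revs),
    # share of all reverts that undo vandalism (one possible reading)
    "vandalism_reverts_ratio": vandal.is_reverted.sum() / revs.is_reverted.sum(),
    "seconds_to_revert_vandalism_avg": vandal.seconds_to_revert.mean(),
}
print(fields)
```

Note that vandalism_reverts_ratio is interpreted here as "reverts of vandalism over all reverts"; the exact definition would need to be agreed on.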

Just to clarify: what we have is a "revert probability"; we can't claim this is "vandalism". Unlike the previous model, we have just one single score.
You might be interested in collecting abuse filter information. I have some code to do that, and from there you might be able to compute something like "abuse filter hits".

I see your point, thanks @diego! Out of curiosity, are you training the models with (a sample of) all reverts or have you filtered the data to specific reverts (e.g., vandalism and other known forms of abuse)?

Actually, we are already computing AbuseFilter metrics, but the reliability of this metric is questionable for two reasons: (a) false positives when rules are not well defined, and (b) projects differ widely in how extensively they use AbuseFilter (this is the major reason).

Weekly updates:

  • Call with a Global Data & Insights team member, who provided several data resources to review.
  • Data on article quality will need to be generated (existing values were computed only at the page level, not per revision).
  • Data on admin capacity (e.g., active_admin_count, active_admins_active_editors_ratio, active_admins_new_active_editors_ratio, active_admins_edits_ratio) has been computed for every month since 2015-01.
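As a minimal illustration, the admin-capacity fields named above reduce to simple divisions of monthly counts. The numbers are made up, and interpreting active_admins_edits_ratio as "share of edits made by active admins" is an assumption:

```python
# Made-up monthly counts for one wiki; only the ratio definitions
# (mirroring the field names above) are the point here.
active_admin_count = 12
active_editors_count = 480
new_active_editors_count = 60
edit_count = 20_000
admin_edit_count = 1_500  # edits made by active admins (assumed meaning)

admin_capacity = {
    "active_admin_count": active_admin_count,
    "active_admins_active_editors_ratio": active_admin_count / active_editors_count,
    "active_admins_new_active_editors_ratio": active_admin_count / new_active_editors_count,
    "active_admins_edits_ratio": admin_edit_count / edit_count,
}
print(admin_capacity)
```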

Weekly updates:

  • For revision quality scores, a script has been reviewed with @diego, who will run it to retrieve the data.
  • For revert risk, a call will be held next week with @MunizaA to retrieve the corresponding data.
  • A call has been scheduled with WMF's Trust & Safety team that will include a tutorial of the dashboard for the disinformation specialists.

Regarding article quality, you can find the scores for all revisions in all languages from 2020-01-01 until 2022-09-30 here: /user/dsaez/paramita_article_quality/scores_all_v3_from_2020-01-01.parquet (HDFS)

Weekly updates:

image.png (400×752 px, 261 KB)

  • The call with @MunizaA served to specify the requirements for the code that will retrieve data on the revert risk of revisions (the code is expected to be ready within the next week).

Weekly updates:

  • Most of the work has been focused on preparing slides and ideas for next week's call with the Disinformation team to review how the knowledge integrity risk index will provide information for their workflows (including a simplified alternative dashboard design).
  • We have experienced memory failures while collecting data on reversion risk because the merge involves terabytes of data. Therefore, @MunizaA is now running the process in batches and also excluding any parent revisions from before 2015.
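The batching approach can be sketched generically: process the work in fixed-size chunks so each merge fits in memory. This is an illustrative sketch, not the actual code being run:

```python
from typing import Iterator, Sequence

def iter_batches(items: Sequence, batch_size: int) -> Iterator[list]:
    """Yield successive fixed-size slices so a large join/scoring job
    runs in bounded memory instead of merging everything at once."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

# e.g., score revisions month by month, two months per batch
months = ["2022-01", "2022-02", "2022-03", "2022-04", "2022-05"]
batches = list(iter_batches(months, 2))
print(batches)  # [['2022-01', '2022-02'], ['2022-03', '2022-04'], ['2022-05']]
```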

Weekly updates:

  • @MunizaA solved the memory issues (thanks!) and built a dataset with all revisions from all Wikipedias in 2022, including their reversion risk scores calculated with the new ML model.
  • A preliminary analysis has been performed, including the distribution of scores for reverted and non-reverted revisions, in order to approximate a data-driven definition of a high-risk revision.

image.png (489×1 px, 68 KB)

  • Other findings related to missing data or class imbalances will be shared and discussed with the ML model's team to better understand how to construct metrics for the risk observatory, in particular for Wikipedia editions with atypical patterns of revert activity.
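One simple way to turn the two score distributions into a data-driven "high risk" cutoff is to pick the threshold that maximizes Youden's J statistic (true-positive rate minus false-positive rate). This is just one candidate definition, and the scores below are invented for illustration:

```python
# Invented score samples for the two groups; the real distributions come
# from the 2022 revert-risk dataset described above.
reverted_scores     = [0.9, 0.8, 0.85, 0.7, 0.65]
non_reverted_scores = [0.1, 0.2, 0.15, 0.3, 0.6]

def best_threshold(pos: list, neg: list) -> float:
    """Pick the cutoff maximizing Youden's J = TPR - FPR, treating
    reverted revisions as positives."""
    def j(t: float) -> float:
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        return tpr - fpr
    # evaluate J at every observed score and keep the best one
    return max(sorted(set(pos + neg)), key=j)

threshold = best_threshold(reverted_scores, non_reverted_scores)
print(threshold)  # 0.65 for this toy sample
```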

Wow this is amazing @Pablo and @MunizaA, thanks for sharing!