
Data collection for the Knowledge Integrity Risk Composite Index
Closed, ResolvedPublic

Description

Collection of data for prototyping metrics of the knowledge integrity risks proposed in T316946

Event Timeline

Weekly updates:

  • We have decided to schedule calls with:
    • @diego and collaborators to explore how to integrate data on reverts and data on article quality.
    • Global Data & Insights team to exchange views in the context of knowledge integrity metrics.

Hi @MunizaA! @diego and I have been discussing adding data on vandalism and reverts to the risk observatory.

For the moment we have longitudinal graphs (monthly granularity) on the reverts ratio and the IP edits reverts ratio (link to graphs). Data is stored in the riskobservatory.monthly_wiki_stats Hive table (link to table) with the following columns:

  • month VARCHAR
  • wiki_db VARCHAR
  • edit_count BIGINT
  • editors_count BIGINT
  • special_editor_edits_ratio FLOAT
  • bot_edits_ratio FLOAT
  • anonymous_edits_ratio FLOAT
  • minor_edits_ratio FLOAT
  • reverts_ratio FLOAT
  • page_namespace_is_content_ratio FLOAT
  • user_seconds_since_previous_revision_avg FLOAT
  • reverts_anonymous_edits_ratio FLOAT
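For illustration, the monthly ratios in this schema could be derived from a revision-level table roughly as follows. This is a hypothetical pandas sketch: the input column names (is_reverted, is_anonymous, is_bot) are assumptions for the example, not the actual source schema.

```python
import pandas as pd

# Hypothetical revision-level input; the real pipeline reads from Hive,
# and these flag columns are illustrative assumptions.
revisions = pd.DataFrame({
    "month":        ["2022-01"] * 4 + ["2022-02"] * 2,
    "wiki_db":      ["enwiki"] * 6,
    "is_reverted":  [False, True, False, True, False, False],
    "is_anonymous": [True, False, True, False, False, True],
    "is_bot":       [False, False, False, False, True, False],
})

def monthly_wiki_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per (month, wiki_db), mirroring a few of the columns
    of riskobservatory.monthly_wiki_stats."""
    return (
        df.groupby(["month", "wiki_db"])
          .agg(
              edit_count=("is_reverted", "size"),
              reverts_ratio=("is_reverted", "mean"),
              bot_edits_ratio=("is_bot", "mean"),
              anonymous_edits_ratio=("is_anonymous", "mean"),
          )
          .reset_index()
    )

stats = monthly_wiki_stats(revisions)
print(stats)
```

The mean of a boolean column directly gives each ratio, which keeps the aggregation a single named-aggregation call per field.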

Based on your work at T314384, we would love to incorporate new fields like:

  • vandalism_count
  • vandalism_ratio
  • vandalism_reverts_ratio
  • seconds_to_revert_vandalism_avg

Your feedback would be highly appreciated, so thanks in advance for your interest, and we are happy to brainstorm together on this :)
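A hedged sketch of how these proposed fields might be derived, assuming a revision table that carries a vandalism flag and revert timestamps. All input column names here are hypothetical (and, as noted later in the thread, the model actually outputs a revert probability rather than a vandalism label):

```python
import pandas as pd

# Invented sample; is_vandalism, is_reverted, rev_ts and revert_ts are
# hypothetical columns used only to illustrate the field definitions.
revs = pd.DataFrame({
    "month": ["2022-01"] * 5,
    "wiki_db": ["enwiki"] * 5,
    "is_vandalism": [True, False, True, False, False],
    "is_reverted":  [True, False, True, True, False],
    "rev_ts": pd.to_datetime(["2022-01-01 10:00", "2022-01-02 11:00",
                              "2022-01-03 12:00", "2022-01-04 13:00",
                              "2022-01-05 14:00"]),
    "revert_ts": pd.to_datetime(["2022-01-01 10:05", None,
                                 "2022-01-03 12:20", "2022-01-04 13:10",
                                 None]),
})

revs["seconds_to_revert"] = (revs.revert_ts - revs.rev_ts).dt.total_seconds()
vandal = revs[revs.is_vandalism]

fields = {
    "vandalism_count": int(len(vandal)),
    "vandalism_ratio": len(vandal) / len(revs),
    # share of all reverts that undo vandalism (one possible reading)
    "vandalism_reverts_ratio": vandal.is_reverted.sum() / revs.is_reverted.sum(),
    "seconds_to_revert_vandalism_avg": vandal.seconds_to_revert.mean(),
}
print(fields)
```

Note that vandalism_reverts_ratio is interpreted here as "reverts of vandalism over all reverts"; the exact definition would need to be agreed on.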

Just to clarify: what we have is a "revert probability"; we can't claim this is "vandalism". Unlike the previous model, we have just one single score.
You might be interested in collecting abuse filter information. I have some code to do that, and from there you might be able to compute something like "abuse filter hits".

I see your point, thanks @diego! Out of curiosity, are you training the models with (a sample of) all reverts or have you filtered the data to specific reverts (e.g., vandalism and other known forms of abuse)?

Actually, we are already computing AbuseFilter metrics, but the reliability of this metric is questionable for two reasons: (a) false positives when rules are not well defined, and (b) projects differ widely in how extensively they use AbuseFilter (this is the major reason).

Weekly updates:

  • Call with a Global Data & Insights team member, who provided several data resources to review.
  • Data on article quality will need to be generated (existing values were computed only at the page level, not per revision).
  • Data on admin capacity (e.g., active_admin_count, active_admins_active_editors_ratio, active_admins_new_active_editors_ratio, active_admins_edits_ratio) has been computed for every month since 2015-01.
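As a minimal illustration, the admin-capacity fields named above reduce to simple divisions of monthly counts. The numbers are made up, and interpreting active_admins_edits_ratio as "share of edits made by active admins" is an assumption:

```python
# Made-up monthly counts for one wiki; only the ratio definitions
# (mirroring the field names above) are the point here.
active_admin_count = 12
active_editors_count = 480
new_active_editors_count = 60
edit_count = 20_000
admin_edit_count = 1_500  # edits made by active admins (assumed meaning)

admin_capacity = {
    "active_admin_count": active_admin_count,
    "active_admins_active_editors_ratio": active_admin_count / active_editors_count,
    "active_admins_new_active_editors_ratio": active_admin_count / new_active_editors_count,
    "active_admins_edits_ratio": admin_edit_count / edit_count,
}
print(admin_capacity)
```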

Weekly updates:

  • For revision quality scores, a script has been reviewed with @diego, who will run it to retrieve the data.
  • For revert risk, a call will be held next week with @MunizaA to retrieve the corresponding data.
  • A call has been scheduled with WMF's Trust & Safety team that will include a tutorial of the dashboard for the disinformation specialists.

Regarding article quality, you can find the scores for all revisions in all languages from 2020-01-01 until 2022-09-30 here: /user/dsaez/paramita_article_quality/scores_all_v3_from_2020-01-01.parquet (HDFS)

Weekly updates:

image.png (400×752 px, 261 KB)

  • The call with @MunizaA served to specify the requirements for the code that will retrieve data on the revert risk of revisions (the code is expected to be ready within the next week).

Weekly updates:

  • Most of the work has been focused on preparing slides and ideas for next week's call with the Disinformation team to review how the knowledge integrity risk index will provide information for their workflows (including a simplified alternative dashboard design).
  • We have experienced memory failures while collecting data on reversion risk because the merge involves terabytes of data. Therefore, @MunizaA is now running the process in batches and also excluding any parent revisions from before 2015.
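The batching approach can be sketched generically: process the work in fixed-size chunks so each merge fits in memory. This is an illustrative sketch, not the actual code being run:

```python
from typing import Iterator, Sequence

def iter_batches(items: Sequence, batch_size: int) -> Iterator[list]:
    """Yield successive fixed-size slices so a large join/scoring job
    runs in bounded memory instead of merging everything at once."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

# e.g., score revisions month by month, two months per batch
months = ["2022-01", "2022-02", "2022-03", "2022-04", "2022-05"]
batches = list(iter_batches(months, 2))
print(batches)  # [['2022-01', '2022-02'], ['2022-03', '2022-04'], ['2022-05']]
```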

Weekly updates:

  • @MunizaA solved the memory issues (thanks!) and built a dataset with all revisions from all Wikipedias in 2022, including their reversion risk scores calculated with the new ML model.
  • A preliminary analysis has been performed, including the distribution of scores for reverted and non-reverted revisions, in order to approximate a data-driven definition of a high-risk revision.

image.png (489×1 px, 68 KB)

  • Other findings related to missing data or class imbalances will be shared and discussed with the ML model's team to better understand how to construct metrics for the risk observatory, in particular for Wikipedia editions with atypical patterns of revert activity.
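One simple way to turn the two score distributions into a data-driven "high risk" cutoff is to pick the threshold that maximizes Youden's J statistic (true-positive rate minus false-positive rate). This is just one candidate definition, and the scores below are invented for illustration:

```python
# Invented score samples for the two groups; the real distributions come
# from the 2022 revert-risk dataset described above.
reverted_scores     = [0.9, 0.8, 0.85, 0.7, 0.65]
non_reverted_scores = [0.1, 0.2, 0.15, 0.3, 0.6]

def best_threshold(pos: list, neg: list) -> float:
    """Pick the cutoff maximizing Youden's J = TPR - FPR, treating
    reverted revisions as positives."""
    def j(t: float) -> float:
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        return tpr - fpr
    # evaluate J at every observed score and keep the best one
    return max(sorted(set(pos + neg)), key=j)

threshold = best_threshold(reverted_scores, non_reverted_scores)
print(threshold)  # 0.65 for this toy sample
```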

Wow this is amazing @Pablo and @MunizaA, thanks for sharing!