
Scheduled risk observatory pipeline
Closed, Resolved, Public

Description

  • Periodically ingest data into the risk observatory, a Superset dashboard that assists the T&S Disinformation team: https://superset.wikimedia.org/superset/dashboard/riskobservatory2
  • For now, we have ingested historical data with notebooks that:
    • retrieve revisions and compute revert risk scores through the language-agnostic revert risk model (stat1007:/user/paragon/lab/workspaces/auto-i/tree/riskindex/content-reverts-0-data-collection-20212022.ipynb), derived from Muniza’s notebook https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/mnz/examples/examples/notebooks/revertrisk_example.ipynb
    • compute metrics at a wiki level and store data in Hive stat1007:/user/paragon/lab/workspaces/auto-i/tree/riskindex/content-reverts-2-highrisk-thresholds-v2-20212022.ipynb
      • Output: table riskobservatory.reverts
    • compute metrics at a page/editors/revision level and store data in Hive stat1007:/user/paragon/lab/workspaces/auto-i/tree/riskindex/content-reverts-3-highrisk-editorspages-20212022.ipynb
      • Output: tables riskobservatory.highrisk_pages, riskobservatory.highrisk_editors, riskobservatory.highrisk_revision
  • The criterion for identifying high-risk revisions is: any revision by a non-IP, non-robot editor with a revert risk score above the threshold at which the model's accuracy is maximal for that wiki.
    • Threshold metrics were already computed (stat1007:/user/paragon/lab/tree/riskindex/data/reverts_2021/), but I am currently recomputing these values because some wikis were missing from the original notebook (e.g. ukwiki).
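
The high-risk criterion above can be sketched in plain Python. This is only an illustration, not the notebooks' actual code: the field names (`wiki`, `score`, `was_reverted`, `is_anonymous`, `is_bot`), the helper names, and the 0.01-step threshold grid are all assumptions, as is measuring accuracy against whether a revision was actually reverted.

```python
from collections import defaultdict


def accuracy_maximizing_threshold(scores, reverted, grid_size=100):
    """Return the threshold t on an assumed 0..1 grid that maximizes the
    accuracy of the rule "predict reverted when score > t".

    Ties are resolved in favor of the lowest threshold (an arbitrary choice).
    """
    candidates = [i / grid_size for i in range(grid_size + 1)]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((s > t) == r for s, r in zip(scores, reverted))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def thresholds_by_wiki(revisions):
    """Compute one accuracy-maximizing threshold per wiki."""
    grouped = defaultdict(list)
    for rev in revisions:
        grouped[rev["wiki"]].append(rev)
    return {
        wiki: accuracy_maximizing_threshold(
            [r["score"] for r in revs],
            [r["was_reverted"] for r in revs],
        )
        for wiki, revs in grouped.items()
    }


def high_risk_revisions(revisions, thresholds):
    """Keep revisions by non-IP, non-bot editors whose score is above
    the threshold of their wiki."""
    return [
        rev
        for rev in revisions
        if not rev["is_anonymous"]
        and not rev["is_bot"]
        and rev["score"] > thresholds[rev["wiki"]]
    ]
```

In a scheduled pipeline the thresholds would presumably be computed once on historical data and then reused to filter each new batch of scored revisions, rather than recomputed on every run.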

Prioritization: The election calendar of the T&S Disinformation team is as follows:

  • Q1 - Myanmar
  • Q2 - Poland, US Gubernatorial
  • Q3 - Russia
  • Q4 - EU parliament, India

Having this process ready for the elections in Q2-Q4 would be ideal.

Details

Other Assignee
fkaelin

Event Timeline

The T&S/Disinformation team have submitted the request to prioritize this task (T345345).

@Pablo can this ticket be closed as well, as the work was tracked with T341777?