- Periodically ingest data into the risk observatory, the Superset-powered dashboard that assists the T&S Disinformation team: https://superset.wikimedia.org/superset/dashboard/riskobservatory2
- For now, we have ingested historical data with notebooks that:
- retrieve revisions and compute revert risk scores with the language-agnostic revert risk model: stat1007:/user/paragon/lab/workspaces/auto-i/tree/riskindex/content-reverts-0-data-collection-20212022.ipynb (derived from Muniza’s notebook: https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/mnz/examples/examples/notebooks/revertrisk_example.ipynb)
- compute metrics at the wiki level and store the data in Hive: stat1007:/user/paragon/lab/workspaces/auto-i/tree/riskindex/content-reverts-2-highrisk-thresholds-v2-20212022.ipynb
- Output: table riskobservatory.reverts
- compute metrics at the page, editor, and revision level and store the data in Hive: stat1007:/user/paragon/lab/workspaces/auto-i/tree/riskindex/content-reverts-3-highrisk-editorspages-20212022.ipynb
- Output: tables riskobservatory.highrisk_pages, riskobservatory.highrisk_editors, riskobservatory.highrisk_revision
- The criterion for identifying high-risk revisions is: any revision by a non-IP, non-bot editor with a revert risk score above the threshold at which the model's accuracy is maximal for that wiki.
- Threshold metrics were already computed (stat1007:/user/paragon/lab/tree/riskindex/data/reverts_2021/), but I am currently recomputing these values because some wikis were missing from the original notebook (e.g. ukwiki).
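
The high-risk criterion above can be sketched roughly as follows. This is only an illustration, not the code used in the notebooks: the function names and the revision fields (`wiki`, `score`, `is_anonymous`, `is_bot`) are hypothetical, and it assumes the candidate thresholds are 0.0 plus the observed scores, with a revision predicted as reverted when its score is strictly above the threshold.

```python
from typing import Dict, Mapping, Sequence


def best_accuracy_threshold(scores: Sequence[float], reverted: Sequence[bool]) -> float:
    """Return the threshold that maximizes accuracy on labelled revisions.

    A revision is predicted as reverted when its score is strictly above
    the threshold. Ties are resolved in favour of the lowest threshold.
    """
    if not scores or len(scores) != len(reverted):
        raise ValueError("scores and reverted must be non-empty and of equal length")

    # Assumption: candidates are 0.0 plus every distinct observed score.
    candidates = [0.0] + sorted(set(scores))
    best_threshold, best_accuracy = candidates[0], -1.0
    for threshold in candidates:
        correct = sum((s > threshold) == r for s, r in zip(scores, reverted))
        accuracy = correct / len(scores)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold


def is_high_risk(revision: Mapping[str, object], thresholds: Dict[str, float]) -> bool:
    """Apply the criterion: non-IP, non-bot editor, score above the wiki threshold.

    Raises KeyError if the wiki has no threshold, so that a missing wiki
    is noticed instead of being silently skipped.
    """
    threshold = thresholds[str(revision["wiki"])]
    if revision["is_anonymous"] or revision["is_bot"]:
        return False
    return float(revision["score"]) > threshold


# Toy data for one wiki (made-up values, for illustration only).
toy_scores = [0.1, 0.4, 0.6, 0.9]
toy_reverted = [False, False, True, True]
thresholds = {"ukwiki": best_accuracy_threshold(toy_scores, toy_reverted)}

example = {"wiki": "ukwiki", "score": 0.6, "is_anonymous": False, "is_bot": False}
print(thresholds, is_high_risk(example, thresholds))
```

Raising on a missing wiki, rather than returning `False`, is a deliberate choice in this sketch: it would make gaps such as the missing ukwiki thresholds visible during ingestion.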
Prioritization: The election calendar of the T&S Disinformation team is as follows:
- Q1 - Myanmar
- Q2 - Poland, US Gubernatorial
- Q3 - Russia
- Q4 - EU Parliament, India
Having this process ready for the Q2-Q4 elections would be ideal.