Implement an automatic data collection process to provide metric values for the last months in the new prototype of the Knowledge Integrity Risk Observatory (see T316946)
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | Pablo | T288337 [EPIC] Wikipedia Knowledge Integrity Risk Observatory |
| Resolved | | Pablo | T341777 Automate the data collection process |
Event Timeline
Weekly updates
- A 1:1 with @fkaelin has been scheduled for next week to discuss this task.
This task needs more information before we can triage it. Pablo will ask the T&S/Disinformation team to submit a request related to it. We will review this task in light of the details of that request and determine whether it should be prioritized now. This will also affect the triaging of a similar task (T343065). cc @fkaelin.
Weekly updates
- The T&S/Disinformation team has been informed that they have to submit a request for this task to be prioritized.
Weekly updates
- This was discussed in the Disinformation Working Group Call and the T&S/Disinformation team have submitted the request to prioritize this task (T345345).
Weekly updates
- Data and code have been uploaded to the GitLab repository https://gitlab.wikimedia.org/paragon/knowledge-integrity-risk-index
Weekly updates
- @fkaelin, Nicholas and I met to review their roadmap and to address questions resulting from their archeology process.
Weekly updates
- @fkaelin and I in our 1:1 clarified the remaining open issue from Nicholas's code investigation about self-reverts.
Weekly updates
- @fkaelin and I had our 1:1, where he explained the reasons for the delay with this task. The causes are reasonable, and the process is still expected to be completed during Q2.
Weekly updates
- Following @fkaelin's recommendation, Nicholas will share a notebook next Monday so that I can verify the output of the datasets generated by the existing steps and fix any issues (e.g., missing columns). The notebook is expected to contain revisions from an illustrative wiki (e.g., frwiki).
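For illustration, the kind of check described above (looking for missing columns in a generated dataset) could be sketched as follows. This is a minimal sketch, not the notebook's actual code: the column names in `EXPECTED_COLUMNS` and the sample rows are hypothetical, since the real schema is not given in this task.

```python
from typing import Dict, Iterable, Mapping, Set

# Hypothetical schema: these column names are assumptions for illustration only.
EXPECTED_COLUMNS: Set[str] = {
    "wiki_db",
    "revision_id",
    "revision_timestamp",
    "revert_risk_score",
}


def missing_columns(
    rows: Iterable[Mapping[str, object]],
    expected: Set[str] = EXPECTED_COLUMNS,
) -> Dict[str, int]:
    """Count, per expected column, the rows where it is absent or None.

    Only columns with at least one problematic row are returned.
    """
    counts = {col: 0 for col in expected}
    for row in rows:
        for col in expected:
            if row.get(col) is None:
                counts[col] += 1
    return {col: n for col, n in counts.items() if n > 0}


# Made-up sample rows for an illustrative wiki (frwiki).
sample = [
    {
        "wiki_db": "frwiki",
        "revision_id": 1,
        "revision_timestamp": "2023-10-01T00:00:00Z",
        "revert_risk_score": 0.12,
    },
    {
        "wiki_db": "frwiki",
        "revision_id": 2,
        "revision_timestamp": "2023-10-01T00:05:00Z",
        # "revert_risk_score" is deliberately missing here
    },
]

print(missing_columns(sample))
```

An empty result means every row carries every expected column; any non-empty result points at the columns to investigate.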
Weekly updates
- The notebook shared by @nickifeajika is currently under review (bugs and other issues are being discussed and addressed in the Slack channel on research engineering).
Weekly updates
- Due to internal issues, there will be changes in the allocation of research engineering resources. As a consequence, I am already coordinating with @fkaelin to minimize the delay of this deliverable.
Weekly updates
- @fkaelin created a replica of the dashboard (link). The dashboard exposes data from a new Hive database, risk_observatory, that is used by the Airflow DAG (code). The tables for the dashboard are identical to those in the original riskobservatory database, but contain revert risk predictions for all revisions from 2019-01 to 2023-10.
- We had a 1:1 this week to discuss open questions (e.g. threshold re-computation) and I started reviewing code/data.
Weekly updates
- No issues were found while exploring the data, except for the current limit to the 48 major wikis (the remaining languages will be added in the production run).
- I had a meeting with the Trust & Safety Disinformation specialist, who reviewed the new dashboard. He will explore it with the rest of the team and, if possible, share feedback in the Disinformation Working Group Call next week.
Weekly updates
- Data from 2023-12 is now available in the risk observatory (thanks again @fkaelin).
@fkaelin, since you already managed to automate the data collection process, are you OK with me closing this ticket?
@Pablo thanks for flagging. There was indeed an issue with the wikidiff table: it is an external Hive table, and although the required data was on HDFS and triggered the risk observatory DAG, the Hive table itself was not being updated correctly, so no data was ingested. This is now fixed, and the dashboard shows data until Feb 24.
I agree that this phab can be closed now.
@fkaelin thank you for the quick fix! I will therefore proceed to close this ticket as resolved.
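As an aside on the stale-table issue discussed above: a simple freshness check on the latest month available in the dashboard tables is one way such a gap might be noticed early. The sketch below is an assumption-laden illustration, not part of the actual DAG: the function names, the `YYYY-MM` month format, and the two-month tolerance are all hypothetical.

```python
from datetime import date


def months_behind(latest_month: str, today: date) -> int:
    """Whole calendar months between latest_month ('YYYY-MM') and today's month."""
    year, month = (int(part) for part in latest_month.split("-"))
    return (today.year - year) * 12 + (today.month - month)


def is_stale(latest_month: str, today: date, max_lag_months: int = 2) -> bool:
    """Flag the data as stale when it lags more than max_lag_months (assumed tolerance)."""
    return months_behind(latest_month, today) > max_lag_months


# Made-up dates for illustration only.
check_date = date(2024, 3, 15)
print(is_stale("2023-12", check_date))  # lag of 3 months
print(is_stale("2024-02", check_date))  # lag of 1 month
```

In practice the latest month would be read from the table itself (for example with a `MAX` query over its month column), and the tolerance would depend on how often the pipeline is expected to run.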