
Fix risk observatory dashboard
Closed, Resolved, Public

Description

The risk observatory dashboard is no longer updating because the Airflow DAG that produces its datasets has stopped producing output.

Problem: the risk observatory pipeline batches the revert risk inference based on the size of the input data. The partitioning of that dataset has changed, which broke the Airflow DAG.

Solution: the batching mechanism predates the content diff dataset, which makes the risk observatory pipeline much cheaper to run. The batching is very likely no longer needed, and the monthly DAG can be run as a single job. This needs to be validated.
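To make the proposed change concrete, here is a minimal sketch of the two job layouts. `batch_by_size` and `score_all` are illustrative stand-ins, not the actual pipeline code, and the revision data and scoring function are toy placeholders:

```python
def batch_by_size(rows, max_rows):
    """Current approach (illustrative): split the input into chunks of at
    most max_rows so each inference job stays within resource limits."""
    for i in range(0, len(rows), max_rows):
        yield rows[i:i + max_rows]

def score_all(rows, score_fn):
    """Proposed approach: with the cheaper content-diff dataset, score a
    whole month of revisions in one job and skip the batching entirely."""
    return [score_fn(r) for r in rows]

# Both paths should yield identical scores; only the job layout changes,
# so validating the single-job run is mostly a resource/runtime question.
revisions = list(range(10))
batched = [s for chunk in batch_by_size(revisions, 4) for s in map(float, chunk)]
single = score_all(revisions, float)
assert batched == single
```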

Event Timeline

Thanks @fkaelin :)

Beyond the dashboard (which is sometimes used by the T&S Disinformation team), the risk observatory data has been used on multiple occasions, e.g., to calibrate the default thresholds of Automoderator (T358128) or to enrich the patrolling dataset (T392210).

On the one hand, I am happy that the proposed solution is cheaper to run. On the other hand, I understand that the time between when a revision is made and when it is processed is now shorter, no longer delayed until the next month. As a consequence, the fields related to the reverted status of the revision (including the time to get reverted) might be less reliable.

Even if the input dataset is no longer monthly-based, would it be possible to wait a month (approx.) before computing the statistics of a revision?

This is fixed.

@Pablo once/when incremental data sources are available, this pipeline can be migrated to use them instead. In general we prefer to "outsource" the logic of how these fields are computed. E.g. for a revert column, this is currently "solved" by just exporting once a month, which is good enough - even though edits on the last day of the export are not "settled". For an incremental mediawiki history dataset this will no longer suffice - but I expect these considerations to be baked into the dataset. For example, there could be an Airflow "sensor" that triggers a dataset run after we expect 99% of revisions to have been reverted (based on historical stats).
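The sensor idea could be sketched roughly as follows. `revert_cdf` is a hypothetical historical distribution of time-to-revert, and the 99% threshold comes from the comment above; none of this is existing pipeline code, just what a sensor's poke logic might check:

```python
def revert_coverage(revision_ages_days, revert_cdf):
    """Expected fraction of eventual reverts already observed, given each
    revision's age in days and the historical CDF of time-to-revert."""
    if not revision_ages_days:
        return 1.0
    return sum(revert_cdf(age) for age in revision_ages_days) / len(revision_ages_days)

def sensor_poke(revision_ages_days, revert_cdf, threshold=0.99):
    """What an Airflow sensor's poke() could evaluate before triggering."""
    return revert_coverage(revision_ages_days, revert_cdf) >= threshold

# Toy CDF: assume 90% of reverts happen within a day, the rest within a week.
toy_cdf = lambda days: 1.0 if days >= 7 else (0.9 if days >= 1 else 0.0)

assert sensor_poke([7, 8, 30], toy_cdf)      # all revisions old enough
assert not sensor_poke([0, 0, 30], toy_cdf)  # fresh edits drag coverage down
```

In a real DAG this logic would live in a subclass of Airflow's `BaseSensorOperator`, with the CDF estimated from historical revert data rather than hard-coded.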

Similarly, the revert risk scores themselves should not be computed as part of the risk observatory pipeline; they should be their own dataset (see Better access paths for LiftWing data) that the risk observatory pipeline could consume (and that others could too, e.g. for setting thresholds for Automoderator).