For baseline measurement, an ETL pipeline is required to calculate "Pageviews Received to Potentially Vandalised Content" every month.
- The initial baseline analysis is available at: https://nbviewer.org/urls/gitlab.wikimedia.org/kcvelaga/automoderator-measurement/-/raw/main/baselines/T348861_vandalism_pageviews.ipynb/%3Fref_type%3Dheads#Summary
- The query has to live under https://gitlab.wikimedia.org/repos/product-analytics/data-pipelines
- Create a sub-folder for automoderator there and place the query files in it.
- Then add a DAG under https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/analytics_product/dags?ref_type=heads (in an automoderator subfolder).
The suggested destination table schema is:
- month
- wiki
- platform (desktop/mobile)
- number of potentially vandalized revisions
- total pageviews to articles
- total pageviews to potentially vandalized revisions
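
To make the schema above concrete, here is a minimal sketch that renders it as a `CREATE TABLE` statement. The column names, types, and comments are assumptions (the task only lists the fields in prose), and `build_create_table` is a hypothetical helper, not part of any existing repository.

```python
# Hypothetical column names and types: the task describes the fields in prose only.
COLUMNS = [
    ("month", "STRING", "Month the row covers"),
    ("wiki", "STRING", "Wiki identifier"),
    ("platform", "STRING", "desktop or mobile"),
    ("potentially_vandalized_revisions", "BIGINT", "Number of potentially vandalized revisions"),
    ("total_article_pageviews", "BIGINT", "Total pageviews to articles"),
    ("potentially_vandalized_pageviews", "BIGINT", "Total pageviews to potentially vandalized revisions"),
]


def build_create_table(table_name, columns=COLUMNS):
    """Render a CREATE TABLE statement for the given (name, type, comment) triples."""
    column_lines = ",\n".join(
        f"    {name} {dtype} COMMENT '{comment}'" for name, dtype, comment in columns
    )
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n{column_lines}\n)"


# Table name taken from the task description.
ddl = build_create_table("wmf_product.am_potential_vandal_pageviews_monthly")
print(ddl)
```

The storage format and any partitioning are deliberately left out, since the task does not specify them.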
Other notes:
- The pipeline should run monthly and depends on wmf.mediawiki_history.
- The operational definition of potential vandalism is given in the baseline analysis notebook.
- The target table is wmf_product.am_potential_vandal_pageviews_monthly (it has to be created).
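
Since the pipeline runs monthly, each run needs to know which month it covers. The sketch below assumes a run processes the most recent complete calendar month and labels it as `YYYY-MM`; both the convention and the `month_to_process` helper are assumptions for illustration, not something the task specifies.

```python
from datetime import date


def month_to_process(run_date: date) -> str:
    """Return the calendar month before run_date, formatted as 'YYYY-MM'."""
    year, month = run_date.year, run_date.month
    if month == 1:
        # January rolls back to December of the previous year.
        year, month = year - 1, 12
    else:
        month -= 1
    return f"{year:04d}-{month:02d}"


print(month_to_process(date(2024, 1, 5)))  # 2023-12
```

In an actual DAG the run date would come from the scheduler rather than being passed by hand.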