The current SQL queries are brittle and must be updated for each new MediaWiki and PHP release. This process is error-prone, and, if the queries are not updated before the new values start to appear in the data, the previously generated data must be backfilled, which is also error-prone. See T413349#11543337.
Description
Related Objects
Event Timeline
Restricted Application added a subscriber: Aklapper. 2026-01-22 16:17 (UTC)
Ahoelzl moved this task from Incoming (new tickets) to Backlog on the Data-Engineering board. 2026-01-22 17:15 (UTC)
Copying the issues found on the original pipeline here, for completeness:
In T413349#11543337, @xcollazo wrote: We ran into a couple of issues trying to backfill:
- The SQL UNION ALLs existing data with the newly calculated data. This means the SQL expects the shape of previously generated TSV files to include a column for each *future* category. For example, the MediaWiki versions each get their own column: 1.45, 1.46, etc. This makes backfills difficult, since the TSV files themselves require manual fixes to add the missing columns.
This is error-prone, but I did it manually anyway for php.tsv, version_simple.tsv, and all the versions under the php_drilldown folder, then ran the following to replace the production TSV files:
kerberos-run-command hdfs hdfs dfs -rm /wmf/data/published/datasets/periodic/reports/metrics/pingback/php.tsv
kerberos-run-command hdfs hdfs dfs -rm /wmf/data/published/datasets/periodic/reports/metrics/pingback/version_simple.tsv
kerberos-run-command hdfs hdfs dfs -rm -r /wmf/data/published/datasets/periodic/reports/metrics/pingback/php_drilldown
kerberos-run-command hdfs hdfs dfs -cp /user/xcollazo/artifacts/fix_pingback/php.tsv /wmf/data/published/datasets/periodic/reports/metrics/pingback/php.tsv
kerberos-run-command hdfs hdfs dfs -cp /user/xcollazo/artifacts/fix_pingback/version_simple.tsv /wmf/data/published/datasets/periodic/reports/metrics/pingback/version_simple.tsv
kerberos-run-command hdfs hdfs dfs -cp /user/xcollazo/artifacts/fix_pingback/php_drilldown /wmf/data/published/datasets/periodic/reports/metrics/pingback/php_drilldown
kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/published/datasets/periodic/reports/metrics/pingback/php.tsv
kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/published/datasets/periodic/reports/metrics/pingback/version_simple.tsv
kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/published/datasets/periodic/reports/metrics/pingback/php_drilldown
kerberos-run-command hdfs hdfs dfs -chmod -R 644 /wmf/data/published/datasets/periodic/reports/metrics/pingback/php.tsv
kerberos-run-command hdfs hdfs dfs -chmod -R 644 /wmf/data/published/datasets/periodic/reports/metrics/pingback/version_simple.tsv
kerberos-run-command hdfs hdfs dfs -chmod -R 644 /wmf/data/published/datasets/periodic/reports/metrics/pingback/php_drilldown
kerberos-run-command hdfs hdfs dfs -chmod 744 /wmf/data/published/datasets/periodic/reports/metrics/pingback/php_drilldown
- Additionally, the calculation steps all write to central TSV files, creating race conditions when backfilling multiple dates at once.
To work around this, we set max_active_runs=1 on the DAG.
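That workaround looks roughly like the following in an Airflow DAG definition; the dag_id, schedule, and start date here are placeholders, not the real pipeline's values:

```python
# Sketch only: serializing DAG runs so concurrent backfill runs
# cannot race on the shared TSV outputs.
from airflow import DAG
import pendulum

with DAG(
    dag_id="pingback_metrics",  # hypothetical id
    schedule="@daily",
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    catchup=True,
    max_active_runs=1,  # only one run at a time, so writes don't interleave
) as dag:
    ...
```

This trades backfill throughput for safety: with max_active_runs=1, a multi-date backfill proceeds one run at a time.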
These fixes are OK for now, but we should really make this pipeline idempotent, so that we can rerun it as needed. A discussion about this can be found on Slack.
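One common way to get that idempotency is to replace the shared, append-style TSVs with date-partitioned outputs, where each run atomically (re)writes only its own partition. A minimal sketch of the idea, with illustrative names and a local-filesystem stand-in for HDFS:

```python
import csv
import os
import tempfile

def write_partition(base_dir, date, header, rows):
    """Idempotent write: each run owns one date-partitioned file and
    replaces it via an atomic rename, so reruns and parallel backfills
    cannot corrupt a shared output."""
    part_dir = os.path.join(base_dir, f"date={date}")
    os.makedirs(part_dir, exist_ok=True)
    final = os.path.join(part_dir, "data.tsv")
    # Write to a temp file in the same directory, then rename over the
    # target; os.replace is atomic on POSIX filesystems.
    fd, tmp = tempfile.mkstemp(dir=part_dir, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as f:
        w = csv.writer(f, delimiter="\t", lineterminator="\n")
        w.writerow(header)
        w.writerows(rows)
    os.replace(tmp, final)
    return final
```

With per-date partitions there is no UNION ALL against previously published files, so old data never needs reshaping when a new version column appears, and rerunning any date simply overwrites that date's output.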