
Update pingback MediaWiki versions to include new values
Closed, ResolvedPublic

Description

https://pingback.wmcloud.org/#media-wiki-version/media-wiki-version-timeseries only includes MediaWiki versions through MediaWiki 1.44. MediaWiki 1.45 has been released, and MediaWiki 1.46 is in development. The list of versions in the pingback queries should be updated to include those versions and probably a few more so we don't need to do this again too soon. Perhaps a few PHP versions should be added as well. T326825 is an earlier version of this task for reference. Note that this is now apparently done by Airflow, not ReportUpdater.

Once the queries are updated, they will need to be rerun for dates May 2025 and later to pick up the new versions in the pingbacks.

Event Timeline

Change #1222506 had a related patch set uploaded (by Cicalese; author: Cicalese):

[analytics/refinery@master] Update pingback MediaWiki and PHP versions to include new values

https://gerrit.wikimedia.org/r/1222506

I'm not 100% confident that I found all of the places that changes need to be made, but the patches above are ready for review.

I also updated https://meta.wikimedia.org/wiki/Config:Dashiki:Pingback to add MW 1.45 and 1.46. It will need to be edited in the future to add 1.47, 1.48, ...

Once the patches are merged, the weekly queries will need to be re-run starting from the beginning of May 2025.

It looks like some fixtures need to be added to airflow-dags/tests/main/fixtures/spark_skein_specs to make the tests pass. I will leave that to someone who is more familiar with the syntax of those files.

You can see if it has picked up the new versions by visiting https://pingback.wmcloud.org/#media-wiki-version/media-wiki-version-timeseries and clicking on only "other" in the left sidebar. You should not see the line for "other" spike beginning in May 2025.

Change #1222506 merged by Joal:

[analytics/refinery@master] Update pingback MediaWiki and PHP versions to include new values

https://gerrit.wikimedia.org/r/1222506

A_smart_kitten subscribed.

(retagging for visibility as this affects MW reporting, even though the change isn't being made to MW itself)

(Airflow DAG MR has been updated with test fixes, and it is on the release train to be deployed in the next couple days)

We ran into a couple issues trying to backfill:

  • The SQL UNION ALLs existing data with the newly calculated data. This means the SQL expects the shape of previously generated TSV files to include a column for each *future* category: for example, the versions each have their own column for 1.45, 1.46, etc. This makes backfills difficult, since the TSV files themselves require manual fixes to add the missing columns.
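A manual fix like this could in principle be scripted. As a rough sketch only (the column names and the zero-fill default here are assumptions for illustration, not what was actually done):

```python
import csv
import io

def pad_tsv(tsv_text: str, expected_columns: list[str], fill: str = "0") -> str:
    """Add any missing columns (e.g. new MediaWiki versions) to an old TSV,
    filling historical rows with a default value so the shape matches what
    the UNION ALL in the report SQL expects."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, data = rows[0], rows[1:]
    missing = [c for c in expected_columns if c not in header]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header + missing)
    for row in data:
        # Historical rows never saw the new versions, so zero is a safe fill.
        writer.writerow(row + [fill] * len(missing))
    return out.getvalue()
```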

This is error prone, but I did it manually anyway for php.tsv, version_simple.tsv, and all the files under the php_drilldown folder, then ran the following to replace the production TSV files:

kerberos-run-command hdfs hdfs dfs -rm /wmf/data/published/datasets/periodic/reports/metrics/pingback/php.tsv
kerberos-run-command hdfs hdfs dfs -rm /wmf/data/published/datasets/periodic/reports/metrics/pingback/version_simple.tsv
kerberos-run-command hdfs hdfs dfs -rm -r /wmf/data/published/datasets/periodic/reports/metrics/pingback/php_drilldown

kerberos-run-command hdfs hdfs dfs -cp /user/xcollazo/artifacts/fix_pingback/php.tsv /wmf/data/published/datasets/periodic/reports/metrics/pingback/php.tsv
kerberos-run-command hdfs hdfs dfs -cp /user/xcollazo/artifacts/fix_pingback/version_simple.tsv /wmf/data/published/datasets/periodic/reports/metrics/pingback/version_simple.tsv
kerberos-run-command hdfs hdfs dfs -cp /user/xcollazo/artifacts/fix_pingback/php_drilldown /wmf/data/published/datasets/periodic/reports/metrics/pingback/php_drilldown


kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/published/datasets/periodic/reports/metrics/pingback/php.tsv
kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/published/datasets/periodic/reports/metrics/pingback/version_simple.tsv
kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/published/datasets/periodic/reports/metrics/pingback/php_drilldown

kerberos-run-command hdfs hdfs dfs -chmod -R 644 /wmf/data/published/datasets/periodic/reports/metrics/pingback/php.tsv
kerberos-run-command hdfs hdfs dfs -chmod -R 644 /wmf/data/published/datasets/periodic/reports/metrics/pingback/version_simple.tsv
kerberos-run-command hdfs hdfs dfs -chmod -R 644 /wmf/data/published/datasets/periodic/reports/metrics/pingback/php_drilldown

kerberos-run-command hdfs hdfs dfs -chmod 744 /wmf/data/published/datasets/periodic/reports/metrics/pingback/php_drilldown
  • Additionally, the calculation steps all write to central TSV files, creating race conditions when backfilling multiple dates.

To work around this, we set max_active_runs=1 on the DAG.
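For reference, that is a one-line change on the DAG definition. A minimal sketch, assuming a standard Airflow DAG declaration (the schedule and start date here are illustrative, not the actual pingback DAG code):

```python
from datetime import datetime

from airflow import DAG

# max_active_runs=1 forces runs (including backfill runs) to execute one at
# a time, avoiding concurrent writes to the shared TSV outputs.
with DAG(
    dag_id="pingback_report_weekly_v2",
    schedule="@weekly",
    start_date=datetime(2025, 5, 1),
    max_active_runs=1,
) as dag:
    ...
```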

These fixes are ok for now, but we should really make this pipeline idempotent, so that we can rerun it as needed. A discussion around this can be found on Slack.

After fixes, the backfill is running well with:

airflow dags backfill --reset-dagruns --start-date 2025-05-01 --end-date 2026-01-20 pingback_report_weekly_v2

This created weekly backfill runs for dates 2025-05-01 to 2025-06-15, and reset regular scheduled runs from that date onward.

This will continue running for a while, will ping here when it is done.

xcollazo changed the task status from Open to In Progress. Wed, Jan 21, 8:56 PM
xcollazo claimed this task.
xcollazo triaged this task as Medium priority.

@xcollazo Thank you so much for your work on this! I appreciate it!

I'm not a huge fan of the current SQL queries and the need to update them for each new version. This process, too, is potentially error prone, and somebody needs to remember to do it, preferably before the new versions start appearing in the data. It would be great if somebody could create a more future-proof process.
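One possible direction, purely as a sketch and not the current pipeline code: derive the reporting bucket from the raw version string instead of hardcoding a column per release, so new versions roll in automatically and "other" is reserved for genuinely unparseable values:

```python
import re

def version_bucket(raw: str) -> str:
    """Collapse a raw MediaWiki version string (e.g. '1.45.1', '1.46.0-alpha')
    into a major.minor bucket, falling back to 'other' when it can't be parsed."""
    m = re.match(r"(\d+)\.(\d+)", raw or "")
    return f"{m.group(1)}.{m.group(2)}" if m else "other"
```

With a GROUP BY on a bucket like this, there would be no column list to maintain when a new release ships.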


Backfill complete, runs can be seen at https://airflow.wikimedia.org/dags/pingback_report_weekly_v2/grid?num_runs=100.

Data appears correct at /wmf/data/published/datasets/periodic/reports/metrics/pingback.

It looks like the rsync job has already been triggered to publish the data to https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/pingback/.


Agreed @cicalese. Would you mind opening a task to refactor this pipeline? As you know, it always helps when the original task is a direct ask from a community member. I can add context as well.

CC @amastilovic