
Backfill datasets affected by automated traffic detection issues
Closed, Resolved, Public

Description

After the automated traffic detection issues discovered in T395934 and T395727,
and the detection heuristic fixes and Airflow improvements described in T395934 and T402645,
we need to rerun 34 Airflow DAGs (main instance) to backfill the affected datasets from March 21st to August 31st,
so that we correctly tag automated traffic, and eliminate the artifacts that have polluted our pageview metrics and other derived metrics.

Here's a list of the affected DAGs and their backfilling progress:

[ ] TO DO
[>] IN PROGRESS
[X] BACKFILLED
[!] FIX PENDING

# hourly DAGs (backfill speed: 1 month's worth of data in ~3.5 days)
Mar Apr May Jun Jul Aug
[X] [X] [X] [X] [X] [X]  webrequest_actor_metrics_hourly
[X] [X] [X] [X] [X] [X]  webrequest_actor_metrics_rollup_hourly
[X] [X] [X] [X] [X] [X]  webrequest_actor_label_hourly
[X] [X] [X] [X] [X] [X]  pageview_actor_hourly
[X] [X] [X] [X] [X] [X]  pageview_hourly
[X] [X] [X] [X] [X] [X]  projectview_hourly
[X] [X] [X] [X] [X] [X]  projectview_geo
[X] [X] [X] [X] [X] [X]  cassandra_load_pageview_per_project_hourly

# daily DAGs (backfill speed: 1 month's worth of data in ~1 day)
Mar Apr May Jun Jul Aug
[X] [X] [X] [X] [X] [X]  browser_general_daily
[X] [X] [X] [X] [X] [X]  unique_devices_per_domain_daily
[X] [X] [X] [X] [X] [X]  unique_devices_per_project_family_daily
[X] [X] [X] [X] [X] [X]  cassandra_load_pageview_per_article_daily
[X] [X] [X] [X] [X] [X]  cassandra_load_pageview_per_project_daily
[X] [X] [X] [X] [X] [X]  cassandra_load_pageview_top_articles_daily
[X] [X] [X] [X] [X] [X]  cassandra_load_pageview_top_per_country_daily
[X] [X] [X] [X] [X] [X]  cassandra_load_unique_devices_daily
[X] [X] [X] [X] [X] [X]  interlanguage_daily
[X] [X] [X] [X] [X] [X]  dump_day_of_hourly_pageviews
[X] [X] [X] [X] [X] [X]  referrer_daily
                [X] [X]  druid_load_pageviews_hourly_aggregated_daily (we only keep the last 3 months of data)

# weekly DAGs (backfill speed: 1 month's worth of data in ~3 hours)
Mar Apr May Jun Jul Aug
[X] [X] [X] [X] [X] [X]  browser_metrics_weekly

# monthly DAGs (backfill speed: 1 month's worth of data in ~1 hour)
Mar Apr May Jun Jul Aug
[X] [X] [X] [X] [X] [X]  cassandra_load_pageview_per_project_monthly
[X] [X] [X] [X] [X] [X]  cassandra_load_pageview_top_articles_monthly
[X] [X] [X] [X] [X] [X]  cassandra_load_pageview_top_by_country_monthly
        [X] [X] [X] [X]  cassandra_load_unique_devices_monthly
    [X] [X] [X] [X] [X]  clickstream_monthly
        [X] [X] [X] [X]  unique_devices_per_domain_monthly
        [X] [X] [X] [X]  unique_devices_per_project_family_monthly
[X] [X] [X] [X] [X] [X]  druid_load_pageviews_daily_aggregated_monthly
[X] [X] [X] [X] [X] [X]  druid_load_unique_devices_per_domain_daily_aggregated_monthly
        [X] [X] [X] [X]  druid_load_unique_devices_per_domain_monthly
[X] [X] [X] [X] [X] [X]  druid_load_unique_devices_per_project_family_daily_aggregated_monthly
        [X] [X] [X] [X]  druid_load_unique_devices_per_project_family_monthly
[X] [X] [X] [X] [X] [X]  dump_month_of_daily_pageviews

Event Timeline


FYI, assuming that the May backfill finished at the end of yesterday and using @mforns's estimate that a month takes about 3.5 days to backfill, I calculate that the hourly backfill will probably finish by the end of Monday, October 6.

@mforns please pause the Druid backfilling for the moment.

mforns updated the task description.

Backfilling plan updated

  • Pause all loading to Cassandra and do not start any loading to Druid until further notice.
  • Hourly reruns: Continue reruns of all hourly DAGs except cassandra_load_pageview_per_project_hourly. They should finish between the end of Oct 6th and early Oct 7th.
  • Daily reruns: Continue reruns of all daily DAGs except the Cassandra or Druid loading DAGs. The backfilling for July should be triggered after hourly reruns are done with July (estimated trigger time: Friday Oct 3rd, end of day). The backfilling for August should be triggered after hourly reruns are ALMOST done with August, i.e. when processing Aug 20th (estimated trigger time: Monday Oct 6th, start of day). They should finish between the end of Oct 6th and early Oct 7th.
  • Weekly reruns: Do not start browser_metrics_weekly until further notice, since it populates a public dashboard.
  • Monthly reruns: Continue reruns of all monthly DAGs except the Cassandra or Druid loading DAGs. The backfilling for March, April, May and June can be triggered manually starting today, Oct 2nd, as fast as the cluster allows. The backfilling for July should be triggered after hourly reruns are done with July (estimated trigger time: Friday Oct 3rd, end of day). The backfilling for August should be triggered after hourly reruns are ALMOST done with August, i.e. when processing Aug 20th (estimated trigger time: Monday Oct 6th, start of day). They should finish between the end of Oct 6th and early Oct 7th.

Paused reruns: to be resumed once we receive confirmation that we can move on with Cassandra and Druid loading


  • Hourly reruns: Trigger reruns for cassandra_load_pageview_per_project_hourly from 2025-07-23T00 (inclusive) to 2025-08-28T14 (exclusive). They should take about 30 hours to finish.
  • Daily reruns: Trigger reruns for daily cassandra and druid loading jobs for July and August. They should take about 70 hours.
  • Weekly reruns: Trigger reruns for browser_metrics_weekly. They should take about 1 hour.
  • Monthly reruns: Trigger reruns for monthly cassandra and druid loading jobs for March through August. druid_load_pageviews_daily_aggregated_monthly should take about 10 hours. The rest of the DAGs should be run together via script and will take about 2 hours combined.

For the record, all backfills should be resumed.

mforns updated the task description.

We've had an issue where the April partition for the following monthly datasets:

  • unique_devices_per_domain_monthly
  • unique_devices_per_project_family_monthly
  • clickstream_monthly

was computed from incomplete source data and is therefore itself incomplete.

This happened because, by the time we launched the backfilling of the unique devices monthly jobs for April, the 180-day retention limit had already deleted the first couple of days of April's wmf.pageview_actor data. This made the unique devices calculation for April incomplete. The other months are fine, since they have complete source data.

Interesting fact: since we changed the Airflow sensors to probe for the state of the parent DAG instead of the presence of the source data (to allow for more robust reruns), the sensors triggered successfully even though part of the data wasn't there anymore, allowing the incomplete calculation.
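
Since the sensors no longer guard against missing source data, one way to sanity-check a month before triggering its backfill would be something like the following (a sketch only, assuming wmf.pageview_actor is partitioned by year/month/day):

-- count how many distinct days of April 2025 are still present in the source table;
-- anything below 30 means the retention job has already deleted part of the month
SELECT COUNT(DISTINCT day) AS days_present
FROM wmf.pageview_actor
WHERE year = 2025 AND month = 4;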

Data recovery:
I think we can recover the unique devices monthly data for April by copying it over from the former Hive tables (only the newer Iceberg unique devices tables have been backfilled); see the sketch below.
In the case of clickstream, I don't think we can recover the April snapshot. However, clickstream is snapshot-based, and the newer snapshots supersede the incomplete one.
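
For the unique devices tables, that copy could look roughly like this (a sketch only: wmf.unique_devices_per_domain_monthly is assumed to be the name of the former Hive table, both tables are assumed to share the same columns and a (year, month) layout, and the project_family table would be handled the same way):

-- drop the incomplete April 2025 rows from the Iceberg table
DELETE FROM wmf_readership.unique_devices_per_domain_monthly
WHERE year = 2025 AND month = 4;

-- copy the untouched April data over from the legacy Hive table
INSERT INTO wmf_readership.unique_devices_per_domain_monthly
SELECT *
FROM wmf.unique_devices_per_domain_monthly
WHERE year = 2025 AND month = 4;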

mforns updated the task description.
mforns updated the task description.

In a meeting with @mforns, we identified that we had to 'go back in time' for two Iceberg tables to reinstate the data as it was before the incomplete backfill.

We concluded that we need to run the following commands as the analytics service user:

CALL catalog_name.system.rollback_to_snapshot('wmf_readership.unique_devices_per_domain_monthly', 2325163397903337906)

CALL catalog_name.system.rollback_to_snapshot('wmf_readership.unique_devices_per_project_family_monthly', 5614501084050913893)
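
Before running these, it may be worth cross-checking the target snapshot ids against the tables' snapshot history, for example (a sketch; add the catalog prefix if needed):

-- list snapshots with their commit times to confirm which one predates the April backfill
SELECT committed_at, snapshot_id, operation
FROM wmf_readership.unique_devices_per_domain_monthly.snapshots
ORDER BY committed_at;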

Question on this: instead of rolling back and possibly having to rerun Airflow jobs, could we run a query against the said snapshots and insert that data into a new snapshot, overwriting the wrong data?

We executed the snapshot revert commands with success:

spark-sql (default)> CALL spark_catalog.system.rollback_to_snapshot('wmf_readership.unique_devices_per_domain_monthly', 2325163397903337906);
25/10/08 18:50:56 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
previous_snapshot_id	current_snapshot_id
1441360873772091690	2325163397903337906
Time taken: 2.376 seconds, Fetched 1 row(s)

spark-sql (default)> CALL spark_catalog.system.rollback_to_snapshot('wmf_readership.unique_devices_per_project_family_monthly', 5614501084050913893);
25/10/08 18:51:23 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
previous_snapshot_id	current_snapshot_id
4053015571303478409	5614501084050913893
Time taken: 0.344 seconds, Fetched 1 row(s)


@JAllemandou great idea, unfortunately the combination of Spark 3.1.2 and Iceberg 1.2.1 doesn't let us use the VERSION AS OF clause: https://web.archive.org/web/20230611001050/https://iceberg.apache.org/docs/1.2.1/spark-queries/#time-travel. We'd need Spark 3.3 in production.
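
For reference, on Spark 3.3+ the idea would look roughly like this (a sketch, shown for the domain table only; the (year, month) columns are an assumption about the Iceberg schema, and the snapshot id shown is the one identified further down as still holding the good April data):

-- re-insert April as it existed in the good snapshot, without rolling the whole table back
-- (the VERSION AS OF clause needs Spark >= 3.3 with Iceberg)
DELETE FROM wmf_readership.unique_devices_per_domain_monthly
WHERE year = 2025 AND month = 4;

INSERT INTO wmf_readership.unique_devices_per_domain_monthly
SELECT *
FROM wmf_readership.unique_devices_per_domain_monthly VERSION AS OF 9188215025360366880
WHERE year = 2025 AND month = 4;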

But @mforns had another idea that he's trying right now, combining data from the old snapshot and moving it forward. I'll let him fill in the details.

Just kidding! The snapshots with the good old data are:

CALL spark_catalog.system.rollback_to_snapshot('wmf_readership.unique_devices_per_project_family_monthly', 7168788790649123499)
CALL spark_catalog.system.rollback_to_snapshot('wmf_readership.unique_devices_per_domain_monthly', 9188215025360366880)

OK, I think this time it worked.

  • We rolled back to the Iceberg snapshot prior to the corruption of the data by the backfill process.
  • Then we copied April (still not backfilled) into a temp table in the wmf_staging database.
  • Then we rolled the tables forward to their present state.
  • And finally we deleted the April data and copied the old, uncorrupted data from the temp table into the April gap (a rough sketch of the whole sequence is below).
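
For the record, the sequence translates into roughly the following Spark SQL, shown for the domain table only (a sketch: the wmf_staging table name is illustrative, the (year, month) predicate is an assumption about the schema, and the latest snapshot id has to be looked up in the table's metadata):

-- 1. roll back to the last good snapshot, i.e. the one taken before the backfill overwrote April
CALL spark_catalog.system.rollback_to_snapshot('wmf_readership.unique_devices_per_domain_monthly', 9188215025360366880);

-- 2. stash the still-correct April partition in a staging table
CREATE TABLE wmf_staging.unique_devices_per_domain_monthly_april_2025 AS
SELECT * FROM wmf_readership.unique_devices_per_domain_monthly
WHERE year = 2025 AND month = 4;

-- 3. roll the table forward again to its present state
--    (look up the latest snapshot id in the table's snapshots metadata first)
SELECT snapshot_id, committed_at
FROM wmf_readership.unique_devices_per_domain_monthly.snapshots
ORDER BY committed_at DESC LIMIT 1;
CALL spark_catalog.system.set_current_snapshot('wmf_readership.unique_devices_per_domain_monthly', <latest_snapshot_id>);

-- 4. replace the corrupted April rows with the stashed copy
DELETE FROM wmf_readership.unique_devices_per_domain_monthly
WHERE year = 2025 AND month = 4;
INSERT INTO wmf_readership.unique_devices_per_domain_monthly
SELECT * FROM wmf_staging.unique_devices_per_domain_monthly_april_2025;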

The data looks good now. Both unique devices monthly tables have correct data for all months.
Up to the end of April the data was calculated using the old heuristics, whereas from May onward the data was calculated using the new heuristics.