Following the automated traffic detection issues discovered in T395934 and T395727,
and the detection heuristic fixes and Airflow improvements described in T395934 and T402645,
we need to rerun 34 Airflow DAGs (main instance) to backfill the affected datasets from March 21st to August 31st.
The backfill will tag automated traffic correctly and remove the artifacts that have polluted our pageview metrics and the metrics derived from them.
Here's a list of the affected DAGs and their backfilling progress:
[ ] TO DO
[>] IN PROGRESS
[X] BACKFILLED
[!] FIX PENDING
# hourly DAGs (backfill speed: 1 month's worth of data in ~3.5 days)
Mar Apr May Jun Jul Aug
[X] [X] [X] [X] [X] [X] webrequest_actor_metrics_hourly
[X] [X] [X] [X] [X] [X] webrequest_actor_metrics_rollup_hourly
[X] [X] [X] [X] [X] [X] webrequest_actor_label_hourly
[X] [X] [X] [X] [X] [X] pageview_actor_hourly
[X] [X] [X] [X] [X] [X] pageview_hourly
[X] [X] [X] [X] [X] [X] projectview_hourly
[X] [X] [X] [X] [X] [X] projectview_geo
[X] [X] [X] [X] [X] [X] cassandra_load_pageview_per_project_hourly
# daily DAGs (backfill speed: 1 month's worth of data in ~1 day)
Mar Apr May Jun Jul Aug
[X] [X] [X] [X] [X] [X] browser_general_daily
[X] [X] [X] [X] [X] [X] unique_devices_per_domain_daily
[X] [X] [X] [X] [X] [X] unique_devices_per_project_family_daily
[X] [X] [X] [X] [X] [X] cassandra_load_pageview_per_article_daily
[X] [X] [X] [X] [X] [X] cassandra_load_pageview_per_project_daily
[X] [X] [X] [X] [X] [X] cassandra_load_pageview_top_articles_daily
[X] [X] [X] [X] [X] [X] cassandra_load_pageview_top_per_country_daily
[X] [X] [X] [X] [X] [X] cassandra_load_unique_devices_daily
[X] [X] [X] [X] [X] [X] interlanguage_daily
[X] [X] [X] [X] [X] [X] dump_day_of_hourly_pageviews
[X] [X] [X] [X] [X] [X] referrer_daily
[X] [X] druid_load_pageviews_hourly_aggregated_daily (we only keep the last 3 months of data)
# weekly DAGs (backfill speed: 1 month's worth of data in ~3 hours)
Mar Apr May Jun Jul Aug
[X] [X] [X] [X] [X] [X] browser_metrics_weekly
# monthly DAGs (backfill speed: 1 month's worth of data in ~1 hour)
Mar Apr May Jun Jul Aug
[X] [X] [X] [X] [X] [X] cassandra_load_pageview_per_project_monthly
[X] [X] [X] [X] [X] [X] cassandra_load_pageview_top_articles_monthly
[X] [X] [X] [X] [X] [X] cassandra_load_pageview_top_by_country_monthly
[X] [X] [X] [X] cassandra_load_unique_devices_monthly
[X] [X] [X] [X] [X] clickstream_monthly
[X] [X] [X] [X] unique_devices_per_domain_monthly
[X] [X] [X] [X] unique_devices_per_project_family_monthly
[X] [X] [X] [X] [X] [X] druid_load_pageviews_daily_aggregated_monthly
[X] [X] [X] [X] [X] [X] druid_load_unique_devices_per_domain_daily_aggregated_monthly
[X] [X] [X] [X] druid_load_unique_devices_per_domain_monthly
[X] [X] [X] [X] [X] [X] druid_load_unique_devices_per_project_family_daily_aggregated_monthly
[X] [X] [X] [X] druid_load_unique_devices_per_project_family_monthly
[X] [X] [X] [X] [X] [X] dump_month_of_daily_pageviews
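
For planning similar backfills, the approximate speeds quoted in the section headings can be turned into a rough duration estimate for the whole March 21st to August 31st window. The sketch below is illustrative only: it assumes the quoted speeds scale linearly with the amount of data, that they apply to a single DAG running on its own, and that a "month" of data is 30 days. The names `SPEED_PER_MONTH`, `window_in_months`, and `estimate` are hypothetical, and the year is a placeholder (it does not affect the day count, since the window does not span February).

```python
from datetime import date, timedelta

# Approximate speeds quoted in the section headings:
# wall-clock time needed to backfill one month of data, per DAG cadence.
SPEED_PER_MONTH = {
    "hourly": timedelta(days=3.5),
    "daily": timedelta(days=1),
    "weekly": timedelta(hours=3),
    "monthly": timedelta(hours=1),
}

# Assumption: a "month" of data is treated as 30 days.
DAYS_PER_MONTH = 30


def window_in_months(start: date, end: date) -> float:
    """Length of the inclusive [start, end] window, in 30-day months."""
    return ((end - start).days + 1) / DAYS_PER_MONTH


def estimate(start: date, end: date) -> dict[str, timedelta]:
    """Rough backfill duration per cadence, assuming linear scaling."""
    months = window_in_months(start, end)
    return {cadence: speed * months for cadence, speed in SPEED_PER_MONTH.items()}


if __name__ == "__main__":
    # Placeholder year; only the month and day come from the task description.
    durations = estimate(date(2025, 3, 21), date(2025, 8, 31))
    for cadence, duration in durations.items():
        print(f"{cadence:8s} ~{duration.total_seconds() / 86400:.1f} days")
```

These figures are order-of-magnitude estimates; actual run times depend on cluster load, on concurrency settings, and on whether upstream DAGs have finished for the same period.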