
Write an Airflow sensor for scraper page summaries being fully imported to Hive
Closed, Invalid · Public

Description

Add a sensor to the WIP code in T412019: [Epic] Schedule scraper and aggregations as an Airflow job, which does the following:

  • Check that the scraper job finished successfully, e.g. via a scrape >> sensor dependency in the Airflow job.
  • Calculate the hour in which the scraper finishes, and add 1 hour.
  • Wire this calculated hour from the scraper task to the sensor using XComs (see the first sketch after this list).
  • Look for an hourly partition on the event.mediawiki_wmde_page_summary table of the form datacenter=eqiad/year=2026/month=1/day=9/hour=23 that matches this "hour after success".
  • This can use NamedHivePartitionSensor with partition_names_by_granularity at @hourly granularity (see the second sketch after this list).
  • (At this point the page summary data is guaranteed to be complete in Hive.)
  • Expect a roughly 3-hour lag between emitting the page summary events to EventGate (through Kafka and Gobblin) and the data finally persisting in Hive.
  • Nice to have: alert if the lag is much longer than 3 hours (the sensor timeout in the second sketch is one way to get this).
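
A minimal sketch of the "hour after success" calculation and the XCom wiring, assuming Airflow 2 with a classic PythonOperator; the DAG id, task id, and XCom key are hypothetical, and the actual scraper invocation is elided:

```lang=python
from datetime import datetime, timedelta, timezone

from airflow import DAG
from airflow.operators.python import PythonOperator


def _scrape(ti, **_):
    # ... run the scraper here ...
    finished = datetime.now(timezone.utc)
    # Truncate to the hour, then add 1 hour: events emitted during the
    # scrape land in a partition no later than this one.
    hour_after = finished.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    # Push the partition spec for the sensor to pick up. Note the
    # non-zero-padded month/day/hour, matching the partition layout above.
    ti.xcom_push(
        key="hour_after_success",
        value=(
            f"datacenter=eqiad/year={hour_after.year}/month={hour_after.month}"
            f"/day={hour_after.day}/hour={hour_after.hour}"
        ),
    )


with DAG(
    dag_id="wmde_page_summary",  # hypothetical
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
):
    scrape = PythonOperator(task_id="scrape", python_callable=_scrape)
```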
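And a sketch of the sensor, continuing inside the same with DAG block. NamedHivePartitionSensor ships with the Apache Hive provider and treats partition_names as a templated field, so the XCom can be pulled with Jinja; alternatively the partition list could be built with the partition_names_by_granularity helper from WMF's airflow-dags, whose exact interface isn't shown here. The timeout is an assumption, chosen well above the expected 3-hour lag so that a sensor failure doubles as the "lag is much longer than expected" alert:

```lang=python
from datetime import timedelta

from airflow.providers.apache.hive.sensors.named_hive_partition import (
    NamedHivePartitionSensor,
)

wait_for_page_summary = NamedHivePartitionSensor(
    task_id="wait_for_page_summary_partition",
    partition_names=[
        # Pull the partition spec pushed by the scrape task.
        "event.mediawiki_wmde_page_summary/"
        "{{ ti.xcom_pull(task_ids='scrape', key='hour_after_success') }}"
    ],
    poke_interval=5 * 60,  # check the metastore every 5 minutes
    # Expected lag is ~3 hours; failing after 6 turns the sensor into an alert.
    timeout=int(timedelta(hours=6).total_seconds()),
)

scrape >> wait_for_page_summary
```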

Event Timeline

We actually don't know yet if we can achieve what we need here. Let's stall it for now.

New approach: directly insert the page summaries, so no sensor is required.