
Write an Airflow sensor to detect Enterprise Snapshots
Status: Stalled · Priority: Needs Triage · Visibility: Public

Description

Like T414066: Download enterprise structured content snapshots in hdfs, but for the Enterprise snapshot of Parsoid HTML, and without importing the full data: we will stream snapshot content through Airflow tasks without persisting it to HDFS.

Suggested implementation:

  • Write a small Elixir script (for example packaged as a Mix task, like lib/mix/tasks/scrape.ex)
  • Credentials are wired through the environment, the same as for the mix scrape BashOperator.
  • Following lib/enterprise_dumps.ex, call Enterprise.list_available_snapshots
  • Check that dewiki is available for the current month; the item we're looking for will roughly match:
%{
  "chunks" => [...],
  "date_modified" => "2026-01-02T01:20:08.408829866Z",
  "identifier" => "dewiki_namespace_0",
  "in_language" => %{"identifier" => "de"},
  "is_part_of" => %{"identifier" => "dewiki"},
  "namespace" => %{"identifier" => 0},
  "size" => %{"unit_text" => "MB", "value" => 0.123},
  "version" => "0d03dc5d5246bac205bdc5416ca565cd"
}
  • Call this script from a BashSensor in the Airflow job.
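The steps above could be sketched as a Mix task along the following lines. This is a hedged sketch, not the reviewed code: it assumes Enterprise.list_available_snapshots/0 from lib/enterprise_dumps.ex returns a list of maps shaped like the example above, and the module/argument names are illustrative. Note that Mix derives task names from module names, so this module would be invoked as mix is_snapshot_available (underscores rather than hyphens).

```elixir
defmodule Mix.Tasks.IsSnapshotAvailable do
  @moduledoc "Exits 0 if the given wiki has a snapshot modified in the given month."
  use Mix.Task

  @shortdoc "Check snapshot availability, e.g. mix is_snapshot_available dewiki 2026-01"

  @impl Mix.Task
  def run([wiki, month]) do
    # Assumed helper from lib/enterprise_dumps.ex; returns maps like the
    # "dewiki_namespace_0" example in the task description.
    available? =
      Enterprise.list_available_snapshots()
      |> Enum.any?(fn snapshot ->
        snapshot["is_part_of"]["identifier"] == wiki and
          String.starts_with?(snapshot["date_modified"], month)
      end)

    # A BashSensor keeps poking until the command exits 0, so signal
    # "not yet available" with a non-zero exit code.
    unless available?, do: exit({:shutdown, 1})
  end
end
```

The exit-code convention is what lets the Airflow BashSensor poll this task: it re-runs the command on each poke interval and only succeeds once the command exits 0.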

Running the script under Airflow is blocked on T414804: Package scraper binary for Airflow job, but as a Mix task it can be developed in the local Elixir environment and called like mix is-snapshot-available dewiki 2026-01.

Code for review:

Event Timeline

awight updated the task description.
awight changed the task status from Open to Stalled. Fri, Jan 30, 10:48 AM

The last steps are blocked on the BashOperator image lacking libssl.