As in T414066 (Download enterprise structured content snapshots in HDFS), but using the Enterprise snapshot of Parsoid HTML, and without importing the full data: we will stream input content through Airflow tasks without persisting it to HDFS.
Suggested implementation:
- Write a small Elixir script, for example packaged as a mix task (e.g. lib/mix/tasks/scrape.ex)
- Credentials are wired through the environment, the same as for the mix scrape BashOperator.
- Following lib/enterprise_dumps.ex, call Enterprise.list_available_snapshots
- Check that dewiki is available for the current month; the item we're looking for will roughly match:
  %{
    "chunks" => [...],
    "date_modified" => "2026-01-02T01:20:08.408829866Z",
    "identifier" => "dewiki_namespace_0",
    "in_language" => %{"identifier" => "de"},
    "is_part_of" => %{"identifier" => "dewiki"},
    "namespace" => %{"identifier" => 0},
    "size" => %{"unit_text" => "MB", "value" => 0.123},
    "version" => "0d03dc5d5246bac205bdc5416ca565cd"
  }
- Call this script from a BashSensor in the Airflow job.
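The steps above could be sketched as a mix task along these lines. This is only a sketch: the module/file name and the field-matching logic are assumptions, and only Enterprise.list_available_snapshots comes from this description.

```elixir
# Hypothetical lib/mix/tasks/is_snapshot_available.ex
defmodule Mix.Tasks.IsSnapshotAvailable do
  use Mix.Task

  @shortdoc "Exits 0 if a snapshot for WIKI exists for MONTH (YYYY-MM)"
  def run([wiki, month]) do
    available? =
      Enterprise.list_available_snapshots()
      |> Enum.any?(fn snapshot ->
        # Assumed matching rule, based on the example item above:
        # the snapshot belongs to the wiki and was modified in the given month.
        snapshot["is_part_of"]["identifier"] == wiki and
          String.starts_with?(snapshot["date_modified"], month)
      end)

    # A nonzero exit status tells the BashSensor the snapshot is not ready yet;
    # the sensor keeps poking until the command exits 0.
    unless available?, do: exit({:shutdown, 1})
  end
end
```

Locally this would be run as `mix is-snapshot-available dewiki 2026-01`, with the exit status signalling availability to the sensor.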
Running the script under Airflow is blocked on T414804 (Package scraper binary for Airflow job), but as a Mix task it can be developed in the local Elixir environment and called like mix is-snapshot-available dewiki 2026-01.
Code for review: