Like {T414066}, but using the Enterprise snapshot of Parsoid HTML and without importing the full data: input content is streamed through Airflow tasks without persisting to HDFS.
Suggested implementation:
- Write a small Elixir script, packaged for example as a Mix task like [[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/lib/mix/tasks/scrape.ex | lib/mix/tasks/scrape.ex ]]
- Credentials are wired through the environment, same as for the `mix scrape` BashOperator.
- Following [[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/lib/enterprise_dumps.ex | lib/enterprise_dumps.ex ]], call `Enterprise.list_available_snapshots`
- Check that dewiki is available for the current month; the item we're looking for will roughly match:
```
%{
  "chunks" => [...],
  "date_modified" => "2026-01-02T01:20:08.408829866Z",
  "identifier" => "dewiki_namespace_0",
  "in_language" => %{"identifier" => "de"},
  "is_part_of" => %{"identifier" => "dewiki"},
  "namespace" => %{"identifier" => 0},
  "size" => %{"unit_text" => "MB", "value" => 0.123},
  "version" => "0d03dc5d5246bac205bdc5416ca565cd"
}
```
- Call this script from a BashOperator in the Airflow job.
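The availability check could be sketched as below. `Enterprise.list_available_snapshots` is the function from the linked module; the `SnapshotCheck` module name, the zero-argument arity, and the `available?/3` helper are assumptions for illustration only:

```
defmodule SnapshotCheck do
  # Returns true when a namespace-0 snapshot for `wiki` (e.g. "dewiki") was
  # last modified in `month` (a "YYYY-MM" string), matching the map shape above.
  def available?(snapshots, wiki, month) do
    Enum.any?(snapshots, fn
      %{"identifier" => id, "date_modified" => modified} ->
        id == "#{wiki}_namespace_0" and String.starts_with?(modified, month)

      _ ->
        false
    end)
  end
end
```

Matching on `"identifier"` rather than `"is_part_of"` keeps the check to a single key, since the sample identifier already encodes both the wiki and the namespace.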
Running the script under Airflow is blocked on {T414804}, but as a Mix task it can be developed in a local Elixir environment and invoked like `mix is_snapshot_available dewiki 2026-01`.
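Wrapped as a Mix task, this might look like the sketch below. Only `Enterprise.list_available_snapshots` appears in the linked repository; the module name, messages, and exit behaviour are assumptions:

```
defmodule Mix.Tasks.IsSnapshotAvailable do
  use Mix.Task

  @shortdoc "Checks whether an Enterprise snapshot exists for a wiki and month"

  # Invoked as `mix is_snapshot_available dewiki 2026-01`; exits non-zero when
  # the snapshot is missing, so a wrapping BashOperator fails the Airflow task.
  @impl Mix.Task
  def run([wiki, month]) do
    available? =
      Enterprise.list_available_snapshots()
      |> Enum.any?(fn
        %{"identifier" => id, "date_modified" => modified} ->
          id == "#{wiki}_namespace_0" and String.starts_with?(modified, month)

        _ ->
          false
      end)

    if available? do
      Mix.shell().info("snapshot available for #{wiki} #{month}")
    else
      Mix.shell().error("no snapshot for #{wiki} #{month}")
      exit({:shutdown, 1})
    end
  end
end
```

Exiting with `{:shutdown, 1}` gives the BashOperator a non-zero status without printing an Elixir crash report, which keeps the Airflow logs readable.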