Summary
The MediaWiki History monthly DAG (mediawiki_history_denormalize) averages ~56h end-to-end wall-clock, of which ~46h (~80%) is sqoop wait, not Spark compute. The tail in every recent month is three tables from the mediawiki_private database — actor, comment, and centralauth_localuser — which consistently land 1–2 days after the public sqoop tables. Spark can't start the monthly rebuild until all sensors are green, so these three tables alone gate the entire job.
Pulling these private tables onto (or close to) the public-table timeline would save roughly a full day of DAG wall-clock for near-zero engineering cost on the history-pipeline side. This is the single highest-leverage freshness win available short of the full daily-delta work.
Findings from the last 5 monthly runs
Data pulled from Airflow URLSensor completion times in mediawiki_history_denormalize (Airflow link):
| Snapshot | Spark denormalize_history | DAG wall-clock | Sqoop wait (wall-clock minus Spark) | Notes |
|---|---|---|---|---|
| 2026-03 | 11h 04m | 44h 13m | ~33h | |
| 2026-02 | 14h 23m | 52h 53m | ~38h | try_number=4 (3 failed Spark attempts) |
| 2026-01 | 9h 12m | 41h 24m | ~32h | |
| 2025-12 | 9h 23m | 66h 10m | ~57h | |
| 2025-11 | 8h 58m | 78h 46m | ~70h | SLA miss |
| avg | ~10.6h | ~56.7h | ~46h | |
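The Sqoop wait column appears to be DAG wall-clock minus Spark time. That reading is an assumption, but it reproduces every row. A minimal Python sketch of the arithmetic, with durations copied from the table (the names `RUNS`, `to_hours`, and `sqoop_wait_hours` are illustrative only):

```python
# Durations copied from the table as (hours, minutes).
RUNS = {
    "2026-03": {"spark": (11, 4), "wall": (44, 13)},
    "2026-02": {"spark": (14, 23), "wall": (52, 53)},
    "2026-01": {"spark": (9, 12), "wall": (41, 24)},
    "2025-12": {"spark": (9, 23), "wall": (66, 10)},
    "2025-11": {"spark": (8, 58), "wall": (78, 46)},
}


def to_hours(hm):
    """Convert an (hours, minutes) pair to fractional hours."""
    hours, minutes = hm
    return hours + minutes / 60


def sqoop_wait_hours(run):
    # Assumption: all wall-clock time that is not Spark compute is sqoop wait.
    return to_hours(run["wall"]) - to_hours(run["spark"])


waits = {snapshot: sqoop_wait_hours(run) for snapshot, run in RUNS.items()}
avg_spark = sum(to_hours(r["spark"]) for r in RUNS.values()) / len(RUNS)
avg_wall = sum(to_hours(r["wall"]) for r in RUNS.values()) / len(RUNS)
avg_wait = sum(waits.values()) / len(waits)
wait_share = avg_wait / avg_wall

for snapshot, wait in waits.items():
    print(f"{snapshot}: sqoop wait ~{wait:.1f}h")
print(f"avg Spark ~{avg_spark:.1f}h, avg wall-clock ~{avg_wall:.1f}h, "
      f"avg sqoop wait ~{avg_wait:.1f}h ({wait_share:.0%} of wall-clock)")
```

Under that assumption, sqoop wait comes out at roughly four fifths of the average wall-clock, which matches the ~80% figure in the summary.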
In every one of these five runs, the last three URLSensors to complete were actor, comment, and centralauth_localuser. Public-database tables (revision, archive, page, logging, user, user_groups, change_tag, etc.) typically land ~24–48h earlier.
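One way to quantify the tail for a single run is to rank sensors by completion time and measure how long the last few hold up the job after everything else has landed. The sketch below is illustrative only: the timestamps are made up, and `last_to_complete` and `gating_delay_hours` are hypothetical helpers, not part of Airflow or the DAG code.

```python
from datetime import datetime


def last_to_complete(completions, n=3):
    """Return the n sensor names that finished last, latest first."""
    return sorted(completions, key=completions.get, reverse=True)[:n]


def gating_delay_hours(completions, tail):
    """Hours between the last non-tail sensor and the last sensor overall."""
    rest = [t for name, t in completions.items() if name not in tail]
    delta = max(completions.values()) - max(rest)
    return delta.total_seconds() / 3600


# Hypothetical completion times for illustration; NOT real Airflow data.
example = {
    "revision": datetime(2026, 4, 1, 6, 0),
    "page": datetime(2026, 4, 1, 7, 30),
    "logging": datetime(2026, 4, 1, 9, 0),
    "actor": datetime(2026, 4, 2, 8, 0),
    "comment": datetime(2026, 4, 2, 10, 0),
    "centralauth_localuser": datetime(2026, 4, 2, 11, 0),
}

tail = last_to_complete(example)
delay = gating_delay_hours(example, tail)
print(tail, f"gate the job for {delay:.0f}h")
```

Running the same calculation on the real sensor completion times for each snapshot would show how much of each month's wait is attributable to the three private tables alone.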
Why this matters
- Headline number: ~46h of a ~56h monthly wall-clock is sqoop wait. Even halving the Spark side (Option C in the umbrella plan) would only save ~5h — moving the private-table timeline is worth ~5× that.
- Cadence prerequisite: as long as a monthly rebuild is 3–7 days late every month, any weekly cadence for the monthly-style full rebuild is infeasible — regardless of what we do in Spark.
- Retry slack: the 2026-02 run succeeded on try 4, i.e. after 3 failed Spark attempts. Reducing sqoop-side delay and jitter increases downstream retry slack: with more breathing room before the SLA, a failed Spark attempt is less likely to turn into an SLA miss.
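The leverage comparison in the headline bullet can be checked with a few lines of arithmetic. This sketch assumes the ~24h saving targeted by this task and the ~10.6h average Spark time from the findings table; the variable names are illustrative.

```python
AVG_SPARK_HOURS = 10.6          # average Spark time from the findings table
TARGET_SQOOP_SAVING_HOURS = 24  # target saving for this task (roughly one day)

# Halving the Spark side saves half of the average Spark time.
spark_saving = AVG_SPARK_HOURS / 2

# How many times larger the sqoop-side saving is than the Spark-side saving.
leverage = TARGET_SQOOP_SAVING_HOURS / spark_saving

print(f"Halving Spark saves ~{spark_saving:.1f}h; "
      f"the sqoop-side target is ~{leverage:.1f}x that")
```

With these inputs the ratio is about 4.5, in line with the rounded ~5× figure.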
Proposed work
- Root-cause why actor, comment, centralauth_localuser land later than the public tables.
- Move these three tables onto the earlier timeline. The specific change depends on the root cause.
- Verify savings over two monthly runs before declaring the work done. Target: monthly DAG wall-clock drops by ~24h consistently.
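The verification step could be expressed as a simple check. This is a sketch under stated assumptions: the baseline is the five-run mean wall-clock computed from the findings table (~56.7h), "consistently" is read as "every run in the sample", and the 2h tolerance around the ~24h target is an arbitrary illustrative value, not an agreed threshold.

```python
def savings_verified(baseline_hours, new_run_hours,
                     target_saving=24.0, tolerance=2.0, min_runs=2):
    """Return True if at least min_runs runs each beat the baseline
    by roughly target_saving hours (within tolerance)."""
    if len(new_run_hours) < min_runs:
        return False  # not enough monthly runs observed yet
    return all(
        baseline_hours - hours >= target_saving - tolerance
        for hours in new_run_hours
    )


BASELINE_HOURS = 56.7  # mean DAG wall-clock over the five runs in the table

# Hypothetical post-change wall-clock values, for illustration only.
print(savings_verified(BASELINE_HOURS, [31.0, 33.5]))  # both ~24h faster
print(savings_verified(BASELINE_HOURS, [31.0]))        # only one run so far
print(savings_verified(BASELINE_HOURS, [31.0, 44.0]))  # second run regressed
```

Requiring every run to clear the bar, rather than the average, keeps one unusually fast month from masking a regression in the next.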