The compute_metrics task of the mw_content_reconcile_mw_content_history_daily DAG takes, on average, 5.4 hours to complete.
Additionally, we have had multiple incidents in which the Spark driver ran out of memory (OOM) (see T400830, T387033).
In this task, we should:
- Investigate why the metrics take so long, and what optimizations we could do to run them faster.
- Investigate why the driver continues to need more memory, and what optimizations we could do to use less memory. This is a priority.
- Check whether the improvements we find for the daily run can be applied to the monthly run as well; presumably they can.
