The current implementation of canary events puts unnecessary pressure on Airflow main.
- Canary events are launched as individual Skein jobs, which doubles the number of containers spawned (one on k8s plus one on YARN).
- They run twice an hour.
Quick wins:
- Now that we no longer rely on Gobblin to mark the end of a partition (we use data in the next hour instead), a single canary event per hour is sufficient, instead of two.
- We can reuse the work done in analytics_test/dags/canary_events/canary_events_kubernetes_staging_dag.py to run the generation directly inside the k8s pod (no Skein job).
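As a rough sketch of what "generation inside the pod" could look like, the pod entrypoint would build and emit the event in-process instead of submitting a Skein job. All names and the payload schema below are illustrative assumptions, not the actual canary event format:

```python
import json
from datetime import datetime, timezone


def build_canary_event(dataset: str, partition_hour: datetime) -> str:
    """Build a canary event payload for one dataset partition.

    Hypothetical sketch: field names are illustrative, not the real schema.
    """
    return json.dumps({
        "dataset": dataset,
        "partition": partition_hour.strftime("%Y-%m-%dT%H:00:00Z"),
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "canary": True,
    })


def main() -> None:
    # Entrypoint the k8s pod would run directly, replacing the Skein
    # submission: generate and emit the event in the same container.
    event = build_canary_event("example_dataset", datetime(2024, 1, 1, 12))
    print(event)  # in practice: write to the event sink instead of stdout
```

This keeps the container count at one per run (the k8s pod) instead of two (pod + YARN container for the Skein application).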
If that is not enough, we can consider batching all canary event generation into a single job.
- Expected benefits: a single task per hour => less pressure on Airflow
- Known cons:
- Harder to debug individual canary event failures.
  - Requires partial-failure handling (one dataset's failure shouldn't block the others).
  - Downstream DAGs can no longer sense on individual per-dataset task executions.
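The partial-failure concern could be addressed along these lines: iterate over all datasets in the single task, collect per-dataset errors instead of aborting on the first one, and fail the task at the end if anything broke. This is a sketch with hypothetical names (`generate_all_canary_events`, `emit_fn`), not the actual implementation:

```python
def generate_all_canary_events(datasets, emit_fn):
    """Generate canary events for every dataset within one batched task.

    Partial-failure handling: keep going past individual failures, then
    raise once at the end so the task is marked failed with a full report.
    """
    failures = {}
    for dataset in datasets:
        try:
            emit_fn(dataset)
        except Exception as exc:
            # Record and continue so one bad dataset doesn't block the rest.
            failures[dataset] = exc
    if failures:
        raise RuntimeError(
            f"canary generation failed for: {sorted(failures)}"
        )
```

The end-of-loop exception keeps individual failures visible in the task log, which partially mitigates the debugging downside, though it does not restore per-dataset sensing.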