
Optimize canary event generation resource consumption on Airflow
Open, Needs Triage, Public

Description

The current implementation of canary events puts unnecessary pressure on Airflow main.

  • Canary events are launched as individual Skein jobs, which doubles the number of containers spawned (k8s + YARN).
  • They run twice an hour.

Quick wins:

  • Now that we no longer rely on Gobblin to set the end of the partition (we use data in the next hour instead), a single canary event per hour is sufficient (instead of 2).
  • We can port the work done in analytics_test/dags/canary_events/canary_events_kubernetes_staging_dag.py to run the generation within the k8s pod (no Skein).
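
As a rough back-of-the-envelope check of the two quick wins, the sketch below counts container spawns per stream per day. It assumes, following the description above, that each run currently spawns two containers (one k8s pod plus one YARN container via Skein) and would spawn one after the change. The function name and parameters are illustrative only and are not taken from the Airflow code base.

```python
HOURS_PER_DAY = 24


def container_spawns_per_day(runs_per_hour: int, containers_per_run: int, streams: int = 1) -> int:
    """Count container spawns per day for a number of canary streams.

    Assumption: every run spawns a fixed number of containers.
    """
    return runs_per_hour * containers_per_run * HOURS_PER_DAY * streams


# Current setup as described: 2 runs per hour, k8s pod + YARN container per run.
current = container_spawns_per_day(runs_per_hour=2, containers_per_run=2)

# After both quick wins: 1 run per hour, generation inside the k8s pod only.
proposed = container_spawns_per_day(runs_per_hour=1, containers_per_run=1)

print(current, proposed)  # 96 24
```

Under these assumptions the two quick wins together cut container spawns by a factor of four per stream.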

If that is not enough, we can consider batching all canary event generation into a single job.

  • Expected benefits: a single task per hour, hence less pressure on Airflow.
  • Known cons:
    • Harder to debug individual canary event failures.
    • Needs partial failure handling.
    • Can't sense on task execution.
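
To illustrate what partial failure handling could look like in a batched job, here is a minimal sketch. It assumes a hypothetical `produce_one(stream)` callable that sends one canary event and raises on error; `BatchResult`, `PartialCanaryFailure` and `produce_canary_batch` are names invented for this example, not existing code. The idea is to attempt every stream, record failures per stream, and fail the task only once the whole batch has run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class BatchResult:
    succeeded: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)  # stream -> error message


class PartialCanaryFailure(RuntimeError):
    """Raised after the whole batch ran when at least one stream failed."""

    def __init__(self, result: BatchResult):
        self.result = result
        names = ", ".join(sorted(result.failed))
        super().__init__(f"{len(result.failed)} canary event(s) failed: {names}")


def produce_canary_batch(
    streams: Iterable[str],
    produce_one: Callable[[str], None],  # hypothetical per-stream producer
) -> BatchResult:
    """Try every stream; one failure must not prevent the others from running."""
    result = BatchResult()
    for stream in streams:
        try:
            produce_one(stream)
            result.succeeded.append(stream)
        except Exception as exc:
            # Record the error and keep going so the remaining streams still run.
            result.failed[stream] = f"{type(exc).__name__}: {exc}"
    if result.failed:
        # Fail the single task so the problem is visible, with per-stream details.
        raise PartialCanaryFailure(result)
    return result
```

The exception message lists the failing streams, which partly offsets the debugging drawback noted above, but a retry of the task would still re-run the whole batch unless the failed subset is tracked separately.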

Event Timeline

Change #1218232 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/deployment-charts@master] Allow tests of canary events generation from airflow-dev

https://gerrit.wikimedia.org/r/1218232

Change #1218232 merged by jenkins-bot:

[operations/deployment-charts@master] Allow tests of canary events generation from airflow-dev

https://gerrit.wikimedia.org/r/1218232

Change #1228599 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/deployment-charts@master] Allow connections to eventgates from Airflow

https://gerrit.wikimedia.org/r/1228599

Change #1229072 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] global_config: add external-services for all eventgate LVS endpoints

https://gerrit.wikimedia.org/r/1229072

Change #1229072 abandoned by Brouberol:

[operations/puppet@production] global_config: add external-services for all eventgate LVS endpoints

Reason:

We're going to be relying on envoy proxying instead of direct access to these discovery endpoints from kubernetes

https://gerrit.wikimedia.org/r/1229072

Change #1229517 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Update services_proxy/envoy.yaml for eventgate

https://gerrit.wikimedia.org/r/1229517

Change #1229524 had a related patch set uploaded (by Joal; author: Joal):

[operations/deployment-charts@master] Update dse-k8s-eqiad airflow values

https://gerrit.wikimedia.org/r/1229524

Change #1229517 merged by Brouberol:

[operations/puppet@production] Update services_proxy/envoy.yaml for eventgate

https://gerrit.wikimedia.org/r/1229517