
ProduceCanaryEvents job should be scheduled by Airflow and/or a k8s service
Closed, Resolved (Public)

Description

The ProduceCanaryEvents job is currently (2024-01) scheduled by a systemd timer on an-launcher1002. Hive ingestion and Airflow job dependencies rely on canary events to manifest Hive partitions for Kafka topics on which there is no data, e.g. in the inactive datacenter.

The important requirement for ProduceCanaryEvents is that at least one canary event is produced every hour for each Kafka topic of every stream. The systemd timer currently runs several times an hour. There is no harm in producing more canary events.
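The hourly requirement above can be checked mechanically. The following is a minimal sketch, assuming we already have the timestamps of the canary events seen on one topic; the function name and the sample timestamps are hypothetical and not part of the existing job.

```python
from datetime import datetime, timedelta, timezone


def hours_without_canary(event_times, start, end):
    """Return each hour start, from the hour containing `start` up to `end`
    (exclusive), for which no canary event was seen."""
    # Truncate every event time to the top of its hour.
    covered = {t.replace(minute=0, second=0, microsecond=0) for t in event_times}
    missing = []
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        if hour not in covered:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing


# Hypothetical canary timestamps for one topic: nothing was produced during 02:00.
utc = timezone.utc
seen = [
    datetime(2024, 1, 18, 0, 15, tzinfo=utc),
    datetime(2024, 1, 18, 0, 45, tzinfo=utc),
    datetime(2024, 1, 18, 1, 15, tzinfo=utc),
    datetime(2024, 1, 18, 3, 15, tzinfo=utc),
]
missing = hours_without_canary(
    seen,
    datetime(2024, 1, 18, 0, tzinfo=utc),
    datetime(2024, 1, 18, 4, tzinfo=utc),
)
print(missing)
```

Extra events within the same hour (00:15 and 00:45 here) are harmless; only an hour with none is reported.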

As experienced in T337055: Send a critical alert to data-engineering if produce_canary_events isn't running correctly, it is possible for the job scheduled by the systemd timer to fail for longer than an hour. Perhaps it is stuck for some reason, or perhaps the systemd timer was down due to maintenance.

There are two potential solutions to fix this problem:

Airflow scheduled ProduceCanaryEvents

Currently, ProduceCanaryEvents uses a discovery/scheduling mechanism similar to, but simpler than, the one used by Refine jobs.

Both of these jobs need to discover the datasets for which they need to do work. For Refine, we need a way to discover work, detect failures, rerun, mark as done, etc. For ProduceCanaryEvents, we only need to discover work and do it. Detecting failures would be nice, but there aren't any directly dependent downstream jobs.

So we'll need to solve similar dynamic work discovery for both of these jobs. Some ideas for how to do this for Refine are in T307505: Refine jobs should be scheduled by Airflow.
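As a rough illustration of the "discover work and do it" step for ProduceCanaryEvents, here is a minimal sketch. It assumes the stream configuration is available as a plain mapping; the config keys, stream names and topic prefixes below are hypothetical and do not reflect the real stream config schema.

```python
# Hypothetical stream config: stream name -> settings. The keys and names are
# illustrative assumptions, not the real Event Platform stream config.
STREAM_CONFIG = {
    "example.stream_a": {
        "canary_enabled": True,
        "topics": ["dc1.example.stream_a", "dc2.example.stream_a"],
    },
    "example.stream_b": {
        "canary_enabled": False,
        "topics": ["dc1.example.stream_b", "dc2.example.stream_b"],
    },
    "example.stream_c": {
        "canary_enabled": True,
        "topics": ["dc1.example.stream_c"],
    },
}


def discover_canary_work(stream_config):
    """Return sorted (stream, topic) pairs that need a canary event."""
    work = []
    for stream, settings in stream_config.items():
        # Skip streams that have not opted in to canary events.
        if not settings.get("canary_enabled", False):
            continue
        for topic in settings.get("topics", []):
            work.append((stream, topic))
    return sorted(work)


work = discover_canary_work(STREAM_CONFIG)
for stream, topic in work:
    print(stream, topic)
```

Under Airflow, each discovered pair could become one unit of work per hour; how to map that onto dynamic tasks is exactly the open question this task is meant to answer.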

We should do this task before we work on Refine, as it will help us answer questions about dynamic Airflow jobs, but with much lower risk.

Hopefully, doing this will help us better maintain this job and troubleshoot issues when they arise, e.g. T326002: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors and T337055: Send a critical alert to data-engineering if produce_canary_events isn't running correctly.

Doing this in Airflow will also let us 'backfill' canary events. If a canary event is not produced for some stream topic for some hour, Airflow will mark the task as not run or failed, and eventually run it, producing the canary event with an appropriate event timestamp.

Backfills will be useful in many situations, but they will also possibly result in late events, which could cause some issues and confusion for downstream jobs.
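To make the backfill and late-event trade-off concrete, here is a minimal sketch of stamping a canary event with the scheduled hour rather than the wall-clock time. The event shape (a `meta` object with `stream`, `dt` and a `domain` of "canary") and both function names are assumptions for illustration, not the actual CanaryEventProducer output.

```python
from datetime import datetime, timedelta, timezone


def canary_event_for_hour(stream, scheduled_hour):
    """Build a canary event whose timestamp is the scheduled hour."""
    return {
        "meta": {
            "stream": stream,
            # Assumed field names; the real event schema may differ.
            "dt": scheduled_hour.isoformat(),
            "domain": "canary",
        }
    }


def lateness(event, produced_at):
    """Return how long after its own timestamp the event was produced."""
    event_dt = datetime.fromisoformat(event["meta"]["dt"])
    return produced_at - event_dt


utc = timezone.utc
# A task for the 02:00 hour that only ran (was backfilled) at 05:30.
event = canary_event_for_hour("example.stream_a", datetime(2024, 1, 18, 2, tzinfo=utc))
delay = lateness(event, datetime(2024, 1, 18, 5, 30, tzinfo=utc))
print(event["meta"]["dt"], delay)
```

The backfilled event carries the 02:00 timestamp but reaches Kafka three and a half hours later, which is the kind of late event downstream jobs would need to tolerate.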

Done is:

ProduceCanaryEvents k8s service(s)

This would be a long-running k8s service or k8s cron job that periodically produces canary events. (If a service, it would just sleep between periods.) A k8s service would allow us to run ProduceCanaryEvents in both main datacenters (in wikikube), increasing the availability of canary events. Backfilling would not be supported, as the service would only produce events for the current time.
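The "sleep between periods" variant could look roughly like the sketch below. It assumes the actual producing logic is wrapped in a callable; `run_canary_service`, `produce_once` and the 900-second interval are hypothetical names and values, and `max_runs` exists only so the loop can be exercised outside a real deployment.

```python
import time


def run_canary_service(produce_once, interval_seconds, sleep=time.sleep, max_runs=None):
    """Call produce_once repeatedly, sleeping interval_seconds between runs.

    Returns the number of failed runs. With max_runs=None the loop never ends.
    """
    runs = 0
    failures = 0
    while max_runs is None or runs < max_runs:
        try:
            produce_once()
        except Exception as error:
            # A failed run must not kill the service; the next period tries again.
            failures += 1
            print(f"canary run failed: {error}")
        runs += 1
        # Sleep only if another run will follow.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return failures


# Exercise the loop with a fake producer that fails on its second call
# and a fake sleep that only records the requested durations.
calls = []


def fake_produce():
    calls.append(len(calls))
    if len(calls) == 2:
        raise RuntimeError("simulated failure")


naps = []
failed = run_canary_service(fake_produce, 900, sleep=naps.append, max_runs=3)
print(len(calls), naps, failed)
```

Running several times an hour means a single failed run still leaves room to meet the hourly requirement, since extra canary events do no harm.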

Done is:

  • Deployment CI Pipeline produced docker image for ProduceCanaryEvents
  • helm chart + helmfile for ProduceCanaryEvents
  • ProduceCanaryEvents k8s service/cronjob running in both main datacenters.

Because it never hurts to produce extra canary events, these two solutions are not incompatible. We could, and perhaps should, do both.

Event Timeline

Oo, it would be really nice if we could modify the job logic a little bit, to be able to produce events with the time appropriate for the scheduled task time. That way, we could backfill more easily.

This might be more complicated given the decision we just made in T267648#8995454.

Another idea: instead of migrating to Airflow, make this a dedicated long-lived service and run it in k8s. We could add metrics about produced events.

Ottomata renamed this task from "ProduceCanaryEvents job should be scheduled by Airflow" to "ProduceCanaryEvents job should be scheduled by Airflow and/or a k8s service". (Jan 18 2024, 6:24 PM)
Ottomata updated the task description.

Change 1011336 had a related patch set uploaded (by Joal; author: Joal):

[wikimedia-event-utilities@master] Update CanaryEventProducer

https://gerrit.wikimedia.org/r/1011336

Change 1011336 merged by jenkins-bot:

[wikimedia-event-utilities@master] Update CanaryEventProducer

https://gerrit.wikimedia.org/r/1011336

Change 1011354 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Update ProduceCanaryEvents job

https://gerrit.wikimedia.org/r/1011354

Change 1011354 merged by jenkins-bot:

[analytics/refinery/source@master] Update ProduceCanaryEvents job

https://gerrit.wikimedia.org/r/1011354