Currently, the [[ https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Hadoop_Event_Ingestion_Lifecycle#ProduceCanaryEvents | ProduceCanaryEvents ]] job uses a discovery/scheduling mechanism similar to, but simpler than, the one used by [[ https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Refine | Refine jobs ]].
Both of these jobs need to discover the datasets for which they need to do work. For Refine, we need a way to discover work, detect failures, rerun, mark work as done, etc. For ProduceCanaryEvents, we only need to discover work and do it. Detecting failures would be nice, but since there shouldn't be any dependent downstream jobs, it is less critical.
So we'll need to solve similar dynamic work discovery for both of these jobs. Some ideas for how to do this for Refine are in {T307505}.
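To make the "dynamic work discovery" idea concrete, here is a minimal sketch of the pattern in Python. The helper names (`discover_streams`, `produce_canary_events`) and the config shape are hypothetical, not the real job's API; the point is only that the job enumerates streams from config and runs one independent unit of work per stream, so each stream's failure is visible on its own.

```python
# Hypothetical sketch of dynamic per-stream work discovery.
# `stream_config` stands in for whatever stream configuration source the
# real job reads; the key names here are invented for illustration.

def discover_streams(stream_config: dict) -> list[str]:
    """Return the streams for which canary events should be produced."""
    return [
        name
        for name, cfg in stream_config.items()
        if cfg.get("canary_events_enabled")
    ]

def produce_canary_events(stream: str) -> bool:
    # Placeholder: the real job would produce canary events for this
    # stream and return whether that succeeded.
    return True

def run(stream_config: dict) -> dict[str, bool]:
    # One unit of work per discovered stream, so failures can be
    # inspected and alerted on per stream rather than per whole job.
    return {s: produce_canary_events(s) for s in discover_streams(stream_config)}
```

In an Airflow DAG, the same shape maps naturally onto generating one task per discovered stream, which is what would make per-stream runs visible in the Airflow UI.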
We should do this task before we work on Refine, as it will help us answer dynamic Airflow job questions, but with much lower risk.
Hopefully, doing this will help us better maintain this job and troubleshoot issues when they arise, e.g. {T326002}.
Done is:
[] Airflow can schedule ProduceCanaryEvents at least once an hour, ideally multiple times
[] Airflow schedules runs per dataset, so we can visually inspect (in the Airflow UI) and alert on ProduceCanaryEvents failures for a specific stream
[] {T337055}