To quote @Ottomata:
> if canary events aren’t produced each hour, bad things can happen
produce_canary_events is a systemd timer that is supposed to run every 15 minutes.
We receive email alerts if produce_canary_events generates errors, and a failed run will also be picked up by the SystemdUnitFailed alertmanager check.
However, on two occasions we have seen produce_canary_events getting stuck and not returning correctly.
Ref:
- T330236: Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?)
- conversation in #data-engineering on Slack.
In the most recent case, our canary was effectively in a coma for two weeks and we didn't notice.
We have deployed a change that should help resolve the problem by adding a 10s timeout value to each HTTP call.
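For illustration only, here is a minimal Python sketch of what a per-call timeout looks like. It is not the deployed change: the function name `post_canary_event`, the URL and the payload are hypothetical, and only the 10s value comes from the description above.

```python
import json
import urllib.request

HTTP_TIMEOUT_SECONDS = 10  # the 10s value mentioned above


def post_canary_event(url, event, timeout=HTTP_TIMEOUT_SECONDS):
    """POST one event as JSON (hypothetical helper, not the real producer).

    Returns True on a 2xx response, and False on a network-level failure
    or when the server does not answer within `timeout` seconds.
    """
    body = json.dumps(event).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        # The timeout applies to blocking socket operations (connect, read),
        # so a server that never answers can no longer hang the job forever.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

With a bounded call like this, a stuck endpoint turns into a failed run, which the existing error emails and the SystemdUnitFailed check can then see, provided the job exits non-zero when a call fails.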
However, it would also be good to have a specific alert that checks on the health of the canary.
There are plans to migrate this job from a systemd timer to Airflow, so alerting on the Airflow job may turn out to be the preferred approach, rather than monitoring the systemd timer/service for staleness.
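Whichever scheduler ends up running the job, the staleness check itself is simple. The Python sketch below shows one possible shape. The one-hour threshold is an assumption based on the "each hour" requirement quoted above, and how the last successful run time is obtained (systemd, Airflow, or the event stream itself) is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: canary events are expected at least once per hour.
MAX_AGE = timedelta(hours=1)


def canary_is_stale(last_success, now=None, max_age=MAX_AGE):
    """Return True if the last successful canary run is too old.

    `last_success` is a timezone-aware datetime, or None if no successful
    run has ever been recorded (which is also treated as stale).
    """
    if last_success is None:
        return True
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_success > max_age
```

Unlike a failure alert, this check also fires when the job hangs and never reports anything, which is the case that went unnoticed for two weeks.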