
Send a critical alert to data-engineering if produce_canary_events isn't running correctly
Closed, Declined · Public

Description

To quote @Ottomata

> if canary events aren’t produced each hour, bad things can happen

produce_canary_events is a systemd timer that is supposed to run every 15 minutes.
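For context, a minimal sketch of what the timer pair might look like; the actual units are provisioned by Puppet, and the names and values here are assumptions, not the deployed configuration:

```
# produce_canary_events.timer (hypothetical; the real unit is Puppet-managed)
[Unit]
Description=Run produce_canary_events every 15 minutes

[Timer]
# Trigger at minutes 0, 15, 30, and 45 of every hour.
OnCalendar=*:0/15
AccuracySec=15s

[Install]
WantedBy=timers.target
```

Note that systemd will not start a new instance of the service while a previous run is still active, so a hung run silently suppresses all subsequent triggers without the unit ever entering a failed state.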

We receive email alerts if produce_canary_events generates errors, and failures are also picked up by the SystemdUnitFailed Alertmanager check.
However, on two occasions we have seen produce_canary_events get stuck and never return.

Ref:

In the most recent case, our canary was effectively in a coma for two weeks and we didn't notice.

We have deployed a change that should help resolve the problem by adding a 10-second timeout to each HTTP call.
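For illustration only (the actual job is Scala, and the endpoint below is made up), the idea of the fix is a bounded per-request timeout, so a stalled connection fails fast instead of hanging the run indefinitely:

```python
import requests

# Hypothetical endpoint; the 10-second cap mirrors the deployed change.
# Without a timeout, a stalled connection can block the whole run forever.
try:
    response = requests.post(
        "https://eventgate.example.org/v1/events",  # made-up URL
        json=[{"meta": {"stream": "canary"}}],
        timeout=10,  # seconds; applies to connect and to each read
    )
    response.raise_for_status()
except requests.exceptions.Timeout:
    # A fast failure lets systemd mark the unit as failed, which the
    # SystemdUnitFailed check can then alert on.
    raise
```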
However, it would also be good to have a specific alert on the health of the canary.
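One possible shape for such an alert, sketched under the assumption that node_exporter's systemd collector is enabled and exports node_systemd_timer_last_trigger_seconds for this timer; the threshold and labels are guesses:

```yaml
groups:
  - name: produce_canary_events
    rules:
      - alert: ProduceCanaryEventsStalled
        # Fires when the timer has not triggered for over an hour, which
        # catches a hung (not failed) service that blocks re-triggering.
        expr: time() - node_systemd_timer_last_trigger_seconds{name="produce_canary_events.timer"} > 3600
        for: 15m
        labels:
          severity: critical
          team: data-engineering
        annotations:
          summary: "produce_canary_events has not run for over an hour"
```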

There are plans to migrate this job from a systemd timer to Airflow, so it's possible that this would be the preferred approach, rather than monitoring the systemd timer/service for staleness.

Event Timeline

> We have deployed a change

FWIW, this is not actually deployed. I ran into an issue with missing Scala dependencies. Will ping more folks about this now.

> There are plans to migrate this job from a systemd timer to Airflow, so it's possible that this would be the preferred approach, rather than monitoring the systemd timer/service for staleness.

@mforns Indeed! And doing this for ProduceCanaryEvents will be easier than Refine, but it is similar to Refine in that it needs to dynamically discover the work to be done. Marcel and I had a prototype for this long ago.

Yes, the ProduceCanaryEvents Airflow DAG would have a similar structure to the Refine one.
The main difference, IIUC, is that the Refine DAG would need to handle late data arrival,
whereas ProduceCanaryEvents won't need that.
Late data arrival could be a difficult feature to implement.

So yea, working on the ProduceCanaryEvents DAG first makes sense, no?

Would moving ProduceCanaryEvents to Airflow solve the Scala dependency problems?

> So yea, working on the ProduceCanaryEvents DAG first makes sense, no?

Ya!

> Would moving ProduceCanaryEvents to Airflow solve the Scala dependency problems?

No, this is something in the refinery jar that I haven't looked into yet.
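For illustration, a hedged sketch of what such a DAG might look like, assuming a recent Airflow 2.x and that the job remains a JVM/spark-submit invocation; every name, schedule, class, and path below is an assumption, not the actual deployment:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="produce_canary_events",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="*/15 * * * *",                   # mirrors the 15-minute systemd timer
    catchup=False,                             # no backfill: canary events are only useful "now"
    dagrun_timeout=timedelta(minutes=10),      # a stuck run fails loudly instead of hanging
    default_args={"retries": 1, "retry_delay": timedelta(minutes=2)},
) as dag:
    # Unlike Refine, no late-data handling is needed: each run dynamically
    # discovers the streams to cover and produces one fresh canary event per stream.
    BashOperator(
        task_id="produce_canary_events",
        bash_command=(
            "spark-submit "
            "--class org.wikimedia.analytics.refinery.job.ProduceCanaryEvents "  # assumed class name
            "/srv/deployment/refinery/artifacts/refinery-job.jar"                # assumed jar path
        ),
    )
```

With catchup=False and a dagrun_timeout, a stalled run becomes a failed DAG run that Airflow's own alerting can surface, which addresses the "stuck but not failed" mode directly.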

There was another occurrence of this situation on 2024-01-17.

TL;DR:
event.mediawiki_page_content_change_v1 and event.mediawiki_page_change_v1 were affected. For some reason the systemd unit stopped responding. @Ottomata killed and restarted it.

Suggestions were made to move the canary mechanism to either Airflow or k8s.

Details are in the Slack thread.

Being bold and declining this, as producing canary events is now scheduled in Airflow.