Send a critical alert to data-engineering if produce_canary_events isn't running correctly
Closed, Declined (Public)

Assigned To

None

Authored By

	BTullis
	May 19 2023, 5:07 PM

Description

To quote @Ottomata

if canary events aren’t produced each hour, bad things can happen

produce_canary_events is a systemd timer that is supposed to run every 15 minutes.

We receive email alerts if produce_canary_events generates errors, and a failed run is also picked up by the SystemdUnitFailed alertmanager check.
However, on two occasions we have seen produce_canary_events get stuck without ever returning, so it never registered as a failure.

Ref:

In the most recent case, our canary was effectively in a coma for two weeks and we didn't notice.

We have deployed a change that should help resolve the problem by adding a 10s timeout value to each HTTP call.
However, it would also be good to have a specific alert that checks the health of the canary.
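For illustration only, here is a minimal Python sketch of what a per-call timeout looks like. The real job lives in the refinery jar and is not written this way; `post_canary_event`, the `opener` parameter, and the URL in the usage note are hypothetical names made up for this example. Only the 10s value comes from the task.

```python
import json
import socket
import urllib.error
import urllib.request

HTTP_TIMEOUT_S = 10.0  # the 10s value mentioned in this task


def post_canary_event(url, event, timeout_s=HTTP_TIMEOUT_S,
                      opener=urllib.request.urlopen):
    """POST one event as JSON; return True on a 2xx response.

    Returns False on a timeout or any other HTTP/connection error,
    instead of blocking forever on an unresponsive endpoint.
    `opener` is injectable so the function can be exercised without a network.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, socket.timeout, TimeoutError):
        # URLError also covers HTTPError (non-2xx responses raised by urllib).
        return False
```

Usage would be something like `post_canary_event("https://example.invalid/v1/events", {"meta": {"domain": "canary"}})`. Note that the `timeout` passed to `urlopen` bounds each blocking socket operation rather than the whole request, so it guards against a hung connection but is not a hard limit on total run time.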

There are plans to migrate this job from a systemd timer to Airflow, so it's possible that this would be the preferred approach, rather than monitoring the systemd timer/service for staleness.
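Whichever scheduler ends up running the job, the staleness check itself is simple. A rough sketch, assuming the job records a timestamp each time it completes successfully (how that timestamp is stored and read is left out here). `is_stale` and `GRACE_FACTOR` are hypothetical names and values; only the 15 minute interval comes from this task.

```python
from datetime import datetime, timedelta, timezone

EXPECTED_INTERVAL = timedelta(minutes=15)  # timer cadence stated in this task
GRACE_FACTOR = 4  # hypothetical: tolerate a few missed runs before alerting


def is_stale(last_success, now, interval=EXPECTED_INTERVAL,
             grace_factor=GRACE_FACTOR):
    """Return True if no successful run finished within interval * grace_factor.

    `last_success` is the completion time of the last successful run,
    or None if no successful run has ever been recorded.
    """
    if last_success is None:
        return True
    return now - last_success > interval * grace_factor


now = datetime(2024, 1, 17, 12, 0, tzinfo=timezone.utc)
recently = is_stale(now - timedelta(minutes=30), now)   # within the grace window
two_weeks = is_stale(now - timedelta(days=14), now)     # the "coma" case
```

The point of checking the last success rather than the last failure is that a stuck run never fails, so a failure-based alert such as SystemdUnitFailed has nothing to fire on.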

Related Objects

Status	Assigned	Task
Duplicate	None	T345698 [Epic] define a strategy around alerting for Data Platform SRE and implement it
Open	None	T346438 [Epic] Review alerting strategy for Data Platform SRE
Declined	None	T337055 Send a critical alert to data-engineering if produce_canary_events isn't running correctly

Event Timeline

BTullis created this task. May 19 2023, 5:07 PM

Restricted Application added a subscriber: Aklapper. May 19 2023, 5:07 PM

We have deployed a change

FWIW, this is not actually deployed. I ran into an issue with scala dependencies being missing. Will ping more folks about this now.

There are plans to migrate this job from a systemd timer to Airflow, so it's possible that this would be the preferred approach, rather than monitoring the systemd timer/service for staleness.

@mforns Indeed! And doing this for ProduceCanaryEvents will be easier than Refine, but it is similar to Refine in that it needs to dynamically discover the work to be done. Marcel and I had a prototype for this long ago.

Yes, the ProduceCanaryEvents Airflow DAG would have a similar structure to the Refine one.
The main difference IIUC is that the Refine DAG would need to take care of late data arrival,
whereas the ProduceCanaryEvents won't need that.
Late data arrival could be a difficult feature to implement.

So yea, working on the ProduceCanaryEvents DAG first makes sense no?

Would moving ProduceCanaryEvents to Airflow solve the scala dependency problems?

So yea, working on the ProduceCanaryEvents DAG first makes sense no?

Ya!

Would moving ProduceCanaryEvents to Airflow solve the scala dependency problems?

No, this is something in the refinery jar that I haven't looked into yet.

JArguello-WMF removed a project: Shared-Data-Infrastructure. Jun 29 2023, 1:44 PM

JArguello-WMF moved this task from Incoming (new tickets) to Event Platform Backlog on the Data-Engineering board. Jun 29 2023, 10:27 PM

JArguello-WMF removed a project: Data-Platform-SRE. Jun 29 2023, 10:56 PM

Ottomata mentioned this in T341229: ProduceCanaryEvents job should be scheduled by Airflow and/or a k8s service. Jul 6 2023, 1:51 PM

BTullis added a project: Data-Platform-SRE. Jul 15 2023, 12:03 AM

Gehel triaged this task as High priority. Oct 18 2023, 8:55 AM

Gehel added a parent task: T346438: [Epic] Review alerting strategy for Data Platform SRE.

Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.

bking subscribed. Oct 20 2023, 5:43 PM

Gehel moved this task from Misc to Observability on the Data-Platform-SRE board. Dec 6 2023, 1:21 PM

There was another repro of this situation on 2024-01-17.

TL;DR:
event.mediawiki_page_content_change_v1 and event.mediawiki_page_change_v1 were affected. For some reason the systemd unit stopped responding. @Ottomata killed and restarted it.

Suggestions were made to move the canary mechanism to either Airflow or k8s.

Details in Slack thread.

dcausse mentioned this in T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15. Jan 29 2024, 1:33 PM

Being bold and declining this, as producing canary events is now scheduled in Airflow.
