As part of the GA internal release for T307959, we should set up (ideally automated) alerts for deployed Flink-based enrichment jobs. Specifically, we need alerts now that tell us when we are not meeting the SLOs for this job.
Since the metrics for these kinds of jobs should be standardized, it would be nice to define these alerts in an automated/parameterized way, or at least to have a documented process that can be repeated whenever a new enrichment job is created and deployed.
Alerts might include things like:
- Input / output throughput ratio (should match some %)?
- error event rate
- lag / backpressure
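To make the "parameterized" idea concrete, here is a minimal sketch of generating the job-level alerts above from a per-job set of thresholds. It assumes a Prometheus-style rule format, which this task does not specify. The metric names (`enrichment_input_events_total`, `enrichment_output_events_total`, `enrichment_error_events_total`, `enrichment_consumer_lag_seconds`), the `job_name` label, and all default thresholds are hypothetical placeholders, not real metrics or agreed SLOs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EnrichmentSlo:
    """Hypothetical SLO thresholds for one enrichment job."""
    job_name: str
    min_output_ratio: float = 0.99  # output events / input events
    max_error_ratio: float = 0.01   # error events / input events
    max_lag_seconds: int = 300


def enrichment_alert_rules(slo: EnrichmentSlo) -> List[Dict]:
    """Build Prometheus-style alert rule dicts for a single job.

    All metric names here are placeholders chosen for illustration.
    """
    sel = f'{{job_name="{slo.job_name}"}}'
    rate_in = f"sum(rate(enrichment_input_events_total{sel}[5m]))"
    rate_out = f"sum(rate(enrichment_output_events_total{sel}[5m]))"
    rate_err = f"sum(rate(enrichment_error_events_total{sel}[5m]))"
    lag = f"max(enrichment_consumer_lag_seconds{sel})"

    def rule(name: str, expr: str, severity: str) -> Dict:
        # One rule per alert; "for" avoids firing on short blips.
        return {
            "alert": name,
            "expr": expr,
            "for": "10m",
            "labels": {"severity": severity, "job_name": slo.job_name},
        }

    return [
        rule("EnrichmentThroughputRatioLow",
             f"{rate_out} / {rate_in} < {slo.min_output_ratio}", "warning"),
        rule("EnrichmentErrorRateHigh",
             f"{rate_err} / {rate_in} > {slo.max_error_ratio}", "critical"),
        rule("EnrichmentLagHigh",
             f"{lag} > {slo.max_lag_seconds}", "warning"),
    ]
```

With something like this, a new enrichment job would only need one `EnrichmentSlo(job_name=...)` entry, and the resulting dicts could be serialized into whatever rule format the alerting system expects.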
There are also Flink-level operational alerts. These will likely apply to all Flink apps in k8s that use the flink-app helm chart (and the flink-kubernetes-operator).
- Number of active TaskManagers and JobManagers
- Job state
- failed checkpoints
- checkpoint rate
- failovers?
- watermark lag?
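As a rough illustration of how the Flink-level checks above could be evaluated uniformly for any app, the sketch below takes a point-in-time snapshot and returns the names of the alerts that would fire. The `FlinkAppSnapshot` fields, the alert names, and the thresholds are all hypothetical. How the values would actually be collected (e.g. from Flink metrics or the operator) is left open, and only the `RUNNING` job state is taken from Flink itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FlinkAppSnapshot:
    """Hypothetical point-in-time view of one Flink app (illustrative fields)."""
    job_state: str
    active_task_managers: int
    active_job_managers: int
    failed_checkpoints_delta: int     # failed checkpoints since last snapshot
    completed_checkpoints_delta: int  # completed checkpoints since last snapshot
    restarts_delta: int               # job restarts (failovers) since last snapshot
    watermark_lag_seconds: float


def firing_alerts(
    snap: FlinkAppSnapshot,
    expected_task_managers: int,
    expected_job_managers: int = 1,
    max_watermark_lag_seconds: float = 600.0,
) -> List[str]:
    """Return the names of the operational alerts that fire for this snapshot."""
    alerts = []
    if snap.job_state != "RUNNING":
        alerts.append("FlinkJobNotRunning")
    if snap.active_task_managers < expected_task_managers:
        alerts.append("FlinkTaskManagersMissing")
    if snap.active_job_managers < expected_job_managers:
        alerts.append("FlinkJobManagersMissing")
    if snap.failed_checkpoints_delta > 0:
        alerts.append("FlinkCheckpointsFailing")
    if snap.completed_checkpoints_delta == 0:
        # No completed checkpoint in the interval: checkpoint rate is zero.
        alerts.append("FlinkCheckpointsStalled")
    if snap.restarts_delta > 0:
        alerts.append("FlinkJobRestarted")
    if snap.watermark_lag_seconds > max_watermark_lag_seconds:
        alerts.append("FlinkWatermarkLagHigh")
    return alerts
```

Because only the expected replica counts and the lag threshold vary per app, the same checks could in principle be shared by every app deployed with the flink-app helm chart.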
The above are just guesses at the kinds of things Flink app maintainers would like to be alerted on. We probably don't need all of them before the GA release of T307959, but we should set up the important ones (e.g. whether the app is running, error rate).