We'd like to have automated monitoring and alerts configured for flink-app deployments.
- Adapt and extend the existing Flink Cluster Grafana dashboard into a shared flink-app / pyflink enrichment dashboard usable by both Search and Event Platform. It should include metrics such as latency, lag, throughput, and memory usage.
- If needed, add missing metrics to the enrichment Flink apps.
A separate task will define alerts (aliveness, latency, lag, throughput, etc.) for the mediawiki-page-content-change-enrichment job. If we can define them generically for any enrichment job, we should, but how easy that is remains TBD.
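
To make the "generic for any enrichment job" idea more concrete, here is a rough Python sketch of alert conditions parameterized by job name rather than hard-coded for one job. Everything in it is an assumption for illustration: the `JobSample` and `Thresholds` types, their field names, and the threshold values are hypothetical and do not correspond to actual Flink metric names or to any agreed limits. Real alerts would most likely live in the alerting system's own rule format, not in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSample:
    """One snapshot of metrics for a single enrichment job (hypothetical fields)."""
    job_name: str
    running: bool               # aliveness: is the job currently running?
    latency_ms: float           # processing latency
    consumer_lag: int           # records behind the head of the source topic
    records_per_second: float   # throughput


@dataclass(frozen=True)
class Thresholds:
    """Placeholder limits; real values would need to come from observed baselines."""
    max_latency_ms: float = 5_000.0
    max_consumer_lag: int = 10_000
    min_records_per_second: float = 1.0


def evaluate_alerts(sample: JobSample, limits: Thresholds = Thresholds()) -> list[str]:
    """Return a human-readable alert message for each violated condition."""
    # If the job is down, the other metrics are meaningless, so report only that.
    if not sample.running:
        return [f"{sample.job_name}: job is not running"]

    alerts = []
    if sample.latency_ms > limits.max_latency_ms:
        alerts.append(f"{sample.job_name}: latency above {limits.max_latency_ms} ms")
    if sample.consumer_lag > limits.max_consumer_lag:
        alerts.append(f"{sample.job_name}: lag above {limits.max_consumer_lag} records")
    if sample.records_per_second < limits.min_records_per_second:
        alerts.append(
            f"{sample.job_name}: throughput below {limits.min_records_per_second} records/s"
        )
    return alerts
```

Because the job name is just a field on the sample, the same conditions would apply to mediawiki-page-content-change-enrichment or any other enrichment job, with per-job `Thresholds` passed in where the defaults do not fit.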