We have some doubts about how the logs and metrics produced by our Flink pipelines reach Grafana and OpenSearch.
We need to ensure that the logs produced both by Flink itself and by our Python code appear in OpenSearch.
We also need to ensure that all metrics show up in Grafana. We already have a Flink dashboard there, but other dashboards show interesting metrics related to our pipelines, such as the number of calls to the HTTP APIs.
Task is done if:
- All logs appear in OpenSearch and we have documented it.
- We have documented where the metrics appear in Grafana.
Ideally, we could improve on this further:
- Ensure we use a common format for logs, such as JSON, and that the fields produced are parsed by Logstash/OpenSearch.
- Ensure there is an Index Pattern that understands our logs and extracts useful fields.
- We build an OpenSearch dashboard for the HTML (or general Flink?) pipelines, with important counts and grouped metrics, such as:
  - How many messages were rejected/failed, grouped by reason.
  - How many messages were rejected/failed, grouped by "change_type_kind".
  - ... TBD
- We build a Grafana dashboard that shows all the important metrics for our HTML (or all Flink?) pipelines together:
  - General Flink status information
  - Number of calls to the HTTP APIs, with the response codes returned
  - Latency of the HTTP API calls
  - Average end-to-end latency of messages (from source to sink)
  - Average latency split by "change_type_kind" (to know if there are issues calling the APIs for specific change types, e.g. not hitting the cache)
  - Average latency split by Flink processing step (to know if one step is slower than the others)
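The JSON-format idea above could be sketched with the Python standard library alone. This is only a minimal illustration, not our actual logging setup: the logger name `html_pipeline` and the extra fields (`change_type_kind`, `pipeline_step`) are hypothetical placeholders for whatever we standardize on.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line so Logstash/OpenSearch
    can index the fields without additional parsing rules."""

    # Hypothetical extra fields our pipeline code might attach via `extra=...`.
    EXTRA_FIELDS = ("change_type_kind", "pipeline_step")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy the known extra fields onto the JSON payload if present.
        for key in self.EXTRA_FIELDS:
            if key in record.__dict__:
                payload[key] = record.__dict__[key]
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("html_pipeline")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Each call emits one JSON object per line on stderr.
logger.info("message rejected", extra={"change_type_kind": "update"})
```

With every record emitted as one JSON object per line, the Index Pattern work reduces to mapping known field names instead of maintaining grok/regex parsers.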
Maybe some of these ideas can be split into separate tickets; some of them might require producing additional custom logs or metrics.
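To make the dashboard groupings concrete, here is a small sketch of the aggregations the panels would compute: rejected/failed counts and average latency, each grouped by "change_type_kind". The record shape (`latency_ms`, `status`) is an assumption for illustration; in practice these would be computed by OpenSearch/Grafana queries over our logs and metrics, not by Python code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-message records, as our pipeline might emit them.
records = [
    {"change_type_kind": "create", "latency_ms": 120, "status": "ok"},
    {"change_type_kind": "create", "latency_ms": 80, "status": "ok"},
    {"change_type_kind": "delete", "latency_ms": 900, "status": "rejected"},
    {"change_type_kind": "delete", "latency_ms": 40, "status": "ok"},
]


def avg_latency_by(records, key):
    """Average latency grouped by an arbitrary field (e.g. change_type_kind)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["latency_ms"])
    return {k: mean(v) for k, v in groups.items()}


def rejected_counts_by(records, key):
    """Count messages that were not processed successfully, grouped by a field."""
    counts = defaultdict(int)
    for r in records:
        if r["status"] != "ok":
            counts[r[key]] += 1
    return dict(counts)


print(avg_latency_by(records, "change_type_kind"))     # {'create': 100, 'delete': 470}
print(rejected_counts_by(records, "change_type_kind")) # {'delete': 1}
```

The same `key` argument would cover the "by reason" and "by processing step" panels, provided those fields are present in the emitted logs/metrics.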