Today, we noticed that the flink-app-consumer-cloudelastic taskmanagers had been down for about 16 hours. The Flink jobmanager marked the job as FAILED, but the Flink Kubernetes operator did not attempt to start a new job.
Creating this ticket to stabilize the service. Possible solutions include (but are not limited to):
- Set the kubernetes.operator.job.restart.failed configuration value in the flink-operator helm chart, as described in the Flink Kubernetes Operator docs, so the operator restarts jobs whose observed status is FAILED
- Allocate more RAM to the consumer-cloudelastic TaskManagers, in case the failures are memory-related
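For the first option, a minimal sketch of what the helm values change might look like, assuming the chart exposes operator defaults via a defaultConfiguration block (the exact structure and key names should be verified against the chart version we deploy):

```yaml
# values.yaml for the flink-kubernetes-operator helm chart
# (structure assumed; verify against the deployed chart version)
defaultConfiguration:
  create: true
  # append: true merges these settings into the operator's default config
  # instead of replacing it
  append: true
  flink-conf.yaml: |+
    # Have the operator restart jobs it observes in FAILED state
    kubernetes.operator.job.restart.failed: true
```

Note that this only covers jobs the operator observes as FAILED; it does not address why the taskmanagers went down in the first place, so it should be paired with a root-cause investigation (e.g. the memory option above).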