Stabilize "consumer-cloudelastic" Search Update Pipeline job
Closed, Resolved, Public

Description

Today, we noticed that the flink-app-consumer-cloudelastic taskmanagers were down for about 16 hours. The flink jobmanager marked the job as FAILED, but the flink kubernetes operator did not try to start a new job.

Creating this ticket to stabilize the service. Possible solutions include (but are not limited to):

  • Set the kubernetes.operator.job.restart flink-operator helm chart configuration value, as described in the Flink docs
  • Allocate more RAM to consumer-cloudelastic taskManagers
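For illustration, the two options above could look roughly like the following FlinkDeployment fragment. This is a sketch under assumptions, not the change that was deployed: it assumes the option referred to is the operator's `kubernetes.operator.job.restart.failed` setting, it assumes the resource is named after the `flink-app-consumer-cloudelastic` pods, and the memory value is a placeholder. How these values are exposed through the deployment-charts Helm values is not shown in this ticket.

```yaml
# Hypothetical fragment; the resource name and memory value are illustrative only.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-app-consumer-cloudelastic  # assumed name, inferred from the pod names
spec:
  flinkConfiguration:
    # Ask the operator to resubmit a job that has reached the FAILED state.
    kubernetes.operator.job.restart.failed: "true"
  taskManager:
    resource:
      memory: "3000m"  # placeholder; the real value is in change 1017296
```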

Event Timeline

Change #1017115 had a related patch set uploaded (by Ryan Kemper; author: Bking):

[operations/deployment-charts@master] flink-kubernetes-operator: restart failed jobs

https://gerrit.wikimedia.org/r/1017115

bking changed the task status from Open to In Progress. Apr 4 2024, 8:19 PM
bking claimed this task.
bking triaged this task as Medium priority.
bking updated Other Assignee, added: RKemper.
bking renamed this task from Enable flink-operator's "restart jobs" feature to Stabilize "consumer-cloudelastic" Search Update Pipeline job. Apr 5 2024, 2:29 PM
bking updated the task description.

Change #1017296 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] cirrus-streaming-updater: Increase taskManager memory for cloudelastic job

https://gerrit.wikimedia.org/r/1017296

Change #1017296 merged by jenkins-bot:

[operations/deployment-charts@master] cirrus-streaming-updater: Increase taskManager memory for cloudelastic job

https://gerrit.wikimedia.org/r/1017296

The consumer seems generally stable. Getting there involved both changes to the application for better error handling and the increase in taskmanager memory above. The pods had been running uninterrupted for a week until we brought them down yesterday to verify some new alerting.

Change #1017115 abandoned by Bking:

[operations/deployment-charts@master] flink-kubernetes-operator: restart failed jobs

Reason:

Not needed to fix problems detailed in ticket; we can always make a new CR if needed in the future

https://gerrit.wikimedia.org/r/1017115

Per @EBernhardson's comment above, I'm going to resolve the ticket. @dcausse or anyone else, if you think we need to look into this some more, feel free to re-open.