Stabilize "consumer-cloudelastic" Search Update Pipeline job
Closed, ResolvedPublic

Assigned To

Authored By

	bking
	Apr 4 2024, 7:10 PM

Description

Today, we noticed that the flink-app-consumer-cloudelastic taskmanagers were down for about 16 hours. The flink jobmanager marked the job as FAILED, but the flink kubernetes operator did not try to start a new job.

Creating this ticket to stabilize the service. Possible solutions include (but are not limited to):

Set the kubernetes.operator.job.restart flink-operator helm chart configuration value, as described in the Flink docs
Allocate more RAM to consumer-cloudelastic taskManagers
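
As a rough illustration of the two options above, a FlinkDeployment resource could carry settings along these lines. This is only a sketch, not the change that was uploaded for this ticket. The full option name `kubernetes.operator.job.restart.failed` comes from the upstream Flink Kubernetes Operator documentation rather than from this task, the memory value is a placeholder because the ticket does not state the actual figure, and the place where the deployment-charts repository sets these values may differ.

```yaml
# Hypothetical fragment of a FlinkDeployment spec; values are illustrative only.
spec:
  flinkConfiguration:
    # Assumption: ask the operator to restart a job that ended in FAILED state.
    kubernetes.operator.job.restart.failed: "true"
  taskManager:
    resource:
      # Placeholder value; the real increase is in the merged Gerrit change.
      memory: "3072m"
```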

Details

Other Assignee: RKemper

	Subject	Repo	Branch	Lines +/-
	flink-kubernetes-operator: restart failed jobs	operations/deployment-charts	master	+1 -1
	cirrus-streaming-updater: Increase taskManager memory for cloudelastic job	operations/deployment-charts	master	+3 -2


Event Timeline

bking created this task. Apr 4 2024, 7:10 PM

Restricted Application added a subscriber: Aklapper. Apr 4 2024, 7:10 PM

Change #1017115 had a related patch set uploaded (by Ryan Kemper; author: Bking):

[operations/deployment-charts@master] flink-kubernetes-operator: restart failed jobs

https://gerrit.wikimedia.org/r/1017115

gerritbot added a project: Patch-For-Review. Apr 4 2024, 7:11 PM

bking changed the task status from Open to In Progress. Apr 4 2024, 8:19 PM

bking claimed this task.

bking triaged this task as Medium priority.

bking moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.

bking updated Other Assignee, added: RKemper.

bking renamed this task from Enable flink-operator's "restart jobs" feature to Stabilize "consumer-cloudelastic" Search Update Pipeline job. Apr 5 2024, 2:29 PM

bking updated the task description.

Change #1017296 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] cirrus-streaming-updater: Increase taskManager memory for cloudelastic job

https://gerrit.wikimedia.org/r/1017296

Change #1017296 merged by jenkins-bot:

[operations/deployment-charts@master] cirrus-streaming-updater: Increase taskManager memory for cloudelastic job

https://gerrit.wikimedia.org/r/1017296

Gehel edited projects, added Data-Platform-SRE (2024.04.15 - 2024.05.05); removed Data-Platform-SRE (2024.03.25 - 2024.04.14). Apr 15 2024, 12:39 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.

Gehel moved this task from needs triage to Current work on the Discovery-Search board. Apr 15 2024, 3:20 PM

Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

The consumer seems generally stable. Getting there involved both application changes for better error handling and the taskmanager memory increase above. The pods had run uninterrupted for a week until we brought them down yesterday to verify some new alerting.

Change #1017115 abandoned by Bking:

[operations/deployment-charts@master] flink-kubernetes-operator: restart failed jobs

Reason:

Not needed to fix problems detailed in ticket; we can always make a new CR if needed in the future

https://gerrit.wikimedia.org/r/1017115

Maintenance_bot removed a project: Patch-For-Review. Apr 30 2024, 7:31 PM

Per @EBernhardson's comment above, I'm going to resolve the ticket. @dcausse or anyone else, if you think we need to look into this some more, feel free to re-open.
