
Set requests (not limits) for cirrus-streaming-updater in k8s
Closed, ResolvedPublic

Description

See https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

The important thing to know is that requests tell the k8s scheduler that
a host must have X CPU/RAM available before it will schedule a pod on that host.

They do *not* throttle CPU or cause OOMKills;
those roles are handled by k8s limits, which are out of scope for this ticket.
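As a minimal sketch (hypothetical values), a container spec with requests but no limits looks like this — the scheduler reserves the capacity, but no throttling or OOMKill thresholds are imposed:

```yaml
resources:
  requests:
    cpu: "1"        # host must have 1 CPU free to schedule this pod
    memory: 1000Mi  # host must have 1000Mi free to schedule this pod
  # no "limits" key: no CPU throttling, no memory-based OOMKills from k8s
```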

AC

  • Requests set for any cirrus-streaming-updater containers

Event Timeline

Gehel triaged this task as Medium priority.Oct 11 2023, 8:39 AM

@RKemper This is what I found.

I started to take a look at our pods. The flink-main-container of our flink-producer has the following resources set:

resources:
  limits:
    cpu: "1"
    memory: 1000Mi
  requests:
    cpu: "1"
    memory: 1000Mi

The flink-main-container of our task manager pod has the following resources:

resources:
  limits:
    cpu: "1"
    memory: 2000Mi
  requests:
    cpu: "1"
    memory: 2000Mi

It seems that app.{jobManager,taskManager}.resources.{memory,cpu} are mapped to both limits and requests.

I had a look at the operator code to see where the limits and requests are computed, since we only pass a single cpu/memory "request". I found this, in which we call out to [[ https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/kubernetes/utils/KubernetesUtils.html | KubernetesUtils.getResourceRequirements ]], with memory and cpu limit factors. My understanding is that you can, for example, ask for a limit that is 2x your request.
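To illustrate the limit-factor arithmetic (hypothetical numbers, not our current config): with a memory limit-factor of 2.0, a request of 1000Mi would produce

```yaml
resources:
  requests:
    memory: 1000Mi
  limits:
    memory: 2000Mi  # request × limit-factor (1000Mi × 2.0)
```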

I'm assuming that the default limit factor is 1. Let's check. This is how we get the memory limit factor, which is read from KubernetesConfigOptions.TASK_MANAGER_MEMORY_LIMIT_FACTOR.

After a bit of Googling, I found my answer in https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/:

Screenshot 2024-02-12 at 17.25.52.png (476×1 px, 94 KB)

If we're not happy with the requests we already have, we can change kubernetes.jobmanager.cpu.limit-factor and kubernetes.jobmanager.memory.limit-factor by overriding the defaultConfiguration.flink-conf.yaml configuration of our app. If not, well, we already have requests.
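A sketch of what that override might look like, assuming the chart exposes flink-conf.yaml entries under defaultConfiguration (exact nesting may differ in our app's values):

```yaml
defaultConfiguration:
  flink-conf.yaml:
    # limits become request × factor; the defaults are 1.0 per the Flink docs
    kubernetes.jobmanager.cpu.limit-factor: "2.0"
    kubernetes.jobmanager.memory.limit-factor: "2.0"
```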

Should we tweak the limit factors, or should we close this? @RKemper @bking

Gehel added a subscriber: brouberol.

Moving back to in progress (this isn't really blocked, just waiting for an answer / discussion within our team) and assigning to @bking to get his attention.

I'm OK with closing this one for now, as we haven't run into resource issues in production yet (we did see some problems in staging, but those have been fixed). We can revisit if we run into more problems.