refill-api memory usage is somewhat dangerous to the Kubernetes cluster on Toolforge
Closed, Resolved · Public

Description

Over the past couple of days, we have noticed Kubernetes nodes periodically becoming crippled: the OOM killer starts stopping processes at random and the node breaks down until it manages to recover, maybe 20 minutes later (and sometimes it won't recover without a reboot).

This is quite clearly caused by spikes in RAM consumption by the celery container in the deployment (see the attached Prometheus graph).

Screen Shot 2020-01-22 at 12.11.30 PM.png (attached Prometheus graph of the memory spikes)
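
For reference, here is a rough way to see the same numbers from the command line; the metric name and the availability of the metrics API are assumptions, so treat this as a sketch rather than the exact query behind that graph:

# Per-container live memory usage for the tool's pods (needs a working metrics API):
kubectl top pod -n refill-api --containers

# Roughly the kind of Prometheus query behind the graph above (illustrative only):
#   sum(container_memory_working_set_bytes{namespace="refill-api", container="celery"})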

This needs a resources section for that container in the deployment manifest.

The celery container now looks like:

- name: celery
  image: docker-registry.tools.wmflabs.org/toollabs-python-base:latest
  imagePullPolicy: Always
  command: ["bash", "-c"]
  args: ["source $HOME/www/python/venv/bin/activate && cd $HOME/www/python/src && celery --autoscale=100,10 worker"]
  env:
    - name: HOME
      value: /data/project/refill-api
  volumeMounts:
  - mountPath: /public/dumps/
    name: dumps
  - mountPath: /data/project/
    name: home
  - mountPath: /etc/wmcs-project
    name: wmcs-project
  - mountPath: /data/scratch/
    name: scratch
  terminationMessagePath: /dev/termination-log
  workingDir: /data/project/refill-api/

I suggest you make it:

- name: celery
  image: docker-registry.tools.wmflabs.org/toollabs-python-base:latest
  imagePullPolicy: Always
  resources:
    limits:
      cpu: "2"
      memory: 3Gi
    requests:
      cpu: "1"
      memory: 2Gi
  command: ["bash", "-c"]
  args: ["source $HOME/www/python/venv/bin/activate && cd $HOME/www/python/src && celery --autoscale=100,10 worker"]
  env:
    - name: HOME
      value: /data/project/refill-api
  volumeMounts:
  - mountPath: /public/dumps/
    name: dumps
  - mountPath: /data/project/
    name: home
  - mountPath: /etc/wmcs-project
    name: wmcs-project
  - mountPath: /data/scratch/
    name: scratch
  terminationMessagePath: /dev/termination-log
  workingDir: /data/project/refill-api/

This will prevent the container from being scheduled on nodes that are resource-constrained and will cap any spikes at 3Gi. At that level it can still cause harm, but it is much less likely to do so than it is now.
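
With a memory limit set, a runaway spike should get the celery container itself OOM-killed and restarted instead of taking out arbitrary processes on the node. Once the change is applied, something like the following can confirm the values were picked up and show how the requests feed into scheduling decisions (pod and node names are placeholders):

# Check that the resources stanza actually landed on the running pod:
kubectl -n refill-api describe pod <pod-name> | grep -A 4 -E 'Limits|Requests'

# See how much of a node's allocatable capacity is already spoken for by requests:
kubectl describe node <node-name> | grep -A 8 'Allocated resources'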

Event Timeline

Bstorm created this task.

If this is moved to the new Kubernetes cluster (as described at https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration), there are automatic limits on how much RAM the worker can consume, but it will still need to set a higher requests value than the default or it can overrun the node's RAM. The requests part of a container definition is what the scheduler uses to decide whether there is room on the node the pod is being placed on.
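
If anyone wants to see what those automatic limits are, they live in the tool's namespace on the new cluster; something like this should show them (the tool-<name> namespace convention is an assumption here):

# Inspect the per-namespace defaults and caps on the new cluster (namespace name assumed):
kubectl get limitrange -n tool-refill-api -o yaml
kubectl get resourcequota -n tool-refill-api -o yaml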

Mentioned in SAL (#wikimedia-cloud) [2020-01-24T21:23:21Z] <bstorm_> Added resources limit range to the celery container per T243465

After waiting a couple days, I took the liberty of modifying the deployment in place.

kubectl get pods -n refill-api
NAME                          READY     STATUS    RESTARTS   AGE
refill-api-1587887294-u46mh   2/2       Running   0          57s

The pods are running now with my proposed limit-range in place. I'll dump the yaml for that into the tool directory so that you can restart your tool from it with those settings intact.
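
For the record, restarting from that saved manifest later would look roughly like this (the file name and path are illustrative; adjust to wherever the yaml ends up):

# Re-create the tool's Kubernetes objects from the saved manifest (path illustrative):
kubectl delete -f /data/project/refill-api/refill.yaml
kubectl create -f /data/project/refill-api/refill.yaml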

Bstorm claimed this task.

I've edited the refill.yaml file to include the new setting and saved the old file to refill.yaml.old.

Please do not remove the limits on it for the sake of the cluster.