
toolforge-jobs: allow setting limits and requests
Closed, Resolved; Public; Feature

Description

Request: Allow setting CPU and memory limits and requests in the toolforge-jobs CLI.

Motivation: I have a job that will always get OOMKilled because it needs more than 512Mi of memory.

Event Timeline

JJMC89 changed the subtype of this task from "Task" to "Feature Request". (Sat, Jul 3, 5:36 PM)

Killed pod for reference:
---
api_version: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 192.168.68.110/32
    cni.projectcalico.org/podIPs: 192.168.68.110/32
    kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu, memory request for
      container bsicons-replacer-old; cpu, memory limit for container bsicons-replacer-old'
    kubernetes.io/psp: tool-jjmc89-bot-psp
    podpreset.admission.kubernetes.io/podpreset-mount-toolforge-vols: '77702203'
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
  cluster_name: null
  creation_timestamp: 2021-07-03 07:07:12+00:00
  deletion_grace_period_seconds: null
  deletion_timestamp: null
  finalizers: null
  generate_name: bsicons-replacer-old-1625296020-
  generation: null
  initializers: null
  labels:
    app.kubernetes.io/component: cronjobs
    app.kubernetes.io/created-by: jjmc89-bot
    app.kubernetes.io/managed-by: toolforge-jobs-framework
    app.kubernetes.io/name: bsicons-replacer-old
    app.kubernetes.io/version: '1'
    controller-uid: a1986946-4b7a-4a67-8929-598c67dadf88
    job-name: bsicons-replacer-old-1625296020
    toolforge: tool
  managed_fields:
  - api_version: v1
    fields: null
    manager: kube-controller-manager
    operation: Update
    time: 2021-07-03 07:07:12+00:00
  - api_version: v1
    fields: null
    manager: calico
    operation: Update
    time: 2021-07-03 07:07:13+00:00
  - api_version: v1
    fields: null
    manager: kubelet
    operation: Update
    time: 2021-07-03 12:03:30+00:00
  name: bsicons-replacer-old-1625296020-mnb4n
  namespace: tool-jjmc89-bot
  owner_references:
  - api_version: batch/v1
    block_owner_deletion: true
    controller: true
    kind: Job
    name: bsicons-replacer-old-1625296020
    uid: a1986946-4b7a-4a67-8929-598c67dadf88
  resource_version: '418099553'
  self_link: /api/v1/namespaces/tool-jjmc89-bot/pods/bsicons-replacer-old-1625296020-mnb4n
  uid: 0d0459dd-4b6f-4056-9ca4-58fb0db499a9
spec:
  active_deadline_seconds: null
  affinity: null
  automount_service_account_token: null
  containers:
  - args: null
    command:
    - jobs/bsicons-replacer-old
    env:
    - name: HOME
      value: /data/project/jjmc89-bot
      value_from: null
    env_from: null
    image: docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
    image_pull_policy: Always
    lifecycle: null
    liveness_probe: null
    name: bsicons-replacer-old
    ports: null
    readiness_probe: null
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 150m
        memory: 256Mi
    security_context:
      allow_privilege_escalation: false
      capabilities:
        add: null
        drop:
        - ALL
      privileged: null
      proc_mount: null
      read_only_root_filesystem: null
      run_as_group: 53000
      run_as_non_root: null
      run_as_user: 53000
      se_linux_options: null
      windows_options: null
    stdin: null
    stdin_once: null
    termination_message_path: /dev/termination-log
    termination_message_policy: File
    tty: null
    volume_devices: null
    volume_mounts:
    - mount_path: /data/project
      mount_propagation: null
      name: home
      read_only: null
      sub_path: null
      sub_path_expr: null
    - mount_path: /public/dumps
      mount_propagation: null
      name: dumps
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /mnt/nfs/dumps-labstore1007.wikimedia.org
      mount_propagation: null
      name: dumpsrc1
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /mnt/nfs/dumps-labstore1006.wikimedia.org
      mount_propagation: null
      name: dumpsrc2
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /etc/wmcs-project
      mount_propagation: null
      name: wmcs-project
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /data/scratch
      mount_propagation: null
      name: scratch
      read_only: null
      sub_path: null
      sub_path_expr: null
    - mount_path: /etc/ldap.conf
      mount_propagation: null
      name: etcldap-conf
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /etc/ldap.yaml
      mount_propagation: null
      name: etcldap-yaml
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /etc/novaobserver.yaml
      mount_propagation: null
      name: etcnovaobserver-yaml
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /var/lib/sss/pipes
      mount_propagation: null
      name: sssd-pipes
      read_only: null
      sub_path: null
      sub_path_expr: null
    - mount_path: /var/run/secrets/kubernetes.io/serviceaccount
      mount_propagation: null
      name: default-token-hdtmq
      read_only: true
      sub_path: null
      sub_path_expr: null
    working_dir: /data/project/jjmc89-bot
  dns_config: null
  dns_policy: ClusterFirst
  enable_service_links: true
  host_aliases: null
  host_ipc: null
  host_network: null
  host_pid: null
  hostname: null
  image_pull_secrets: null
  init_containers: null
  node_name: tools-k8s-worker-57
  node_selector: null
  preemption_policy: null
  priority: 0
  priority_class_name: null
  readiness_gates: null
  restart_policy: Never
  runtime_class_name: null
  scheduler_name: default-scheduler
  security_context:
    fs_group: 53000
    run_as_group: null
    run_as_non_root: null
    run_as_user: null
    se_linux_options: null
    supplemental_groups:
    - 1
    sysctls: null
    windows_options: null
  service_account: default
  service_account_name: default
  share_process_namespace: null
  subdomain: null
  termination_grace_period_seconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    toleration_seconds: 300
    value: null
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    toleration_seconds: 300
    value: null
  volumes:
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /data/project
      type: Directory
    iscsi: null
    name: home
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /public/dumps
      type: Directory
    iscsi: null
    name: dumps
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /mnt/nfs/dumps-labstore1007.wikimedia.org
      type: Directory
    iscsi: null
    name: dumpsrc1
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /mnt/nfs/dumps-labstore1006.wikimedia.org
      type: Directory
    iscsi: null
    name: dumpsrc2
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /etc/wmcs-project
      type: File
    iscsi: null
    name: wmcs-project
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /data/scratch
      type: Directory
    iscsi: null
    name: scratch
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /etc/ldap.conf
      type: File
    iscsi: null
    name: etcldap-conf
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /etc/ldap.yaml
      type: File
    iscsi: null
    name: etcldap-yaml
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /etc/novaobserver.yaml
      type: File
    iscsi: null
    name: etcnovaobserver-yaml
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /var/lib/sss/pipes
      type: Directory
    iscsi: null
    name: sssd-pipes
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path: null
    iscsi: null
    name: default-token-hdtmq
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret:
      default_mode: 420
      items: null
      optional: null
      secret_name: default-token-hdtmq
    storageos: null
    vsphere_volume: null
status:
  conditions:
  - last_probe_time: null
    last_transition_time: 2021-07-03 07:07:12+00:00
    message: null
    reason: null
    status: 'True'
    type: Initialized
  - last_probe_time: null
    last_transition_time: 2021-07-03 12:03:30+00:00
    message: 'containers with unready status: [bsicons-replacer-old]'
    reason: ContainersNotReady
    status: 'False'
    type: Ready
  - last_probe_time: null
    last_transition_time: 2021-07-03 12:03:30+00:00
    message: 'containers with unready status: [bsicons-replacer-old]'
    reason: ContainersNotReady
    status: 'False'
    type: ContainersReady
  - last_probe_time: null
    last_transition_time: 2021-07-03 07:07:12+00:00
    message: null
    reason: null
    status: 'True'
    type: PodScheduled
  container_statuses:
  - container_id: docker://c7ce910349a993b98434ce9da21a8903b2b8ad82b14534d2d692a0ed1c670475
    image: docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
    image_id: docker-pullable://docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base@sha256:e42965a00ec91f52d051723277b81cce5de8339146d9010c3e735e0924fcc4a5
    last_state:
      running: null
      terminated: null
      waiting: null
    name: bsicons-replacer-old
    ready: false
    restart_count: 0
    state:
      running: null
      terminated:
        container_id: docker://c7ce910349a993b98434ce9da21a8903b2b8ad82b14534d2d692a0ed1c670475
        exit_code: 137
        finished_at: 2021-07-03 12:03:29+00:00
        message: null
        reason: OOMKilled
        signal: null
        started_at: 2021-07-03 07:07:14+00:00
      waiting: null
  host_ip: 172.16.1.183
  init_container_statuses: null
  message: null
  nominated_node_name: null
  phase: Failed
  pod_ip: 192.168.68.110
  qos_class: Burstable
  reason: null
  start_time: 2021-07-03 07:07:12+00:00
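
The part of the dump above that matters is the terminated state (exit code 137, reason OOMKilled) together with the 512Mi memory limit on the container. Below is a minimal sketch of pulling that out programmatically, assuming the dump has been loaded into a plain Python dict with the same snake_case keys. The function name `oom_killed_containers` is made up for illustration and is not part of any Toolforge tooling.

```python
def oom_killed_containers(pod):
    """Return (container name, memory limit) for each container that was OOMKilled."""
    # Map each container name to its memory limit; None if no limit is set.
    limits = {
        c["name"]: ((c.get("resources") or {}).get("limits") or {}).get("memory")
        for c in pod["spec"]["containers"]
    }
    killed = []
    for status in pod["status"].get("container_statuses") or []:
        terminated = (status.get("state") or {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            killed.append((status["name"], limits.get(status["name"])))
    return killed


# Trimmed-down copy of the dump above, keeping only the fields the function reads.
pod = {
    "spec": {
        "containers": [
            {
                "name": "bsicons-replacer-old",
                "resources": {
                    "limits": {"cpu": "500m", "memory": "512Mi"},
                    "requests": {"cpu": "150m", "memory": "256Mi"},
                },
            }
        ]
    },
    "status": {
        "container_statuses": [
            {
                "name": "bsicons-replacer-old",
                "state": {
                    "running": None,
                    "terminated": {"exit_code": 137, "reason": "OOMKilled"},
                    "waiting": None,
                },
            }
        ]
    },
}

print(oom_killed_containers(pod))  # [('bsicons-replacer-old', '512Mi')]
```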

I don't see any other option here but to add additional parameters to our REST API. Will do!
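
Whatever shape the new parameters take, the values would presumably be Kubernetes quantity strings like the ones in the pod above (`512Mi`, `500m`). Here is a rough sketch of validating such strings before they end up in a container spec. All names here (`parse_memory`, `parse_cpu`, `container_limits`) are hypothetical and not taken from the actual patch, and only the common suffixes are handled.

```python
from decimal import Decimal

# Binary (power-of-two) and decimal suffixes used by Kubernetes memory quantities.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' to a number of bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # plain number of bytes


def parse_cpu(quantity: str) -> Decimal:
    """Convert a CPU quantity such as '500m' or '1' to a number of cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000  # millicores
    return Decimal(quantity)


def container_limits(mem: str, cpu: str) -> dict:
    """Build the 'limits' part of a container 'resources' block (hypothetical helper).

    How 'requests' should be derived is left out; the thread does not say.
    """
    if parse_memory(mem) <= 0 or parse_cpu(cpu) <= 0:
        raise ValueError("mem and cpu must be positive quantities")
    return {"limits": {"memory": mem, "cpu": cpu}}


print(parse_memory("512Mi"))         # 536870912
print(parse_cpu("500m"))             # 0.5
print(container_limits("1Gi", "1"))  # {'limits': {'memory': '1Gi', 'cpu': '1'}}
```

Parsing to numbers first means a bad value is rejected before it reaches Kubernetes, and it makes comparisons against a quota straightforward.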

Change 704997 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] jobs: add memory and cpu parameters

https://gerrit.wikimedia.org/r/704997

Change 705002 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-cli@master] toolforge-jobs: add support for job resource limits

https://gerrit.wikimedia.org/r/705002

Change 705002 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-cli@master] toolforge-jobs: add support for job resource limits

https://gerrit.wikimedia.org/r/705002

Change 704997 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] jobs: add memory and cpu parameters

https://gerrit.wikimedia.org/r/704997

Mentioned in SAL (#wikimedia-cloud) [2021-07-20T17:45:50Z] <arturo> pushed new toolforge-jobs-framework-api docker image into the registry (3a6ae38d51202c5c765c8d800cb8380e2a20b998) (T286126)

aborrero claimed this task.