Request: Allow setting CPU and memory limits and requests in the toolforge-jobs cli.
Motivation: I have a job that will always get OOMKilled because it needs more than 512Mi of memory.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | JHedden | T251027 "signatures" tool has failed job pods on Kubernetes cluster
Resolved | | aborrero | T251917 Design the Jobs service in k8s
Resolved | | aborrero | T283238 Toolforge: develop jobs-framework-api
Resolved | | aborrero | T285944 Toolforge: beta phase for the new jobs framework
Resolved | Feature | aborrero | T286126 toolforge-jobs: allow setting limits and requests
---
api_version: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 192.168.68.110/32
    cni.projectcalico.org/podIPs: 192.168.68.110/32
    kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu, memory request for container bsicons-replacer-old; cpu, memory limit for container bsicons-replacer-old'
    kubernetes.io/psp: tool-jjmc89-bot-psp
    podpreset.admission.kubernetes.io/podpreset-mount-toolforge-vols: '77702203'
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
  cluster_name: null
  creation_timestamp: 2021-07-03 07:07:12+00:00
  deletion_grace_period_seconds: null
  deletion_timestamp: null
  finalizers: null
  generate_name: bsicons-replacer-old-1625296020-
  generation: null
  initializers: null
  labels:
    app.kubernetes.io/component: cronjobs
    app.kubernetes.io/created-by: jjmc89-bot
    app.kubernetes.io/managed-by: toolforge-jobs-framework
    app.kubernetes.io/name: bsicons-replacer-old
    app.kubernetes.io/version: '1'
    controller-uid: a1986946-4b7a-4a67-8929-598c67dadf88
    job-name: bsicons-replacer-old-1625296020
    toolforge: tool
  managed_fields:
  - api_version: v1
    fields: null
    manager: kube-controller-manager
    operation: Update
    time: 2021-07-03 07:07:12+00:00
  - api_version: v1
    fields: null
    manager: calico
    operation: Update
    time: 2021-07-03 07:07:13+00:00
  - api_version: v1
    fields: null
    manager: kubelet
    operation: Update
    time: 2021-07-03 12:03:30+00:00
  name: bsicons-replacer-old-1625296020-mnb4n
  namespace: tool-jjmc89-bot
  owner_references:
  - api_version: batch/v1
    block_owner_deletion: true
    controller: true
    kind: Job
    name: bsicons-replacer-old-1625296020
    uid: a1986946-4b7a-4a67-8929-598c67dadf88
  resource_version: '418099553'
  self_link: /api/v1/namespaces/tool-jjmc89-bot/pods/bsicons-replacer-old-1625296020-mnb4n
  uid: 0d0459dd-4b6f-4056-9ca4-58fb0db499a9
spec:
  active_deadline_seconds: null
  affinity: null
  automount_service_account_token: null
  containers:
  - args: null
    command:
    - jobs/bsicons-replacer-old
    env:
    - name: HOME
      value: /data/project/jjmc89-bot
      value_from: null
    env_from: null
    image: docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
    image_pull_policy: Always
    lifecycle: null
    liveness_probe: null
    name: bsicons-replacer-old
    ports: null
    readiness_probe: null
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 150m
        memory: 256Mi
    security_context:
      allow_privilege_escalation: false
      capabilities:
        add: null
        drop:
        - ALL
      privileged: null
      proc_mount: null
      read_only_root_filesystem: null
      run_as_group: 53000
      run_as_non_root: null
      run_as_user: 53000
      se_linux_options: null
      windows_options: null
    stdin: null
    stdin_once: null
    termination_message_path: /dev/termination-log
    termination_message_policy: File
    tty: null
    volume_devices: null
    volume_mounts:
    - mount_path: /data/project
      mount_propagation: null
      name: home
      read_only: null
      sub_path: null
      sub_path_expr: null
    - mount_path: /public/dumps
      mount_propagation: null
      name: dumps
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /mnt/nfs/dumps-labstore1007.wikimedia.org
      mount_propagation: null
      name: dumpsrc1
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /mnt/nfs/dumps-labstore1006.wikimedia.org
      mount_propagation: null
      name: dumpsrc2
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /etc/wmcs-project
      mount_propagation: null
      name: wmcs-project
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /data/scratch
      mount_propagation: null
      name: scratch
      read_only: null
      sub_path: null
      sub_path_expr: null
    - mount_path: /etc/ldap.conf
      mount_propagation: null
      name: etcldap-conf
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /etc/ldap.yaml
      mount_propagation: null
      name: etcldap-yaml
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /etc/novaobserver.yaml
      mount_propagation: null
      name: etcnovaobserver-yaml
      read_only: true
      sub_path: null
      sub_path_expr: null
    - mount_path: /var/lib/sss/pipes
      mount_propagation: null
      name: sssd-pipes
      read_only: null
      sub_path: null
      sub_path_expr: null
    - mount_path: /var/run/secrets/kubernetes.io/serviceaccount
      mount_propagation: null
      name: default-token-hdtmq
      read_only: true
      sub_path: null
      sub_path_expr: null
    working_dir: /data/project/jjmc89-bot
  dns_config: null
  dns_policy: ClusterFirst
  enable_service_links: true
  host_aliases: null
  host_ipc: null
  host_network: null
  host_pid: null
  hostname: null
  image_pull_secrets: null
  init_containers: null
  node_name: tools-k8s-worker-57
  node_selector: null
  preemption_policy: null
  priority: 0
  priority_class_name: null
  readiness_gates: null
  restart_policy: Never
  runtime_class_name: null
  scheduler_name: default-scheduler
  security_context:
    fs_group: 53000
    run_as_group: null
    run_as_non_root: null
    run_as_user: null
    se_linux_options: null
    supplemental_groups:
    - 1
    sysctls: null
    windows_options: null
  service_account: default
  service_account_name: default
  share_process_namespace: null
  subdomain: null
  termination_grace_period_seconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    toleration_seconds: 300
    value: null
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    toleration_seconds: 300
    value: null
  volumes:
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /data/project
      type: Directory
    iscsi: null
    name: home
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /public/dumps
      type: Directory
    iscsi: null
    name: dumps
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /mnt/nfs/dumps-labstore1007.wikimedia.org
      type: Directory
    iscsi: null
    name: dumpsrc1
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /mnt/nfs/dumps-labstore1006.wikimedia.org
      type: Directory
    iscsi: null
    name: dumpsrc2
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /etc/wmcs-project
      type: File
    iscsi: null
    name: wmcs-project
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /data/scratch
      type: Directory
    iscsi: null
    name: scratch
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /etc/ldap.conf
      type: File
    iscsi: null
    name: etcldap-conf
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /etc/ldap.yaml
      type: File
    iscsi: null
    name: etcldap-yaml
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /etc/novaobserver.yaml
      type: File
    iscsi: null
    name: etcnovaobserver-yaml
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path:
      path: /var/lib/sss/pipes
      type: Directory
    iscsi: null
    name: sssd-pipes
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret: null
    storageos: null
    vsphere_volume: null
  - aws_elastic_block_store: null
    azure_disk: null
    azure_file: null
    cephfs: null
    cinder: null
    config_map: null
    csi: null
    downward_api: null
    empty_dir: null
    fc: null
    flex_volume: null
    flocker: null
    gce_persistent_disk: null
    git_repo: null
    glusterfs: null
    host_path: null
    iscsi: null
    name: default-token-hdtmq
    nfs: null
    persistent_volume_claim: null
    photon_persistent_disk: null
    portworx_volume: null
    projected: null
    quobyte: null
    rbd: null
    scale_io: null
    secret:
      default_mode: 420
      items: null
      optional: null
      secret_name: default-token-hdtmq
    storageos: null
    vsphere_volume: null
status:
  conditions:
  - last_probe_time: null
    last_transition_time: 2021-07-03 07:07:12+00:00
    message: null
    reason: null
    status: 'True'
    type: Initialized
  - last_probe_time: null
    last_transition_time: 2021-07-03 12:03:30+00:00
    message: 'containers with unready status: [bsicons-replacer-old]'
    reason: ContainersNotReady
    status: 'False'
    type: Ready
  - last_probe_time: null
    last_transition_time: 2021-07-03 12:03:30+00:00
    message: 'containers with unready status: [bsicons-replacer-old]'
    reason: ContainersNotReady
    status: 'False'
    type: ContainersReady
  - last_probe_time: null
    last_transition_time: 2021-07-03 07:07:12+00:00
    message: null
    reason: null
    status: 'True'
    type: PodScheduled
  container_statuses:
  - container_id: docker://c7ce910349a993b98434ce9da21a8903b2b8ad82b14534d2d692a0ed1c670475
    image: docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
    image_id: docker-pullable://docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base@sha256:e42965a00ec91f52d051723277b81cce5de8339146d9010c3e735e0924fcc4a5
    last_state:
      running: null
      terminated: null
      waiting: null
    name: bsicons-replacer-old
    ready: false
    restart_count: 0
    state:
      running: null
      terminated:
        container_id: docker://c7ce910349a993b98434ce9da21a8903b2b8ad82b14534d2d692a0ed1c670475
        exit_code: 137
        finished_at: 2021-07-03 12:03:29+00:00
        message: null
        reason: OOMKilled
        signal: null
        started_at: 2021-07-03 07:07:14+00:00
      waiting: null
  host_ip: 172.16.1.183
  init_container_statuses: null
  message: null
  nominated_node_name: null
  phase: Failed
  pod_ip: 192.168.68.110
  qos_class: Burstable
  reason: null
  start_time: 2021-07-03 07:07:12+00:00
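For anyone skimming the pod dump above, the relevant part is the terminated container state (exit code 137, reason OOMKilled) next to the 512Mi memory limit. A minimal sketch of how one might pull that out of such a dump, assuming it has already been loaded into a plain dict with the same snake_case keys; the function name `find_oom_killed` is hypothetical and not part of any Toolforge tooling:

```python
from typing import Any, Dict, List


def find_oom_killed(pod: Dict[str, Any]) -> List[str]:
    """Return names of containers whose current state is terminated with OOMKilled."""
    names = []
    # Any of these keys may be present but null in the dump, hence the `or` fallbacks.
    statuses = (pod.get("status") or {}).get("container_statuses") or []
    for cs in statuses:
        terminated = (cs.get("state") or {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(cs["name"])
    return names


# Trimmed copy of the fields from the dump above that matter here.
pod = {
    "status": {
        "phase": "Failed",
        "container_statuses": [
            {
                "name": "bsicons-replacer-old",
                "state": {
                    "running": None,
                    "terminated": {"exit_code": 137, "reason": "OOMKilled"},
                    "waiting": None,
                },
            }
        ],
    }
}

print(find_oom_killed(pod))
```

Exit code 137 is 128 + 9, i.e. the process was ended by SIGKILL, which is what the kernel OOM killer sends.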
I don't see any other option here but to add additional parameters to our REST API. Will do!
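A rough sketch of what such parameters could translate to on the API side, assuming optional `mem` and `cpu` values get mapped onto the container's `resources` block. The function names and the policy of keeping the default request unless it exceeds the new limit are assumptions for illustration, not the merged implementation; the defaults are the values visible in the pod dump (limits 500m / 512Mi, requests 150m / 256Mi).

```python
from typing import Dict, Optional

# Binary suffixes only; the full Kubernetes quantity grammar has more.
_MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

# Defaults taken from the pod dump in this task.
DEFAULT_LIMITS = {"cpu": "500m", "memory": "512Mi"}
DEFAULT_REQUESTS = {"cpu": "150m", "memory": "256Mi"}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes."""
    for suffix, factor in _MEM_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def parse_cpu(quantity: str) -> int:
    """Convert a quantity such as '500m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def build_resources(mem: Optional[str] = None, cpu: Optional[str] = None) -> Dict[str, Dict[str, str]]:
    """Build a container resources dict; user values become the limits."""
    limits = dict(DEFAULT_LIMITS)
    requests = dict(DEFAULT_REQUESTS)
    if mem is not None:
        limits["memory"] = mem
        # Kubernetes rejects a request larger than the limit, so clamp it.
        if parse_memory(mem) < parse_memory(requests["memory"]):
            requests["memory"] = mem
    if cpu is not None:
        limits["cpu"] = cpu
        if parse_cpu(cpu) < parse_cpu(requests["cpu"]):
            requests["cpu"] = cpu
    return {"limits": limits, "requests": requests}


print(build_resources(mem="1Gi"))
```

Keeping the request low while raising only the limit leaves the pod in the Burstable QoS class, as in the dump; setting request equal to limit would be a different trade-off.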
Change 704997 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[cloud/toolforge/jobs-framework-api@main] jobs: add memory and cpu parameters
Change 705002 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[cloud/toolforge/jobs-framework-cli@master] toolforge-jobs: add support for job resource limits
Change 705002 merged by Arturo Borrero Gonzalez:
[cloud/toolforge/jobs-framework-cli@master] toolforge-jobs: add support for job resource limits
Change 704997 merged by jenkins-bot:
[cloud/toolforge/jobs-framework-api@main] jobs: add memory and cpu parameters
Mentioned in SAL (#wikimedia-cloud) [2021-07-20T17:45:50Z] <arturo> pushed new toolforge-jobs-framework-api docker image into the registry (3a6ae38d51202c5c765c8d800cb8380e2a20b998) (T286126)