Feature summary:
When a job fails, email tool maintainers.
- Include the job name
- Include status/state information
- Include logs (kubectl logs <pod>) or save logs to an indicated file (see T286485: toolforge-jobs: figure out logging)
status:
  conditions:
  - last_probe_time: null
    last_transition_time: 2021-07-03 07:07:12+00:00
    message: null
    reason: null
    status: 'True'
    type: Initialized
  - last_probe_time: null
    last_transition_time: 2021-07-03 12:03:30+00:00
    message: 'containers with unready status: [bsicons-replacer-old]'
    reason: ContainersNotReady
    status: 'False'
    type: Ready
  - last_probe_time: null
    last_transition_time: 2021-07-03 12:03:30+00:00
    message: 'containers with unready status: [bsicons-replacer-old]'
    reason: ContainersNotReady
    status: 'False'
    type: ContainersReady
  - last_probe_time: null
    last_transition_time: 2021-07-03 07:07:12+00:00
    message: null
    reason: null
    status: 'True'
    type: PodScheduled
  container_statuses:
  - container_id: docker://c7ce910349a993b98434ce9da21a8903b2b8ad82b14534d2d692a0ed1c670475
    image: docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
    image_id: docker-pullable://docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base@sha256:e42965a00ec91f52d051723277b81cce5de8339146d9010c3e735e0924fcc4a5
    last_state:
      running: null
      terminated: null
      waiting: null
    name: bsicons-replacer-old
    ready: false
    restart_count: 0
    state:
      running: null
      terminated:
        container_id: docker://c7ce910349a993b98434ce9da21a8903b2b8ad82b14534d2d692a0ed1c670475
        exit_code: 137
        finished_at: 2021-07-03 12:03:29+00:00
        message: null
        reason: OOMKilled
        signal: null
        started_at: 2021-07-03 07:07:14+00:00
      waiting: null
  host_ip: 172.16.1.183
  init_container_statuses: null
  message: null
  nominated_node_name: null
  phase: Failed
  pod_ip: 192.168.68.110
  qos_class: Burstable
  reason: null
  start_time: 2021-07-03 07:07:12+00:00
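One possible shape for the notification, as a rough sketch only: given a pod status like the one above (already loaded into a dict), pick out the terminated container's reason and exit code and build an email with the job name and state. The function names, the sender and recipient addresses, and the assumption that the job name matches the container name are all hypothetical and not part of toolforge-jobs; actually sending the message and attaching logs are left out.

```python
from email.message import EmailMessage


def find_failure(status):
    """Return (container name, reason, exit code) for the first container
    that terminated with a non-zero exit code, or None if there is none."""
    for container in status.get("container_statuses") or []:
        terminated = (container.get("state") or {}).get("terminated")
        if terminated and terminated.get("exit_code") != 0:
            return (
                container.get("name"),
                terminated.get("reason"),
                terminated.get("exit_code"),
            )
    return None


def build_failure_email(job_name, status, recipient, sender):
    """Build (but do not send) a failure notification for a job."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Job {job_name} failed"

    lines = [f"Job: {job_name}", f"Phase: {status.get('phase')}"]
    failure = find_failure(status)
    if failure:
        name, reason, exit_code = failure
        lines.append(
            f"Container {name} terminated: reason={reason}, exit code={exit_code}"
        )
    msg.set_content("\n".join(lines))
    return msg


# Trimmed-down copy of the status shown above.
example_status = {
    "phase": "Failed",
    "container_statuses": [
        {
            "name": "bsicons-replacer-old",
            "state": {
                "running": None,
                "terminated": {"exit_code": 137, "reason": "OOMKilled"},
                "waiting": None,
            },
        }
    ],
}

# Addresses are placeholders; the job name is assumed to equal the container name.
message = build_failure_email(
    "bsicons-replacer-old",
    example_status,
    recipient="maintainers@example.org",
    sender="noreply@example.org",
)
print(message)
```

For the example status this would report the OOMKilled reason and exit code 137, which is the "what caused the failure" part of the request.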
Use case(s):
I want to know when a job fails and what caused the failure.