
Toolforge jobs framework: email maintainers on job failure
Open, Medium, Public, Feature

Description

Feature summary:

When a job fails, email tool maintainers.

Example available status information
status:
  conditions:
  - last_probe_time: null
    last_transition_time: 2021-07-03 07:07:12+00:00
    message: null
    reason: null
    status: 'True'
    type: Initialized
  - last_probe_time: null
    last_transition_time: 2021-07-03 12:03:30+00:00
    message: 'containers with unready status: [bsicons-replacer-old]'
    reason: ContainersNotReady
    status: 'False'
    type: Ready
  - last_probe_time: null
    last_transition_time: 2021-07-03 12:03:30+00:00
    message: 'containers with unready status: [bsicons-replacer-old]'
    reason: ContainersNotReady
    status: 'False'
    type: ContainersReady
  - last_probe_time: null
    last_transition_time: 2021-07-03 07:07:12+00:00
    message: null
    reason: null
    status: 'True'
    type: PodScheduled
  container_statuses:
  - container_id: docker://c7ce910349a993b98434ce9da21a8903b2b8ad82b14534d2d692a0ed1c670475
    image: docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
    image_id: docker-pullable://docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base@sha256:e42965a00ec91f52d051723277b81cce5de8339146d9010c3e735e0924fcc4a5
    last_state:
      running: null
      terminated: null
      waiting: null
    name: bsicons-replacer-old
    ready: false
    restart_count: 0
    state:
      running: null
      terminated:
        container_id: docker://c7ce910349a993b98434ce9da21a8903b2b8ad82b14534d2d692a0ed1c670475
        exit_code: 137
        finished_at: 2021-07-03 12:03:29+00:00
        message: null
        reason: OOMKilled
        signal: null
        started_at: 2021-07-03 07:07:14+00:00
      waiting: null
  host_ip: 172.16.1.183
  init_container_statuses: null
  message: null
  nominated_node_name: null
  phase: Failed
  pod_ip: 192.168.68.110
  qos_class: Burstable
  reason: null
  start_time: 2021-07-03 07:07:12+00:00
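As a rough illustration of how the status above could be turned into something worth emailing, here is a minimal sketch. It assumes the status has already been loaded into a plain Python dict with the same keys as the dump above; `summarize_failure` is a hypothetical helper name, not part of any existing tool.

```python
from typing import Optional


def summarize_failure(pod_status: dict) -> Optional[str]:
    """Return a one-line failure summary, or None if the pod did not fail."""
    if pod_status.get("phase") != "Failed":
        return None
    parts = []
    # container_statuses may be null in the dump, hence the "or []".
    for cs in pod_status.get("container_statuses") or []:
        terminated = (cs.get("state") or {}).get("terminated")
        if terminated:
            parts.append(
                f"{cs['name']}: {terminated.get('reason')} "
                f"(exit code {terminated.get('exit_code')})"
            )
    return "; ".join(parts) or "failed, no terminated container found"


# Trimmed copy of the example status above.
example = {
    "phase": "Failed",
    "container_statuses": [
        {
            "name": "bsicons-replacer-old",
            "state": {
                "running": None,
                "terminated": {"exit_code": 137, "reason": "OOMKilled"},
                "waiting": None,
            },
        }
    ],
}
print(summarize_failure(example))
# prints: bsicons-replacer-old: OOMKilled (exit code 137)
```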

Use case(s):

I want to know when a job fails and what caused the failure.

Event Timeline

Is this something the grid supports today?

Yes. The grid can email in more cases than failures.

qsub -m<value> -M <email> allows customizing the emails. jsub defaults the -M value to <tool>@tools.wmflabs.org.

The cron daemon also sends emails in certain cases.

Yeah, I don't know if we need or want to replicate all the scope creep that the grid has gathered over the years, but emailing for failures is definitely a basic expectation from grid users. I get an email every time the cdnjs job starts from the cron server, which is how I know if it has stopped working (I stop getting daily emails).

A dashboard would be a more modern replacement in a future iteration that would allow a lot more arbitrary questions to be answered much better, but the grid has always been kind of old-school so far.

I don't recommend replicating all of it either.

A dashboard would be a more modern replacement in a future iteration that would allow a lot more arbitrary questions to be answered much better, but the grid has always been kind of old-school so far.

We have Grafana to monitor k8s namespace resources. I've been using it to monitor for now, but I don't want to have to actively monitor for job failures.

I'm currently evaluating this: https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/ i.e., simply executing a shell one-liner to send an email from within the job pod.

Other options I've considered:

  • Sidecar option: inject a sidecar that is subscribed to its own pod's events and reacts to them by sending emails accordingly. Not very robust.
  • Deployment option: when the first job that requires email notification is created, a deployment with a small watcher program is created to listen to job events. We can store in a label whether a given job requires email notification. If the last job with email notifications is deleted, we can drop this watcher deployment. Perhaps over-engineered.
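To make the label idea in the deployment option concrete, here is a minimal sketch of the filtering decision a watcher could make. The label key and the "all" and "none" values are assumptions for illustration only; "onfailure" is the only value that appears in this task.

```python
from typing import Optional

# Hypothetical label key; the real key used by the jobs framework may differ.
EMAILS_LABEL = "jobs.toolforge.org/emails"


def wants_email(labels: Optional[dict], phase: str) -> bool:
    """Decide whether a pod phase change should trigger an email."""
    setting = (labels or {}).get(EMAILS_LABEL, "none")
    if setting == "all":  # assumed value: notify on start, success and failure
        return phase in ("Running", "Succeeded", "Failed")
    if setting == "onfailure":
        return phase == "Failed"
    # "none", a missing label, or an unknown value: stay silent.
    return False
```

Keeping the decision in a pure function like this would let the watcher loop stay small and make the filtering easy to test without a cluster.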

But I want to know more first:

  • is the expectation that the email is sent when the job starts?
  • when the job ends? when it ends successfully? or in error?
  • both? fully configurable?

Personally, I only want emails for failures. I silence everything else on the grid.

nskaggs triaged this task as Medium priority. Aug 10 2021, 4:30 PM
nskaggs moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Change 719304 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add emailer component

https://gerrit.wikimedia.org/r/719304

Change 720297 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] Initial code commit

https://gerrit.wikimedia.org/r/720297

Change 720297 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] Initial code commit

https://gerrit.wikimedia.org/r/720297

Change 720300 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] devel/deployment-local.yaml: use jobs-emailer namespace

https://gerrit.wikimedia.org/r/720300

Change 720709 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] api: add new parameter to allow users to request email notifications

https://gerrit.wikimedia.org/r/720709

Change 720714 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-cli@master] jobs: add support for selecting email notifications

https://gerrit.wikimedia.org/r/720714

Change 720709 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] api: add new parameter to allow users to request email notifications

https://gerrit.wikimedia.org/r/720709

Change 720714 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-cli@master] jobs: add support for selecting email notifications

https://gerrit.wikimedia.org/r/720714

Change 720300 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] deployment: refresh

https://gerrit.wikimedia.org/r/720300

Change 720745 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: filter events from label configuration

https://gerrit.wikimedia.org/r/720745

Change 720745 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] emailer: filter events from label configuration

https://gerrit.wikimedia.org/r/720745

Change 720933 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: revisit email send routine

https://gerrit.wikimedia.org/r/720933

Change 720933 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] emailer: revisit email send routine

https://gerrit.wikimedia.org/r/720933

Change 720942 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: handle containers in waiting state

https://gerrit.wikimedia.org/r/720942

Change 720942 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] emailer: handle containers in waiting state

https://gerrit.wikimedia.org/r/720942

Change 720946 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-cli@master] toolforge-jobs-framework-cli.1: refresh manpage

https://gerrit.wikimedia.org/r/720946

Change 720946 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-cli@master] toolforge-jobs-framework-cli.1: refresh manpage

https://gerrit.wikimedia.org/r/720946

Mentioned in SAL (#wikimedia-cloud) [2021-09-14T10:39:27Z] <arturo> deploying jobs-framework-api 16fbf51 (T286135)

Mentioned in SAL (#wikimedia-cloud) [2021-09-14T10:44:57Z] <arturo> deploying jobs-framework-emailer 51032af (T286135)

Change 720954 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] deployment/toolsbeta: update smtp server port number

https://gerrit.wikimedia.org/r/720954

Change 720954 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] deployment/toolsbeta: update smtp server port number

https://gerrit.wikimedia.org/r/720954

Change 720956 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: don't use TLS to contact the smtp server

https://gerrit.wikimedia.org/r/720956

Change 720956 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] emailer: don't use TLS to contact the smtp server

https://gerrit.wikimedia.org/r/720956

Mentioned in SAL (#wikimedia-cloud) [2021-09-14T11:42:33Z] <arturo> deploying jobs-framework-emailer 3045601 (T286135)

Is it supposed to be working?

I did toolforge-jobs run test-fail --command this-does-not-exist --image tf-python39 --emails onfailure --no-filelog, but didn't get an email.

My other jobs, to which I added --emails onfailure (via load from YAML), all show Emails: Unknown.

Would you include the tool name in the email subject? I'd also prefer the job name in the subject if the email applies to a single job.

Is it supposed to be working?

No, this wasn't deployed in toolforge yet, only in toolsbeta.

The new emailer component apparently works pretty well, but I need to conduct a few more tests and adjust a few things before deploying to Toolforge.

Unfortunately it is unlikely that I will have time for this until next month due to some priority changes in the team.

@Majavah asked me to share my TODO and here it is, off the top of my head:

  • the email composition phase needs to be a bit smarter, which is probably related to the event filtering stage. As of yesterday, we could end up emailing duplicate events for a given job
  • try/catch more exceptions
  • conduct some careful tests to ensure the email flooding controls that were introduced in the system actually work (I only tested them partially with actual emails in toolsbeta)
  • halt the program entirely if one of the tasks fails. Otherwise, the daemon will keep looping but not doing its job. If we halt the program, at least k8s will restart the pod. One way to do this is to create *another task* that every 2 minutes checks if all other tasks are still scheduled, and if not, exit().
  • add comments to functions in the source code, update README and such
  • validate that our deployment manifest is right (kubernetes configuration to be loaded by kubectl apply -f), probably just need another pair of eyes.
  • automate deployment steps (I was planning to use our new spicerack setup for this). Steps are: build docker image, push docker image to the registry, deploy on k8s

The first 3 things are probably the real blockers for deploying this in toolforge now.
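The "halt the program if one of the tasks fails" item could look roughly like the sketch below. It assumes the daemon's tasks are asyncio tasks, which this task does not confirm; `watchdog` is a hypothetical name and the 2-minute default only mirrors the interval mentioned in the TODO.

```python
import asyncio
import sys


async def watchdog(tasks: list, interval: float = 120.0) -> None:
    """Exit the process as soon as any supervised task has stopped."""
    while True:
        await asyncio.sleep(interval)
        dead = [t for t in tasks if t.done()]
        if not dead:
            continue
        for t in dead:
            # Retrieve the exception (if any) so the reason gets logged.
            exc = None if t.cancelled() else t.exception()
            print(f"task {t.get_name()} stopped: {exc!r}", file=sys.stderr)
        # A non-zero exit lets Kubernetes restart the pod.
        sys.exit(1)
```

The watchdog would be started next to the other tasks, for example with `asyncio.create_task(watchdog(other_tasks))`. Raising SystemExit from inside a task propagates out of the event loop, so the process ends instead of looping without doing its job.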