Toolforge jobs framework: email maintainers on job failure
Closed, ResolvedPublicFeature

Description

Feature summary:

When a job fails, email tool maintainers.

Example available status information
status:
  conditions:
  - last_probe_time: null
    last_transition_time: 2021-07-03 07:07:12+00:00
    message: null
    reason: null
    status: 'True'
    type: Initialized
  - last_probe_time: null
    last_transition_time: 2021-07-03 12:03:30+00:00
    message: 'containers with unready status: [bsicons-replacer-old]'
    reason: ContainersNotReady
    status: 'False'
    type: Ready
  - last_probe_time: null
    last_transition_time: 2021-07-03 12:03:30+00:00
    message: 'containers with unready status: [bsicons-replacer-old]'
    reason: ContainersNotReady
    status: 'False'
    type: ContainersReady
  - last_probe_time: null
    last_transition_time: 2021-07-03 07:07:12+00:00
    message: null
    reason: null
    status: 'True'
    type: PodScheduled
  container_statuses:
  - container_id: docker://c7ce910349a993b98434ce9da21a8903b2b8ad82b14534d2d692a0ed1c670475
    image: docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
    image_id: docker-pullable://docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base@sha256:e42965a00ec91f52d051723277b81cce5de8339146d9010c3e735e0924fcc4a5
    last_state:
      running: null
      terminated: null
      waiting: null
    name: bsicons-replacer-old
    ready: false
    restart_count: 0
    state:
      running: null
      terminated:
        container_id: docker://c7ce910349a993b98434ce9da21a8903b2b8ad82b14534d2d692a0ed1c670475
        exit_code: 137
        finished_at: 2021-07-03 12:03:29+00:00
        message: null
        reason: OOMKilled
        signal: null
        started_at: 2021-07-03 07:07:14+00:00
      waiting: null
  host_ip: 172.16.1.183
  init_container_statuses: null
  message: null
  nominated_node_name: null
  phase: Failed
  pod_ip: 192.168.68.110
  qos_class: Burstable
  reason: null
  start_time: 2021-07-03 07:07:12+00:00
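
As an illustration of how the fields above could feed a notification, here is a minimal sketch that turns such a status into a one-line failure summary. The function name `summarize_failure` and the plain-dict input are assumptions made for this example; it is not the emailer's actual code.

```python
from typing import Optional


def summarize_failure(status: dict) -> Optional[str]:
    """Describe the first terminated container of a failed pod, or return None."""
    if status.get("phase") != "Failed":
        return None
    for container in status.get("container_statuses") or []:
        terminated = (container.get("state") or {}).get("terminated")
        if terminated:
            return (
                f"container '{container['name']}' terminated: "
                f"reason={terminated.get('reason')}, "
                f"exit_code={terminated.get('exit_code')}"
            )
    return "pod failed, no terminated container found"


# Trimmed copy of the status shown above.
example_status = {
    "phase": "Failed",
    "container_statuses": [
        {
            "name": "bsicons-replacer-old",
            "state": {
                "running": None,
                "terminated": {"exit_code": 137, "reason": "OOMKilled"},
                "waiting": None,
            },
        }
    ],
}

print(summarize_failure(example_status))
```

For the example status this reports the `OOMKilled` reason together with exit code 137, which is the information a maintainer would most likely want in the email.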

Use case(s):

I want to know when a job fails and what caused the failure.

Details

Project                                   Branch   Lines +/-
cloud/toolforge/jobs-framework-emailer    main     +6 -4
cloud/toolforge/jobs-framework-emailer    main     +1 -1
cloud/toolforge/jobs-framework-emailer    main     +3 -1
cloud/toolforge/jobs-framework-emailer    main     +13 -3
cloud/toolforge/jobs-framework-emailer    main     +2 -2
cloud/toolforge/jobs-framework-emailer    main     +80 -20
cloud/toolforge/jobs-framework-emailer    main     +42 -16
cloud/toolforge/jobs-framework-emailer    main     +8 -16
cloud/toolforge/jobs-framework-emailer    main     +321 -187
cloud/toolforge/jobs-framework-emailer    main     +211 -185
cloud/toolforge/jobs-framework-emailer    main     +341 -158
cloud/toolforge/jobs-framework-api        main     +283 -0
cloud/toolforge/jobs-framework-emailer    main     +7 -6
cloud/toolforge/jobs-framework-emailer    main     +1 -1
cloud/toolforge/jobs-framework-cli        master   +15 -8
cloud/toolforge/jobs-framework-emailer    main     +19 -0
cloud/toolforge/jobs-framework-emailer    main     +15 -14
cloud/toolforge/jobs-framework-emailer    main     +67 -17
cloud/toolforge/jobs-framework-emailer    main     +66 -43
cloud/toolforge/jobs-framework-cli        master   +20 -2
cloud/toolforge/jobs-framework-api        main     +37 -2
cloud/toolforge/jobs-framework-emailer    main     +1 K -0

Event Timeline

Change 720714 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-cli@master] jobs: add support for selecting email notifications

https://gerrit.wikimedia.org/r/720714

Change 720300 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] deployment: refresh

https://gerrit.wikimedia.org/r/720300

Change 720745 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: filter events from label configuration

https://gerrit.wikimedia.org/r/720745

Change 720745 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] emailer: filter events from label configuration

https://gerrit.wikimedia.org/r/720745

Change 720933 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: revisit email send routine

https://gerrit.wikimedia.org/r/720933

Change 720933 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] emailer: revisit email send routine

https://gerrit.wikimedia.org/r/720933

Change 720942 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: handle containers in waiting state

https://gerrit.wikimedia.org/r/720942

Change 720942 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] emailer: handle containers in waiting state

https://gerrit.wikimedia.org/r/720942

Change 720946 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-cli@master] toolforge-jobs-framework-cli.1: refresh manpage

https://gerrit.wikimedia.org/r/720946

Change 720946 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-cli@master] toolforge-jobs-framework-cli.1: refresh manpage

https://gerrit.wikimedia.org/r/720946

Mentioned in SAL (#wikimedia-cloud) [2021-09-14T10:39:27Z] <arturo> deploying jobs-framework-api 16fbf51 (T286135)

Mentioned in SAL (#wikimedia-cloud) [2021-09-14T10:44:57Z] <arturo> deploying jobs-framework-emailer 51032af (T286135)

Change 720954 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] deployment/toolsbeta: update smtp server port number

https://gerrit.wikimedia.org/r/720954

Change 720954 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] deployment/toolsbeta: update smtp server port number

https://gerrit.wikimedia.org/r/720954

Change 720956 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: don't use TLS to contact the smtp server

https://gerrit.wikimedia.org/r/720956

Change 720956 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] emailer: don't use TLS to contact the smtp server

https://gerrit.wikimedia.org/r/720956

Mentioned in SAL (#wikimedia-cloud) [2021-09-14T11:42:33Z] <arturo> deploying jobs-framework-emailer 3045601 (T286135)

Is it supposed to be working?

I did toolforge-jobs run test-fail --command this-does-not-exist --image tf-python39 --emails onfailure --no-filelog, but didn't get an email.

For my other jobs that I added --emails onfailure (via load from yaml) all show Emails: Unknown.

Would you include the tool name in the email subject? I'd also prefer the job name in the subject if the email applies to a single job.

> Is it supposed to be working?

No, this wasn't deployed in toolforge yet, only in toolsbeta.

The new software piece, the emailer, apparently works pretty well, but I need to conduct a few more tests and adjust a few things before deploying to Toolforge.

Unfortunately it is unlikely that I will have time for this until next month due to some priority changes in the team.

@Majavah asked me to share my TODO and here it is, off the top of my head:

  • the email composition phase needs to be a bit smarter, which is probably related to the event filtering stage. As of yesterday, we could end up emailing duplicated events for a given job
  • try/catch more exceptions
  • conduct some careful tests to ensure the email flooding controls that were introduced in the system actually work (I only tested them partially with actual emails in toolsbeta)
  • halt the program entirely if one of the tasks fails. Otherwise, the daemon will keep looping but not doing its job. If we halt the program, at least k8s will restart the pod. One way to do this is to create *another task* that every 2 minutes checks if all other tasks are still scheduled, and if not, exit().
  • add comments to functions in the source code, update README and such
  • validate that our deployment manifest is right (kubernetes configuration to be loaded by kubectl -f); it probably just needs another pair of eyes.
  • automate deployment steps (I was planning to use our new spicerack setup for this). Steps are: build docker image, push docker image to the registry, deploy on k8s

The first 3 things are probably the real blockers for deploying this in toolforge now.
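
The watchdog idea from the TODO list above (a separate task that periodically checks the other tasks and stops the program if one has died) could look roughly like the sketch below. It assumes the daemon is built on asyncio; the task names, the stand-in loops, and the helper `wait_for_dead_task` are hypothetical and only serve the illustration.

```python
import asyncio
import sys
from typing import Iterable


async def wait_for_dead_task(tasks: Iterable[asyncio.Task], interval: float) -> str:
    """Poll the supervised tasks and return the name of the first one found finished."""
    supervised = list(tasks)
    while True:
        await asyncio.sleep(interval)
        for task in supervised:
            if task.done():
                return task.get_name()


async def healthy_loop() -> None:
    # Stand-in for a daemon task that keeps working.
    while True:
        await asyncio.sleep(0.01)


async def broken_loop() -> None:
    # Stand-in for a daemon task that dies on an unhandled error.
    await asyncio.sleep(0.02)
    raise RuntimeError("simulated failure")


async def supervise(interval: float) -> str:
    """Start the stand-in tasks and return the name of the first one that stops."""
    tasks = [
        asyncio.create_task(healthy_loop(), name="healthy_loop"),
        asyncio.create_task(broken_loop(), name="broken_loop"),
    ]
    try:
        return await wait_for_dead_task(tasks, interval)
    finally:
        for task in tasks:
            task.cancel()
        # Collect results so the simulated exception is not reported as unretrieved.
        await asyncio.gather(*tasks, return_exceptions=True)


def main() -> None:
    # 120 seconds matches the "every 2 minutes" check described above.
    dead = asyncio.run(supervise(interval=120.0))
    # A non-zero exit status lets Kubernetes restart the pod.
    sys.exit(f"task '{dead}' stopped unexpectedly, exiting")
```

Returning the dead task's name and exiting in `main()` keeps the check itself easy to test; `asyncio.wait(..., return_when=asyncio.FIRST_COMPLETED)` would be an alternative that reacts immediately instead of polling.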

Change 747107 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] jobs-framework-emailer: introduce helm chart

https://gerrit.wikimedia.org/r/747107

aborrero changed the task status from Open to Stalled. Dec 15 2021, 10:18 AM
aborrero removed aborrero as the assignee of this task.
aborrero moved this task from Doing to Soon! on the cloud-services-team (Kanban) board.

Change 719304 abandoned by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add emailer component

Reason:

this lives now on its own repository

https://gerrit.wikimedia.org/r/719304

aborrero changed the task status from Stalled to In Progress. Feb 21 2022, 11:16 AM
aborrero claimed this task.
aborrero moved this task from Soon! to Doing on the cloud-services-team (Kanban) board.

I'm restarting work on this.

Change 764432 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: introduce job event abstraction

https://gerrit.wikimedia.org/r/764432

Hi, all! Do you think something like this could help with T53434? I mean, would it make sense to add something like this to the webservice tool so an email is sent when the webservice container fails?

One would still want to know if a web service is working as expected from the outside, as discussed in T53434. But I feel that getting a notification when the service's container fails (even if it is automatically restarted immediately) may also help detect possible bugs.

I'm new to Toolforge and Kubernetes, so apologies if my question is too naive.

On the other hand, I just gave a quick look at the source code for jobs-framework-api, jobs-framework-client, and jobs-framework-emailer. I couldn't find where the api is using the emailer code. I was expecting to find something along the lines of lifecycle event hooks, as per the resource linked by @aborrero above. Is this still pending?

Edit: Well, I've been playing with this idea a bit. I've forked the webservice tool to have it create the container with a lifecycle preStop hook that runs a prestop.sh script in the tool's home directory. In my case this script in turn runs a node script that sends an email. This works when the container is terminated via webservice stop, for example. However, the preStop event doesn't seem to be sent when the container fails and is automatically restarted due to the restart policy. So, no luck with this, unfortunately :(

hey @diegodlh thanks for your interest. Let me share a bit more information here.

Now, to the other question: it should be simple to extend this emailer component to support notifications for other pods (not only those jobs-related). After all, we evaluate if a pod event is relevant based on things like labels and such.

That being said, I think webservices need some metric-based monitoring and alerting system. In general, a webservice workload is a bit different from the ephemeral-in-nature job workloads, so monitoring it needs a different approach. What I'm trying to say is that toolforge webservices could definitely benefit the most from any development in T194333: [Epic] Provide logging/metrics/monitoring SaaS for Cloud VPS tenants.
Think of: a loop running curl/wget against all active webservices and alerting if any are down. Discussing this in more detail is out of scope for this ticket though.

Change 764432 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] emailer: introduce job event abstraction

https://gerrit.wikimedia.org/r/764432

Change 769694 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] jobs-framework-emailer: introduce kustomize deployment

https://gerrit.wikimedia.org/r/769694

Change 769694 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] jobs-framework-emailer: introduce kustomize deployment

https://gerrit.wikimedia.org/r/769694

Change 747107 abandoned by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] jobs-framework-emailer: introduce helm chart

Reason:

merged https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-emailer/+/769694 instead

https://gerrit.wikimedia.org/r/747107

Mentioned in SAL (#wikimedia-cloud) [2022-03-11T11:59:40Z] <arturo> deploy jobs-framework-emailer d60ffd6 (T286135)

Change 769970 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] events: flush cache all at once

https://gerrit.wikimedia.org/r/769970

Change 769971 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: refresh email handling

https://gerrit.wikimedia.org/r/769971

Change 769984 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] events: validate all labels are present

https://gerrit.wikimedia.org/r/769984

Change 769970 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-emailer@main] events: flush cache all at once

https://gerrit.wikimedia.org/r/769970

Change 769971 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-emailer@main] emailer: refresh email handling

https://gerrit.wikimedia.org/r/769971

Change 769984 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-emailer@main] events: validate all labels are present

https://gerrit.wikimedia.org/r/769984

Mentioned in SAL (#wikimedia-cloud) [2022-03-11T15:02:35Z] <arturo> deploy jobs-framework-emailer 9470a5f (T286135)

Change 770477 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] configmap: update from address

https://gerrit.wikimedia.org/r/770477

Change 770477 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] configmap: update from address

https://gerrit.wikimedia.org/r/770477

hey @JJMC89 could you please give this a try?

When creating a new job, try the --emails parameter.

> hey @JJMC89 could you please give this a try?
>
> When creating a new job, try the --emails parameter.

I tested with the below command and got two emails with the same content.

$ toolforge-jobs run test-fail --command this-does-not-exist --image tf-python39 --emails onfailure --no-filelog
[Toolforge] notification about 1 jobs
We wanted to notify you about the activity of some jobs in Toolforge.

* Job 'test-fail' (normal) (emails: onfailure) had 2 events:
  -- Pod 'test-fail-lnf5q'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2022-03-14T17:00:23Z. Finish timestamp 2022-03-14T17:00:23Z. Exit code was '127'. With reason 'Error'.
  -- Pod 'test-fail-9n4x6'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2022-03-14T17:00:26Z. Finish timestamp 2022-03-14T17:00:26Z. Exit code was '127'. With reason 'Error'.


If you requested 'filelog' for any of the jobs mentioned above, you may find additional information about what happened in the associated log files. Check them from Toolforge bastions as usual.

You are receiving this email because:
 1) when the job was created, it was requested to send email notfications.
 2) you are listed as tool maintainer for this tool.

Find help and more information in wikitech: https://wikitech.wikimedia.org/

Thanks for your contributions to the Wikimedia movement.

I've also reloaded my cronjobs with onfailure emails for ongoing testing.

Would you include the tool name in the email subject? I'd also prefer the job name in the subject if the email applies to a single job.

I also get two emails when only one pod for a job fails.

A single email is sent each time:

2022-03-15 02:09:16 INFO: 1 new pending emails in the queue, new total queue size: 1
2022-03-15 02:09:43 INFO: Sending email FROM: noreply@toolforge.org TO: tools.jjmc89-bot@tools.wmflabs.org via mail.tools.wmflabs.org:25

This may suggest you have some kind of alias or duplicated inbox behind tools.jjmc89-bot@tools.wmflabs.org

Change 770901 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] compose: make the email subject more dynamic

https://gerrit.wikimedia.org/r/770901

Change 770901 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] compose: make the email subject more dynamic

https://gerrit.wikimedia.org/r/770901

Change 770902 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] compose: include tool name in the body

https://gerrit.wikimedia.org/r/770902

Change 770902 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] compose: include tool name in the body

https://gerrit.wikimedia.org/r/770902

Change 770908 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] send: relax log message about not sending emails

https://gerrit.wikimedia.org/r/770908

Change 770908 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] send: relax log message about not sending emails

https://gerrit.wikimedia.org/r/770908

> A single email is sent each time:
>
> 2022-03-15 02:09:16 INFO: 1 new pending emails in the queue, new total queue size: 1
> 2022-03-15 02:09:43 INFO: Sending email FROM: noreply@toolforge.org TO: tools.jjmc89-bot@tools.wmflabs.org via mail.tools.wmflabs.org:25
>
> This may suggest you have some kind of alias or duplicated inbox behind tools.jjmc89-bot@tools.wmflabs.org

I don't think that is the cause since I never received duplicate emails from the grid, which are sent to the same address.

Below are the Message-Ids and dates for the emails that I received. Emails 1 and 2 are the same, and 3 and 4 are the same. 5 is the one from your log, which I didn't receive two of.

  1. Message-Id: <E1nTo47-00067c-Ay@mail.tools.wmflabs.org>; Date: Mon, 14 Mar 2022 17:01:03 +0000
  2. Message-Id: <E1nTo74-0006Gu-L1@mail.tools.wmflabs.org>; Date: Mon, 14 Mar 2022 17:04:06 +0000
  3. Message-Id: <E1nTrwB-0002uZ-VK@mail.tools.wmflabs.org>; Date: Mon, 14 Mar 2022 21:09:07 +0000
  4. Message-Id: <E1nTrzF-0003Lf-8d@mail.tools.wmflabs.org>; Date: Mon, 14 Mar 2022 21:12:17 +0000
  5. Message-Id: <E1nTwd5-0002Uh-3c@mail.tools.wmflabs.org>; Date: Tue, 15 Mar 2022 02:09:43 +0000

I don't think I can explain why there are 2 emails. I couldn't reproduce this.

My working theory is that the daemon was restarted and both the old and the new daemon sent an email about the same events. But I don't think the emailer was restarted at those times.

In case this helps, I received another pair of duplicates today.

  1. Message-Id: <E1nVIuM-0005sr-Nm@mail.tools.wmflabs.org>; Date: Fri, 18 Mar 2022 20:09:10 +0000
  2. Message-Id: <E1nVIxD-0006Hy-0A@mail.tools.wmflabs.org>; Date: Fri, 18 Mar 2022 20:12:07 +0000

Can you please paste here the full repeated emails, with the complete email source and headers?

  1. P22908, P22909
  2. P22910, P22911
  3. P22912, P22913
  4. P22914, P22915

I think I have a theory of what's happening. The k8s API is really chatty about events going on for pods, which is good, but forces the emailer to do some filtering and caching to avoid flooding you with meaningless emails, which could be tricky.

  • certain pod events happen for a tool (creation, destruction, state change, restart, etc).
  • we filter out irrelevant events and cache the rest. The cache is to catch repeated events that may happen a few moments later. We cannot simply send an email per pod event.
  • This works, and at some point an email is sent to you with that event. We clear the event from the cache.
  • A few moments later, the pod (still running) generates a similar event (which I don't know yet how to uniquely identify), but we already sent an email and the cache is clean, so the code doesn't know it is duplicated.
  • The code treats it as a new event, caches it, and sends it a few moments later. The result is a duplicate email.

As a quick countermeasure, I will try increasing the time we cache events before we send an email. Hopefully this is enough to catch repeated events.
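
To make the theory concrete, here is a toy simulation of a cache that is flushed on a fixed interval. The `EventCache` class, the event strings, and the one-second tick loop are all assumptions for illustration and do not mirror the emailer's real data structures.

```python
from dataclasses import dataclass, field


@dataclass
class EventCache:
    """Collect events per job and release them as one batch per flush."""

    window: float  # seconds between flushes
    _pending: dict = field(default_factory=dict)
    _last_flush: float = 0.0

    def add(self, job: str, event: str) -> None:
        # Identical events for the same job collapse into one entry.
        self._pending.setdefault(job, set()).add(event)

    def flush_if_due(self, now: float) -> dict:
        """Return and clear the pending events once the window has elapsed."""
        if now - self._last_flush < self.window:
            return {}
        batch, self._pending = self._pending, {}
        self._last_flush = now
        return batch


def count_emails(window: float, arrivals: list) -> int:
    """Simulate one-second ticks and count non-empty flushes (emails sent)."""
    cache = EventCache(window=window)
    emails = 0
    for now in range(0, 1000):
        for when, job, event in arrivals:
            if when == now:
                cache.add(job, event)
        if cache.flush_if_due(float(now)):
            emails += 1
    return emails


# The same failure reported twice, about 3 minutes apart.
arrivals = [(10, "test-fail", "failed/127"), (190, "test-fail", "failed/127")]
print(count_emails(window=60, arrivals=arrivals))
print(count_emails(window=400, arrivals=arrivals))
```

With a 60-second window the two reports land in different batches and two emails go out; with a 400-second window they collapse into one. A longer window only reduces the chance of duplicates: two reports that straddle a flush boundary would still produce two emails.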

From your pastes:

  1. timestamp diff is ~3 minutes
  2. timestamp diff is ~3 minutes
  3. timestamp diff is ~3 minutes
  4. timestamp diff is ~3 minutes

So I will increase caching time by 4 minutes. This has the side effect of emails being delayed.
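
The roughly 3 minute gap can be cross-checked against the Date headers quoted earlier in this thread. The sketch below uses only those three pairs of dates (the pastes themselves are not reproduced here), and the helper name `gap_seconds` is made up for the example.

```python
from email.utils import parsedate_to_datetime

# Date headers of the duplicate pairs listed earlier in this thread.
pairs = [
    ("Mon, 14 Mar 2022 17:01:03 +0000", "Mon, 14 Mar 2022 17:04:06 +0000"),
    ("Mon, 14 Mar 2022 21:09:07 +0000", "Mon, 14 Mar 2022 21:12:17 +0000"),
    ("Fri, 18 Mar 2022 20:09:10 +0000", "Fri, 18 Mar 2022 20:12:07 +0000"),
]


def gap_seconds(first: str, second: str) -> int:
    """Seconds elapsed between two RFC 2822 date strings."""
    delta = parsedate_to_datetime(second) - parsedate_to_datetime(first)
    return int(delta.total_seconds())


gaps = [gap_seconds(first, second) for first, second in pairs]
print(gaps)
```

The gaps come out at 183, 190 and 177 seconds, so any caching window comfortably above 190 seconds should cover duplicates with this spacing.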

Another option would be to identify exactly which event k8s is reporting and ignore it right away.

Done, note the 400:

2022-03-22 10:54:38 INFO: new configuration: {'task_compose_emails_loop_sleep': '400', 'task_send_emails_loop_sleep': '10', 'task_send_emails_max': '10', 'task_watch_pods_timeout': '60', 'task_read_configmap_sleep': '10', 'email_to_domain': 'tools.wmflabs.org', 'email_to_prefix': 'tools', 'email_from_addr': 'noreply@toolforge.org', 'smtp_server_fqdn': 'mail.tools.wmflabs.org', 'smtp_server_port': '25', 'send_emails_for_real': 'yes', 'debug': 'yes'}
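
Every value in the logged configuration is a string. A minimal sketch of converting it into typed values follows; which keys are integers, that 'yes' means true, and that the sleep and timeout values are in seconds are assumptions made for this example, not something confirmed by the emailer source.

```python
# Configuration exactly as it appears in the log line above.
raw_config = {
    "task_compose_emails_loop_sleep": "400",
    "task_send_emails_loop_sleep": "10",
    "task_send_emails_max": "10",
    "task_watch_pods_timeout": "60",
    "task_read_configmap_sleep": "10",
    "email_to_domain": "tools.wmflabs.org",
    "email_to_prefix": "tools",
    "email_from_addr": "noreply@toolforge.org",
    "smtp_server_fqdn": "mail.tools.wmflabs.org",
    "smtp_server_port": "25",
    "send_emails_for_real": "yes",
    "debug": "yes",
}

# Assumed typing of the keys (hypothetical, for illustration only).
INT_KEYS = {
    "task_compose_emails_loop_sleep",
    "task_send_emails_loop_sleep",
    "task_send_emails_max",
    "task_watch_pods_timeout",
    "task_read_configmap_sleep",
    "smtp_server_port",
}
BOOL_KEYS = {"send_emails_for_real", "debug"}


def parse_config(raw: dict) -> dict:
    """Convert string values into int/bool where the key is known."""
    parsed = {}
    for key, value in raw.items():
        if key in INT_KEYS:
            parsed[key] = int(value)
        elif key in BOOL_KEYS:
            parsed[key] = value.strip().lower() == "yes"
        else:
            parsed[key] = value
    return parsed


config = parse_config(raw_config)
print(config["task_compose_emails_loop_sleep"], config["send_emails_for_real"])
```

Parsing once at load time means a malformed value such as a non-numeric sleep interval fails immediately with a `ValueError` rather than later inside one of the loops.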

Change 772808 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-emailer@main] emailer: config: increase task_compose_emails_loop_sleep value

https://gerrit.wikimedia.org/r/772808

Change 772808 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-emailer@main] emailer: config: increase task_compose_emails_loop_sleep value

https://gerrit.wikimedia.org/r/772808

This is done for now.