
[jobs-emailer] duplicate failure emails
Open, Medium, Public

Description

Example exact duplicates:

* Job 'recent-vios' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'recent-vios-28904197-btkm8'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tMkF9-00F9Pk-1z@mail.tools.wmcloud.org>; Date: Sun, 15 Dec 2024 08:44:51 +0000
  • Message-Id: <E1tMkrx-00FA4P-0v@mail.tools.wmcloud.org>; Date: Sun, 15 Dec 2024 09:24:57 +0000
* Job 'recent-vios' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'recent-vios-28907077-h9mxt'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tNTIh-00GCRJ-2w@mail.tools.wmcloud.org>; Date: Tue, 17 Dec 2024 08:51:31 +0000
  • Message-Id: <E1tNTik-00GDAg-1O@mail.tools.wmcloud.org>; Date: Tue, 17 Dec 2024 09:18:26 +0000
* Job 'admin-activity-early' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'admin-activity-early-28909447-lm5wt'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tO4BY-00HEVt-1O@mail.tools.wmcloud.org>; Date: Thu, 19 Dec 2024 00:14:36 +0000
  • Message-Id: <E1tO4I2-00HEd6-21@mail.tools.wmcloud.org>; Date: Thu, 19 Dec 2024 00:21:18 +0000
* Job 'massmessage-list-updater' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'massmessage-list-updater-28910893-vjstw'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2024-12-20T00:13:18Z. Finish timestamp 2024-12-20T00:14:08Z. Exit code was '137'. With reason 'Error'.
  • Message-Id: <E1tOQex-000Sro-0t@mail.tools.wmcloud.org>; Date: Fri, 20 Dec 2024 00:14:27 +0000
  • Message-Id: <E1tOQlW-000Sym-0V@mail.tools.wmcloud.org>; Date: Fri, 20 Dec 2024 00:21:14 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28913897-frfjt'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tPBng-001jO5-00@mail.tools.wmcloud.org>; Date: Sun, 22 Dec 2024 02:34:36 +0000
  • Message-Id: <E1tPCWw-001kqx-1V@mail.tools.wmcloud.org>; Date: Sun, 22 Dec 2024 03:21:22 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28916657-gzdgd'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tPszF-002bOx-0h@mail.tools.wmcloud.org>; Date: Tue, 24 Dec 2024 00:41:25 +0000
  • Message-Id: <E1tPtVh-002cD8-1g@mail.tools.wmcloud.org>; Date: Tue, 24 Dec 2024 01:14:57 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28920737-gqdhp'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2024-12-26T20:17:18Z. Finish timestamp 2024-12-26T20:19:41Z. Exit code was '137'. With reason 'Error'.
  • Message-Id: <E1tQuKv-003ugK-2P@mail.tools.wmcloud.org>; Date: Thu, 26 Dec 2024 20:20:01 +0000
  • Message-Id: <E1tQurH-003vHn-2x@mail.tools.wmcloud.org>; Date: Thu, 26 Dec 2024 20:53:27 +0000
* Job 'admin-activity-early' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'admin-activity-early-28920967-wncl4'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tQxzW-003z8P-2P@mail.tools.wmcloud.org>; Date: Fri, 27 Dec 2024 00:14:10 +0000
  • Message-Id: <E1tQy60-003zKN-0H@mail.tools.wmcloud.org>; Date: Fri, 27 Dec 2024 00:20:52 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28922117-5tfs7'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2024-12-27T19:17:20Z. Finish timestamp 2024-12-27T19:20:48Z. Exit code was '137'. With reason 'Error'.
  • Message-Id: <E1tRFzA-004LmY-16@mail.tools.wmcloud.org>; Date: Fri, 27 Dec 2024 19:27:00 +0000
  • Message-Id: <E1tRGC9-004LwH-0e@mail.tools.wmcloud.org>; Date: Fri, 27 Dec 2024 19:40:25 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28922417-dqn7r'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tRKZv-004RC3-0u@mail.tools.wmcloud.org>; Date: Sat, 28 Dec 2024 00:21:15 +0000
  • Message-Id: <E1tRKt5-004RP0-1W@mail.tools.wmcloud.org>; Date: Sat, 28 Dec 2024 00:41:03 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28925747-rt9cg'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tSAev-005OrS-2j@mail.tools.wmcloud.org>; Date: Mon, 30 Dec 2024 07:57:53 +0000
  • Message-Id: <E1tSAlY-005Oyu-03@mail.tools.wmcloud.org>; Date: Mon, 30 Dec 2024 08:04:44 +0000
* Job 'recent-vios' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'recent-vios-28928197-r887w'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2025-01-01T00:37:18Z. Finish timestamp 2025-01-01T01:12:43Z. Exit code was '137'. With reason 'Error'.
  • Message-Id: <E1tSnJS-006HIF-1D@mail.tools.wmcloud.org>; Date: Wed, 01 Jan 2025 01:14:18 +0000
  • Message-Id: <E1tSo2p-006IDJ-26@mail.tools.wmcloud.org>; Date: Wed, 01 Jan 2025 02:01:11 +0000
* Job 'purge-dup-args' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'purge-dup-args-28929393-d65lf'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tT5j5-006ehF-2G@mail.tools.wmcloud.org>; Date: Wed, 01 Jan 2025 20:53:59 +0000
  • Message-Id: <E1tT6sC-006fs9-0o@mail.tools.wmcloud.org>; Date: Wed, 01 Jan 2025 22:07:28 +0000
* Job 'purge-weekly' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'purge-weekly-28925367-n28t6'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2024-12-30T01:27:17Z. Finish timestamp 2024-12-30T01:35:58Z. Exit code was '137'. With reason 'Error'.
  • Message-Id: <E1tS4ma-005HqU-0z@mail.tools.wmcloud.org>; Date: Mon, 30 Dec 2024 01:41:24 +0000
  • Message-Id: <E1tU2af-007jdL-0X@mail.tools.wmcloud.org>; Date: Sat, 04 Jan 2025 11:45:13 +0000
* Job 'all-vios' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'all-vios-28923268-sfx8b'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2024-12-28T14:28:14Z. Finish timestamp 2024-12-28T14:40:36Z. Exit code was '137'. With reason 'Error'.
  • Message-Id: <E1tRY4k-004hJS-0n@mail.tools.wmcloud.org>; Date: Sat, 28 Dec 2024 14:45:58 +0000
  • Message-Id: <E1tU0Ou-007erX-2E@mail.tools.wmcloud.org>; Date: Sat, 04 Jan 2025 09:24:56 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28936397-7m5lm'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tUqry-008vMw-0s@mail.tools.wmcloud.org>; Date: Mon, 06 Jan 2025 17:26:26 +0000
  • Message-Id: <E1tUr4n-008veS-1j@mail.tools.wmcloud.org>; Date: Mon, 06 Jan 2025 17:39:41 +0000
* Job 'recent-vios' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'recent-vios-28940677-pbwj9'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tVvdL-00Amon-38@mail.tools.wmcloud.org>; Date: Thu, 09 Jan 2025 16:43:47 +0000
  • Message-Id: <E1tVwGG-00Ao0s-0t@mail.tools.wmcloud.org>; Date: Thu, 09 Jan 2025 17:24:00 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28943207-lfxgd'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tWZHe-00BbrD-0E@mail.tools.wmcloud.org>; Date: Sat, 11 Jan 2025 11:04:02 +0000
  • Message-Id: <E1tWZhS-00BcJd-04@mail.tools.wmcloud.org>; Date: Sat, 11 Jan 2025 11:30:42 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28943807-gjv4h'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tWifr-00Bl7s-33@mail.tools.wmcloud.org>; Date: Sat, 11 Jan 2025 21:05:39 +0000
  • Message-Id: <E1tWj5q-00BlXy-1B@mail.tools.wmcloud.org>; Date: Sat, 11 Jan 2025 21:32:30 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28944617-4fjtq'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tWvII-00C0ZW-1d@mail.tools.wmcloud.org>; Date: Sun, 12 Jan 2025 10:34:10 +0000
  • Message-Id: <E1tWvbn-00C0wX-2j@mail.tools.wmcloud.org>; Date: Sun, 12 Jan 2025 10:54:19 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28945127-cwfv2'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tX37e-00C8ig-2C@mail.tools.wmcloud.org>; Date: Sun, 12 Jan 2025 18:55:42 +0000
  • Message-Id: <E1tX3R3-00C96l-12@mail.tools.wmcloud.org>; Date: Sun, 12 Jan 2025 19:15:45 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28949957-bhcjv'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2025-01-16T03:17:11Z. Finish timestamp 2025-01-16T03:33:17Z. Exit code was '137'. With reason 'Error'.
  • Message-Id: <E1tYGhh-00DjVB-3A@mail.tools.wmcloud.org>; Date: Thu, 16 Jan 2025 03:37:57 +0000
  • Message-Id: <E1tYH7c-00Dk8Y-0r@mail.tools.wmcloud.org>; Date: Thu, 16 Jan 2025 04:04:44 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28951217-cxt47'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tYaDK-00EA8v-28@mail.tools.wmcloud.org>; Date: Fri, 17 Jan 2025 00:27:54 +0000
  • Message-Id: <E1tYaWs-00EAf4-0d@mail.tools.wmcloud.org>; Date: Fri, 17 Jan 2025 00:48:06 +0000
* Job 'purge-dup-args' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'purge-dup-args-28951293-hsthf'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tYbZS-00EByt-0J@mail.tools.wmcloud.org>; Date: Fri, 17 Jan 2025 01:54:50 +0000
  • Message-Id: <E1tYcCE-00EClW-1z@mail.tools.wmcloud.org>; Date: Fri, 17 Jan 2025 02:34:54 +0000
* Job 'purge-dup-args' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'purge-dup-args-28952673-r92dm'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2025-01-18T00:33:14Z. Finish timestamp 2025-01-18T00:56:01Z. Exit code was '137'. With reason 'Error'.
  • Message-Id: <E1tYx9l-00EfSi-2Q@mail.tools.wmcloud.org>; Date: Sat, 18 Jan 2025 00:57:45 +0000
  • Message-Id: <E1tYxmf-00EgGW-0b@mail.tools.wmcloud.org>; Date: Sat, 18 Jan 2025 01:37:57 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28953497-5lv26'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2025-01-18T14:17:21Z. Finish timestamp 2025-01-18T14:29:38Z. Exit code was '137'. With reason 'Error'.
  • Message-Id: <E1tZ9si-00Etz8-0p@mail.tools.wmcloud.org>; Date: Sat, 18 Jan 2025 14:33:00 +0000
  • Message-Id: <E1tZACG-00EuO0-2Q@mail.tools.wmcloud.org>; Date: Sat, 18 Jan 2025 14:53:12 +0000
* Job 'recent-vios' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'recent-vios-28954357-4c4rr'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tZN8G-00F7nJ-0c@mail.tools.wmcloud.org>; Date: Sun, 19 Jan 2025 04:41:56 +0000
  • Message-Id: <E1tZO4P-00F8lk-1Y@mail.tools.wmcloud.org>; Date: Sun, 19 Jan 2025 05:42:01 +0000
* Job 'recent-vios' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'recent-vios-28955557-4gwzn'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tZgJp-00FU1P-2K@mail.tools.wmcloud.org>; Date: Mon, 20 Jan 2025 01:11:09 +0000
  • Message-Id: <E1tZh3D-00FV46-0Z@mail.tools.wmcloud.org>; Date: Mon, 20 Jan 2025 01:58:03 +0000
* Job 'purge-daily' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'purge-daily-28955580-dg494'. Phase: 'failed'. Container state: 'terminated'. Exit code was '137'. With reason 'ContainerStatusUnknown'. With message: 'The container could not be located when the pod was terminated'.
  • Message-Id: <E1tZgJp-00FU0r-0N@mail.tools.wmcloud.org>; Date: Mon, 20 Jan 2025 01:11:09 +0000
  • Message-Id: <E1tZhzN-00FWLb-1w@mail.tools.wmcloud.org>; Date: Mon, 20 Jan 2025 02:58:09 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28955837-tjhvp'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2025-01-20T05:21:34Z. Finish timestamp 2025-01-20T05:29:58Z. Exit code was '137'. With reason 'Error'.
  • Message-Id: <E1tZkO1-00FZYB-2F@mail.tools.wmcloud.org>; Date: Mon, 20 Jan 2025 05:31:45 +0000
  • Message-Id: <E1tZkhc-00Fa10-0c@mail.tools.wmcloud.org>; Date: Mon, 20 Jan 2025 05:52:00 +0000

Example partial duplicate (event sent more than once):

* Job 'cfdw' (cronjob) (emails: onfailure) had 1 events:
  -- Pod 'cfdw-28928147-9qtjx'. Phase: 'running'. Container state: 'terminated'. Start timestamp 2024-12-31T23:47:01Z. Finish timestamp 2025-01-03T21:46:33Z. Exit code was '255'. With reason 'Unknown'.
  • Message-Id: <E1tTpVp-007RsS-1n@mail.tools.wmcloud.org>; Date: Fri, 03 Jan 2025 21:47:21 +0000
* Job 'cfdw' (cronjob) (emails: onfailure) had 2 events:
  -- Pod 'cfdw-28928147-9qtjx'. Phase: 'running'. Container state: 'terminated'. Start timestamp 2024-12-31T23:47:01Z. Finish timestamp 2025-01-03T21:46:33Z. Exit code was '255'. With reason 'Unknown'.
  -- Pod 'cfdw-28928147-9qtjx'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2024-12-31T23:47:01Z. Finish timestamp 2025-01-03T21:46:33Z. Exit code was '255'. With reason 'Unknown'.
  • Message-Id: <E1tTpcC-007RwH-1q@mail.tools.wmcloud.org>; Date: Fri, 03 Jan 2025 21:53:56 +0000

My previous report of duplicates that appeared to be solved at the time: T286135#7780505

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcaro subscribed.

Interesting. There's a "big" performance change pending deployment, but it's not live yet, so afaik there have not been any big changes in the deployed code.

The current "debouncing" time is 400s (6 min 40 s):
https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/blob/main/emailer/cfg.py?ref_type=heads#L33

# The default here is 6.5 minutes
"task_compose_emails_loop_sleep": "400",

That matches almost exactly the time between the two emails in the partial duplicate example (Fri, 03 Jan 2025 21:47:21 +0000 -> Fri, 03 Jan 2025 21:53:56 +0000, a gap of 6 min 35 s), so my guess is that the debouncing, even with the increased time, has stopped being as effective.
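
As a rough check of that comparison, the sketch below computes the gap between a few of the email pairs from the task description and expresses it in units of the 400s loop sleep. The helper name `gap_seconds` and the constant name are made up for illustration; only the Date headers and the 400s value come from this task.

```python
from email.utils import parsedate_to_datetime

# Value of "task_compose_emails_loop_sleep" quoted above, in seconds.
LOOP_SLEEP_SECONDS = 400


def gap_seconds(first_date: str, second_date: str) -> int:
    """Return the number of seconds between two email Date header values."""
    first = parsedate_to_datetime(first_date)
    second = parsedate_to_datetime(second_date)
    return int((second - first).total_seconds())


# Date headers copied from the emails listed in the task description.
pairs = {
    "cfdw-28928147-9qtjx": (
        "Fri, 03 Jan 2025 21:47:21 +0000",
        "Fri, 03 Jan 2025 21:53:56 +0000",
    ),
    "admin-activity-early-28909447-lm5wt": (
        "Thu, 19 Dec 2024 00:14:36 +0000",
        "Thu, 19 Dec 2024 00:21:18 +0000",
    ),
    "recent-vios-28904197-btkm8": (
        "Sun, 15 Dec 2024 08:44:51 +0000",
        "Sun, 15 Dec 2024 09:24:57 +0000",
    ),
}

for pod, (first, second) in pairs.items():
    gap = gap_seconds(first, second)
    print(f"{pod}: {gap}s apart, {gap / LOOP_SLEEP_SECONDS:.2f} loop intervals")
```

Running the same calculation over all pairs in the description would show how many of the duplicates land near a whole number of loop intervals, which is only suggestive and does not by itself confirm the cause.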

We might want to increase that and/or find a better process to gather the events we care about.
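
One possible shape for "a better process", as a minimal sketch only: remember which events have already been emailed and skip repeats, instead of relying purely on a time window. Nothing below is taken from the jobs-emailer code; `PodEvent`, `SentEventCache`, the choice of key fields, and the retention period are all hypothetical. The retention period would need to be generous, since some duplicates listed above (for example 'purge-weekly' and 'all-vios') arrived several days after the first email.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class PodEvent:
    """Hypothetical event record; the field names are illustrative only."""

    pod_name: str
    phase: str
    container_state: str
    exit_code: int
    reason: str

    def dedup_key(self) -> Tuple[str, str, int, str]:
        # Assumption: phase is left out of the key so that a 'running' and a
        # 'failed' report of the same container termination count as one event.
        return (self.pod_name, self.container_state, self.exit_code, self.reason)


class SentEventCache:
    """Remembers already-emailed events for ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._sent: Dict[Tuple[str, str, int, str], float] = {}

    def should_send(self, event: PodEvent) -> bool:
        now = self._clock()
        # Drop entries older than the retention period.
        self._sent = {k: t for k, t in self._sent.items() if now - t < self._ttl}
        key = event.dedup_key()
        if key in self._sent:
            return False
        self._sent[key] = now
        return True


# Usage with a fake clock, replaying the partial duplicate from the description.
fake_now = [0.0]
cache = SentEventCache(ttl_seconds=7 * 24 * 3600, clock=lambda: fake_now[0])
first = PodEvent("cfdw-28928147-9qtjx", "running", "terminated", 255, "Unknown")
repeat = PodEvent("cfdw-28928147-9qtjx", "failed", "terminated", 255, "Unknown")

print(cache.should_send(first))   # first report: would be emailed
fake_now[0] = 395.0               # next compose cycle, 395s later
print(cache.should_send(repeat))  # same termination again: would be skipped
```

Whether the phase belongs in the key is a judgment call: leaving it out suppresses the 'running' then 'failed' repeat, but would also hide a later phase change that someone might want to see. An in-memory cache like this would also be lost on restart, so a real fix would have to consider persistence.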

dcaro triaged this task as Medium priority. Jan 6 2025, 9:52 AM