
SystemdUnitFailed - zuul-executor
Closed, Resolved, Public

Description

Common information

  • alertname: SystemdUnitFailed
  • name: zuul-executor.service
  • prometheus: ops
  • severity: critical
  • source: prometheus
  • team: collaboration-services

Event Timeline

Dzahn renamed this task from SystemdUnitFailed to SystemdUnitFailed - zuul-executor. Thu, Mar 26, 3:43 PM
Dzahn claimed this task.

Change #1261690 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] zuul: break out mTLS setup into separate class

https://gerrit.wikimedia.org/r/1261690

This issue might require Puppet refactoring.

Core issue: we have only a single zuul config, which is used for zuul-web, zuul-scheduler and zuul-executor.

The former two run on the machines with the zuul "main" role; the latter has its own role and machines.

We set up zookeeper and its TLS in the "main" class, but if the global config references TLS certs for zookeeper, those certs can't be found on hosts with the executor role.

And if we move the code that generates the certs, it depends on zookeeper itself being present, which so far we have not installed (or needed?) on the executor role.

Possible fixes: use different zuul configs for different roles, or install zookeeper everywhere.

Mentioned in SAL (#wikimedia-operations) [2026-03-28T14:48:00Z] <dzahn@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on zuul1002.eqiad.wmnet with reason: T421398

Mentioned in SAL (#wikimedia-operations) [2026-03-28T14:48:40Z] <dzahn@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on zuul2002.codfw.wmnet with reason: T421398

Change #1261701 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] zuul: have 2 separate configs for main vs executor

https://gerrit.wikimedia.org/r/1261701

Mentioned in SAL (#wikimedia-operations) [2026-04-03T23:48:50Z] <dzahn@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on zuul1002.eqiad.wmnet with reason: T421398

Mentioned in SAL (#wikimedia-operations) [2026-04-03T23:49:04Z] <dzahn@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on zuul2002.codfw.wmnet with reason: T421398

Dzahn triaged this task as High priority. Fri, Apr 3, 11:51 PM

Change #1261701 abandoned by Dzahn:

[operations/puppet@production] zuul: have 2 separate configs for main vs executor

Reason:

keeping one global config - the fix should be to break out certificate and TLS setup into a base class that exists on all zuul machines - while NOT installing zookeeper on all of them

https://gerrit.wikimedia.org/r/1261701

Apr 08 23:10:54 zuul1002 docker[2954603]: FileNotFoundError: [Errno 2] No such file or directory: '/etc/zuul/zookeeper-tls/zuul_full_chain.pem'

OK, so the problem is not a wrong or missing directory or cert path. The directory exists and holds the other cert files; only the full chain file is missing.

root@zuul1002:/etc/zuul/zookeeper-tls# ls
zuul__zuul.chained.pem	zuul__zuul.chain.pem  zuul__zuul.csr  zuul__zuul-key.pem  zuul__zuul.pem
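The comparison above (the path from the traceback against the directory listing) can be scripted. The following is only a minimal sketch: `missing_files` and `siblings` are hypothetical helper names, and the single expected path is the one from the error message, not a list read from the real zuul config.

```python
from pathlib import Path


def missing_files(paths):
    """Return the given paths that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


def siblings(path):
    """Return the sorted file names found next to `path`, to spot naming mismatches."""
    parent = Path(path).parent
    return sorted(p.name for p in parent.iterdir()) if parent.is_dir() else []


if __name__ == "__main__":
    # Assumption: this is the only file the config expects; taken from the traceback.
    expected = ["/etc/zuul/zookeeper-tls/zuul_full_chain.pem"]
    for path in missing_files(expected):
        print(f"missing: {path}; present in directory: {siblings(path)}")
```

Printing the neighbouring file names alongside the missing path makes it easier to see whether the directory is absent or only the one file is.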

Change #1269073 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] zuul::executor: add TLS full chain needed for zookeeper config

https://gerrit.wikimedia.org/r/1269073

Change #1269073 merged by Dzahn:

[operations/puppet@production] zuul::executor: add TLS full chain needed for zookeeper config

https://gerrit.wikimedia.org/r/1269073

root@zuul1002:/etc/zuul/zookeeper-tls# systemctl status zuul-executor
● zuul-executor.service - zuul executor service
     Loaded: loaded (/usr/lib/systemd/system/zuul-executor.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-04-08 23:30:43 UTC; 10s ago
 Invocation: af29308774ae4534866f6af753a38e09
   Main PID: 2959045 (docker)
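The same check can be scripted, for example to confirm recovery on each executor host. This is a minimal sketch, not part of the actual fix: `unit_is_active` is a hypothetical helper, and it relies on `systemctl is-active --quiet` exiting with status 0 only when the unit is active.

```python
import subprocess


def unit_is_active(unit, run=subprocess.run):
    """Return True if `systemctl is-active --quiet UNIT` exits with status 0.

    `run` is injectable so the check can be exercised without systemd.
    """
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


# Example call on a host that runs the unit:
#   unit_is_active("zuul-executor.service")
```

A unit can be active and still restart-looping, so this only complements the `systemctl status` output and the logs.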

Change #1261690 abandoned by Dzahn:

[operations/puppet@production] zuul: break out mTLS setup into separate class

Reason:

solved in another way

https://gerrit.wikimedia.org/r/1261690

Mentioned in SAL (#wikimedia-operations) [2026-04-10T16:39:33Z] <dzahn@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: T421398

Mentioned in SAL (#wikimedia-operations) [2026-04-14T16:43:23Z] <dzahn@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: T421398

Mentioned in SAL (#wikimedia-operations) [2026-04-14T16:43:49Z] <dzahn@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on zuul2001.codfw.wmnet with reason: T421398