SystemdUnitFailed - lists1004 - wmf_auto_restart_exim4
Closed, ResolvedPublic

Description

Common information

  • alertname: SystemdUnitFailed
  • instance: lists1004:9100
  • name: wmf_auto_restart_exim4.service
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: collaboration-services

Event Timeline

LSobanski renamed this task from SystemdUnitFailed to SystemdUnitFailed - lists1004. Apr 15 2025, 11:13 AM

The daemon is running fine.

root@lists1004:/# systemctl is-failed exim4
active

I'm more curious why it got started so many times.

root@lists1004:/home/dzahn# grep "daemon " /var/log/exim4/mainlog 
2025-04-15 06:14:00 exim 4.96 daemon started: pid=2945, -q1m, listening for SMTP on port 25 (IPv6 and IPv4)
2025-04-15 06:14:00 exim 4.96 daemon started: pid=2961, -q1m, listening for SMTP on port 25 (IPv6 and IPv4)
2025-04-15 06:14:01 exim 4.96 daemon started: pid=2991, -q1m, listening for SMTP on port 25 (IPv6 and IPv4)
2025-04-15 06:14:01 exim 4.96 daemon started: pid=3021, -q1m, listening for SMTP on port 25 (IPv6 and IPv4)
2025-04-15 06:14:01 exim 4.96 daemon started: pid=3096, -q1m, listening for SMTP on port 25 (IPv6 and IPv4)
2025-04-15 06:18:03 exim 4.96 daemon started: pid=6372, -q1m, listening for SMTP on port 25 (IPv6 and IPv4)
2025-04-15 11:17:29 exim 4.96 daemon started: pid=168749, -q1m, listening for SMTP on port 25 (IPv6 and IPv4)
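
To see how the starts cluster, the grep output can be bucketed per minute. This is only a minimal sketch, assuming the mainlog line format shown above; `count_daemon_starts` is a hypothetical helper, not part of any existing tooling.

```python
import re
from collections import Counter

# Matches exim mainlog lines such as:
# 2025-04-15 06:14:00 exim 4.96 daemon started: pid=2945, -q1m, ...
START_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2} exim \S+ daemon started: pid=(\d+)"
)


def count_daemon_starts(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH:MM' to the number of daemon starts."""
    counts = Counter()
    for line in lines:
        match = START_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Fed the seven lines above (for example by iterating over an open handle on /var/log/exim4/mainlog), this gives five starts in the 06:14 minute and one each at 06:18 and 11:17.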

Oh, never mind. Of course this isn't the exim4 service itself but the "restart_exim4" service again.

We have seen this before; it looks like a race of some kind.

Apr 15 11:17:29 lists1004 systemd[1]: exim4.service: Found left-over process 167622 (exim4) in control group while starting unit. Ignoring.
Apr 15 11:17:29 lists1004 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 15 11:17:29 lists1004 systemd[1]: exim4.service: Found left-over process 168172 (exim4) in control group while starting unit. Ignoring.
Apr 15 11:17:29 lists1004 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 15 11:17:29 lists1004 systemd[1]: Starting exim4.service - LSB: exim Mail Transport Agent...

Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:00,441 : Detected necessary restart for service exim4 (2901)
Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:00,441 : Service exim4 uses a legacy sysvinit script
Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:00,441 : Consider using a systemd unit instead
Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:00,632 : Restarted service exim4
Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:00,632 : Detected necessary restart for service exim4 (2873)
Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:00,632 : Service exim4 uses a legacy sysvinit script
Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:00,632 : Consider using a systemd unit instead
Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:00,830 : Restarted service exim4
Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:00,830 : Detected necessary restart for service exim4 (2886)
Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:00,830 : Service exim4 uses a legacy sysvinit script
Apr 15 06:14:00 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:00,830 : Consider using a systemd unit instead
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,083 : Restarted service exim4
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,083 : Detected necessary restart for service exim4 (2885)
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,084 : Service exim4 uses a legacy sysvinit script
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,084 : Consider using a systemd unit instead
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,286 : Restarted service exim4
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,286 : Detected necessary restart for service exim4 (2832)
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,286 : Service exim4 uses a legacy sysvinit script
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,286 : Consider using a systemd unit instead
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,534 : Restarted service exim4
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,534 : Detected necessary restart for service exim4 (2877)
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,534 : Service exim4 uses a legacy sysvinit script
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,535 : Consider using a systemd unit instead
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: ERROR: 2025-04-15 06:14:01,656 : Failed to restart service exim4:
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: ERROR: 2025-04-15 06:14:01,656 : b'Job for exim4.service failed.\nSee "systemctl status exim4.service" and "journalctl -xeu exim4.service" >
Apr 15 06:14:01 lists1004 systemd[1]: wmf_auto_restart_exim4.service: Main process exited, code=exited, status=1/FAILURE
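
To make the timing of the race easier to see, the restart attempts can be pulled out of the journal lines and the gaps between them computed. A rough sketch, assuming the wmf-auto-restart log format shown above; `restart_attempts` and `gaps_ms` are hypothetical helpers and say nothing about how wmf-auto-restart itself is implemented.

```python
import re
from datetime import datetime, timedelta

# Captures the level, the Python-logging style timestamp, and the message, e.g.
# "... INFO: 2025-04-15 06:14:00,632 : Restarted service exim4"
EVENT_RE = re.compile(
    r"(?P<level>INFO|WARNING|ERROR): "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) : "
    r"(?P<msg>.*)$"
)


def restart_attempts(lines):
    """Return (timestamp, outcome) for each restart attempt, in log order."""
    attempts = []
    for line in lines:
        match = EVENT_RE.search(line.rstrip("\n"))
        if not match:
            continue
        msg = match.group("msg")
        if msg.startswith("Restarted service"):
            outcome = "ok"
        elif msg.startswith("Failed to restart service"):
            outcome = "failed"
        else:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        attempts.append((ts, outcome))
    return attempts


def gaps_ms(attempts):
    """Whole milliseconds between consecutive restart attempts."""
    return [
        (later[0] - earlier[0]) // timedelta(milliseconds=1)
        for earlier, later in zip(attempts, attempts[1:])
    ]
```

Going by the distinct timestamps in the excerpt above, that is five successful restarts roughly 200 to 250 ms apart, followed by one failure, all between 06:14:00,632 and 06:14:01,656.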

It eventually recovers without any intervention.

░░ The unit wmf_auto_restart_exim4.service has successfully entered the 'dead' state.
Apr 15 11:17:31 lists1004 systemd[1]: Finished wmf_auto_restart_exim4.service - Auto restart job: exim4.
░░ Subject: A start job for unit wmf_auto_restart_exim4.service has finished successfully

Dzahn renamed this task from SystemdUnitFailed - lists1004 to SystemdUnitFailed - lists1004 - wmf_auto_restart_exim4. Apr 15 2025, 8:48 PM

[snip]
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,286 : Restarted service exim4
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,286 : Detected necessary restart for service exim4 (2832)
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,286 : Service exim4 uses a legacy sysvinit script
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,286 : Consider using a systemd unit instead
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,534 : Restarted service exim4
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,534 : Detected necessary restart for service exim4 (2877)
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,534 : Service exim4 uses a legacy sysvinit script
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,535 : Consider using a systemd unit instead
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: ERROR: 2025-04-15 06:14:01,656 : Failed to restart service exim4:
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: ERROR: 2025-04-15 06:14:01,656 : b'Job for exim4.service failed.\nSee "systemctl status exim4.service" and "journalctl -xeu exim4.service"
Apr 15 06:14:01 lists1004 systemd[1]: wmf_auto_restart_exim4.service: Main process exited, code=exited, status=1/FAILURE
[snip]

I'd be inclined to think the sysv-init script is the root cause of this issue. We could try replacing the current init script with a native unit, something like:

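[Unit]
Description=Exim4 Mail Transport Agent
After=network.target
Requires=network.target

[Service]
Type=forking
PIDFile=/run/exim4/exim.pid
# Regenerate the exim4 configuration, then check that exim can parse it.
ExecStartPre=/usr/sbin/update-exim4.conf
ExecStartPre=/usr/sbin/exim4 -bV
# -q1m matches the queue-runner interval of the currently running daemon (see mainlog).
ExecStart=/usr/sbin/exim4 -bd -q1m
ExecReload=/bin/kill -HUP $MAINPID
# No ExecStop=: on stop, systemd by default sends SIGTERM to every process
# in the unit's control group, so left-over exim4 processes get cleaned up too.
Restart=on-failure
# No User=/Group=: the daemon has to start as root to bind port 25 and to run
# update-exim4.conf; exim drops privileges by itself afterwards.

[Install]
WantedBy=multi-user.target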

Delegating process termination to systemd will probably avoid running into this:

Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: INFO: 2025-04-15 06:14:01,286 : Detected necessary restart for service exim4 (2832)
Apr 15 06:14:01 lists1004 wmf-auto-restart[2911]: WARNING: 2025-04-15 06:14:01,286 : Service exim4 uses a legacy sysvinit script

wdyt?
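
Before swapping the unit, it may be worth confirming that exim4.service really is the wrapper that systemd generates from the init script. A small sketch, assuming that generated units report a `SourcePath` under /etc/init.d/ and a `FragmentPath` under a generator.late directory in `systemctl show`; the helper names are hypothetical.

```python
import subprocess


def parse_show_output(text):
    """Parse KEY=VALUE lines as printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def is_sysv_generated(props):
    """Heuristic (assumption): the unit was generated from a sysvinit script."""
    return (
        props.get("SourcePath", "").startswith("/etc/init.d/")
        or "/generator.late/" in props.get("FragmentPath", "")
    )


def show_unit(unit):
    """Run `systemctl show` for the two properties the heuristic looks at."""
    result = subprocess.run(
        ["systemctl", "show", "-p", "SourcePath", "-p", "FragmentPath", unit],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_show_output(result.stdout)
```

On the host, `is_sysv_generated(show_unit("exim4.service"))` would be expected to return True today and False once a native unit file is installed.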

I think that would be a good test, yeah. But then, we are also supposed to replace that exim with postfix anyways, right?

> we are also supposed to replace that exim with postfix anyways, right?

Agreed, we should consider T378021: Replace Exim on lists.wikimedia.org with Postfix as an alternative solution for this ticket!

ABran-WMF changed the task status from Open to Stalled. Jul 7 2025, 1:28 PM
ABran-WMF moved this task from Work in Progress to Backlog on the collaboration-services board.