
SystemdUnitFailed - contint1003 - envoyproxy
Closed, Resolved, Public

Description

Common information

  • alertname: SystemdUnitFailed
  • instance: contint1003:9100
  • name: wmf_auto_restart_envoyproxy.service
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: collaboration-services
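
The labels above point at port 9100, which is conventionally the Prometheus node_exporter port. As a rough sketch only, assuming the alert is derived from node_exporter's `node_systemd_unit_state` metric (the task itself does not confirm the alert rule), the failed units on a host could be listed from the exporter's text output like this. The function name `failed_units` is hypothetical.

```python
import re
from typing import List

# Matches one sample line of the node_exporter systemd collector, e.g.
# node_systemd_unit_state{name="foo.service",state="failed",type="simple"} 1
_SAMPLE_RE = re.compile(
    r'^node_systemd_unit_state\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)\s*$'
)
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def failed_units(metrics_text: str) -> List[str]:
    """Return unit names whose state="failed" sample has value 1.

    Hypothetical helper: it assumes the metrics text comes from a
    node_exporter /metrics endpoint with the systemd collector enabled.
    """
    failed = []
    for line in metrics_text.splitlines():
        match = _SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # comments, blank lines, and other metrics
        labels = dict(_LABEL_RE.findall(match.group("labels")))
        if labels.get("state") == "failed" and float(match.group("value")) == 1.0:
            failed.append(labels.get("name", ""))
    return failed
```

Fed the metrics text scraped from a host, the function returns only the units currently in the failed state, which is the condition this alert reports.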

Event Timeline

Dzahn renamed this task from SystemdUnitFailed to SystemdUnitFailed - contint1003 - envoyproxy. Wed, May 8, 10:16 PM
Dzahn claimed this task.

contint1003 is the releng test server from T358237.

The recently merged change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028796 adds an auto-restart service for envoyproxy.

envoyproxy is installed on the production contint servers but not on this test server.

However, because the auto-restart service is in class profile::ci::website, it is applied here as well, even though the unit it targets does not exist:

[contint1003:~] $ sudo systemctl status envoyproxy
Unit envoyproxy.service could not be found.

which in turn causes the auto-restart unit to fail:

[contint1003:~] $ sudo systemctl status wmf_auto_restart_envoyproxy
● wmf_auto_restart_envoyproxy.service - Auto restart job: envoyproxy
     Loaded: loaded (/lib/systemd/system/wmf_auto_restart_envoyproxy.service; static)
     Active: failed (Result: exit-code) since Wed 2024-05-08 19:44:03 UTC; 2h 39min ago
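
The failure above comes down to an auto-restart job acting on a unit that systemd does not know about. A minimal sketch of checking for that condition first, assuming the check is done via `systemctl show --property=LoadState`, which reports `not-found` for a missing unit. This is not how wmf_auto_restart is implemented; the helper names are hypothetical.

```python
import subprocess


def parse_load_state(show_output: str) -> str:
    """Extract the LoadState value from `systemctl show` output."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "LoadState":
            return value.strip()
    return "unknown"  # property missing from the output


def unit_is_loadable(show_output: str) -> bool:
    """True only when systemd found and loaded the unit file."""
    return parse_load_state(show_output) == "loaded"


def query_load_state(unit: str) -> str:
    """Ask systemd for the unit's LoadState (requires a systemd host)."""
    result = subprocess.run(
        ["systemctl", "show", "--property=LoadState", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_load_state(result.stdout)
```

On a host in the state shown above, `query_load_state("envoyproxy.service")` would be expected to return `not-found`, so a restart job could skip the unit instead of exiting non-zero.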

Change #1029295 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: include envoyproxy in ci_test role

https://gerrit.wikimedia.org/r/1029295

Change #1029295 merged by Dzahn:

[operations/puppet@production] ci: include envoyproxy in ci_test role

https://gerrit.wikimedia.org/r/1029295

Mentioned in SAL (#wikimedia-operations) [2024-05-08T22:53:03Z] <mutante> contint1003 - systemctl start wmf_auto_restart_envoyproxy T364510 T358237

22:54 < jinxer-wm> RESOLVED: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on contint1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed