SystemdUnitFailed - people1005 - envoyproxy
Closed, Resolved (Public)

Description

Common information

  • alertname: SystemdUnitFailed
  • instance: people1005:9100
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: collaboration-services

Firing alerts



Event Timeline

LSobanski renamed this task from SystemdUnitFailed to SystemdUnitFailed - people1005 - envoyproxy. (Sep 3 2025, 11:09 AM)
LSobanski assigned this task to Dzahn.

Change #1184553 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add peopleweb role to new peopleweb hosts again

https://gerrit.wikimedia.org/r/1184553

Change #1184553 merged by Dzahn:

[operations/puppet@production] site: add peopleweb role to new peopleweb hosts again

https://gerrit.wikimedia.org/r/1184553

"wmf_auto_restart_envoyproxy.service" failing is not a surprise: envoyproxy simply did not exist on the host, so there was nothing to restart. Now there is again.

But envoyproxy itself failing to start IS surprising, and is another thing to debug now. It should have just worked.

It turned out the /etc/envoy/envoy.yaml config file was simply empty.

I vaguely remember this happening to me in the past because of some race condition.

So I just ran rm -rf /etc/envoy and then ran puppet to let it recreate everything from scratch. It did: the file then had content, and envoy started.
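For next time, a quick check like the following could confirm the "empty config" case before digging further. This is only a minimal sketch: the helper name config_is_empty is hypothetical, the only path taken from this task is /etc/envoy/envoy.yaml, and treating a whitespace-only file as empty is an assumption on my part.

```python
from pathlib import Path


def config_is_empty(path: str) -> bool:
    """Return True if the file is missing, zero bytes, or only whitespace.

    Hypothetical helper for illustration; not part of puppet or envoy.
    """
    p = Path(path)
    if not p.is_file():
        return True
    # Treat a whitespace-only file like an empty one (assumption).
    return p.read_text(encoding="utf-8", errors="replace").strip() == ""


if __name__ == "__main__":
    # Path as named in this task.
    path = "/etc/envoy/envoy.yaml"
    if config_is_empty(path):
        print(f"{path} is missing or empty")
    else:
        print(f"{path} has content")
```

If it reports the file as missing or empty, removing /etc/envoy and re-running puppet, as described above, is what fixed it here.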

After this there was another issue with the apache restart on people2004: an error that the CAS LoginUrl was not configured. But the values WERE in the config files, and somehow after a couple of restarts the issue was gone.