
SystemdUnitFailed (planet and gitlab)
Closed, Resolved, Public

Description

Common information

  • alertname: SystemdUnitFailed
  • prometheus: ops
  • severity: critical
  • source: prometheus
  • team: collaboration-services

Firing alerts







Event Timeline

Jelto renamed this task from SystemdUnitFailed to SystemdUnitFailed (planet and gitlab). Mar 8 2024, 9:31 AM
Jelto triaged this task as Medium priority.
Jelto subscribed.

GitLab alerts are expected because of the version upgrade, but we probably need some kind of silence or workaround so that we don't get 24h of alerts for it.

For planet, the exporter fails to start; see T359556#9614359.

GitLab restore errors are resolved by manually running sudo systemctl start backup-restore.service. I think the delay between syncing the backup and restoring it is too short when there are multiple backups for one day (as happens when we've done an upgrade), so the older backup, which has the wrong version, gets used. I'm not sure whether we need a follow-up for that; it's probably low priority. Just waiting a full 24h would also resolve this automatically.
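The failure mode described above (an older same-day backup being picked up) can be illustrated with a minimal sketch. It is not the actual restore script. It assumes backup archives follow GitLab's usual naming of an epoch timestamp, a date, and a version ahead of `_gitlab_backup.tar`. The function name `newest_backup`, the `expected_version` parameter, and the file names in the usage note are hypothetical.

```python
import re
from typing import Iterable, Optional

# Assumed archive name layout (not taken from the task):
#   <epoch>_<YYYY>_<MM>_<DD>_<version>[-ce|-ee]_gitlab_backup.tar
BACKUP_RE = re.compile(
    r"^(?P<epoch>\d+)_\d{4}_\d{2}_\d{2}_"
    r"(?P<version>\d+\.\d+\.\d+)(?:-(?:ce|ee))?_gitlab_backup\.tar$"
)


def newest_backup(
    filenames: Iterable[str], expected_version: Optional[str] = None
) -> Optional[str]:
    """Return the backup with the highest epoch prefix.

    If expected_version is given, only backups made with that GitLab
    version are considered. Names that do not match the assumed layout
    are ignored. Returns None when nothing qualifies.
    """
    candidates = []
    for name in filenames:
        match = BACKUP_RE.match(name)
        if match is None:
            continue
        if expected_version is not None and match.group("version") != expected_version:
            continue
        candidates.append((int(match.group("epoch")), name))
    if not candidates:
        return None
    # Compare by the numeric epoch so the most recent archive wins.
    return max(candidates)[1]
```

With two hypothetical same-day archives, `1709856000_2024_03_08_16.8.2_gitlab_backup.tar` and `1709884800_2024_03_08_16.8.3_gitlab_backup.tar`, the function returns the second one, which is the backup taken after the upgrade. Passing `expected_version` restricts the choice to archives that match the version installed on the restore host.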

Change 1009775 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] prometheus/apache_exporter: fix argument syntax in bookworm

https://gerrit.wikimedia.org/r/1009775

Mentioned in SAL (#wikimedia-operations) [2024-03-08T20:46:47Z] <mutante> planet1003/2003: apt-get remove prometheus-apache-exporter - T359596

Jelto claimed this task.

This was due to work in T359556 and a GitLab upgrade, two unrelated work items. I'm closing the task.

Change #1009775 merged by Dzahn:

[operations/puppet@production] prometheus/apache_exporter: drop argument parameter

https://gerrit.wikimedia.org/r/1009775