$ systemctl status firstname.lastname@example.org
● email@example.com - prometheus server (instance tools)
   Loaded: loaded (/firstname.lastname@example.org; static)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2018-06-22 19:02:37 UTC; 1s ago
  Process: 30784 ExecStart=/usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/tools/metrics -web.listen-address 127.0.0.1:9902 -web.external-url https://tools-prometheus.wmflabs.org/tools -storage.local.retention 730h0m0s -config.file /srv/prometheus/tools/prometheus.yml -storage.local.chunk-encoding-version 2 (code=exited, status=1/FAILURE)
 Main PID: 30784 (code=exited, status=1/FAILURE)
The issue appears to be that it failed suddenly at some point, and a lockfile at
$ sudo ls -l /srv/prometheus/tools/metrics/DIRTY
-rw-r--r-- 1 14736 wikidev 0 Jan 31 15:49 /srv/prometheus/tools/metrics/DIRTY
apparently wasn't cleaned up. Now it is stuck in a restart loop trying to lock that file. I wonder if this died during the recent network maintenance, or is the problem that old? I'm not 100% sure of the safest way to ensure this comes back up in a consistent state. Removing the lock file should allow the service to start, but...
Apparently, the file's existence shouldn't matter (at least in current Prometheus). It should be able to lock the file whether or not it already exists, but it cannot: https://gitlab.cncf.ci/prometheus/prometheus/blob/934d86b936f3e6e35581c44e6ec22236f4c69de3/util/flock/flock.go#L31
Dude! The answer was right in front of me. On a *nix system, Prometheus has to open the file before it can lock it, and that open is what fails: https://gitlab.cncf.ci/prometheus/prometheus/blob/934d86b936f3e6e35581c44e6ec22236f4c69de3/util/flock/flock_unix.go#L43
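For anyone hitting the same thing, here is a rough Python sketch of the open-then-lock pattern, not the actual Prometheus code (that is Go, at the link above). It assumes the file is opened read-write with create and then locked with a non-blocking exclusive flock; the function names are made up for illustration. The point is that there are two separate failure modes: the open can fail on permissions before any locking is attempted, and the flock only fails if someone else really holds the lock.

```python
import fcntl
import os


def acquire_lock(path):
    """Open `path` read-write (creating it if needed), then take an
    exclusive, non-blocking flock on it. Returns the open file descriptor.

    Assumption: this mirrors the general open-then-flock pattern, not the
    exact flags used by any particular Prometheus version.
    """
    # Step 1: the open. If the file already exists and the calling user is
    # not allowed to write to it, this raises PermissionError, and no lock
    # is ever attempted.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Step 2: the lock. This raises BlockingIOError only when another
        # open file description already holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd


def release_lock(fd):
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A stale, unlocked file left behind by a crash is harmless under this pattern as long as the service user can still open it, which matches the observation that the file's mere existence shouldn't matter.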
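As the ls output above shows, the file is still owned by the LDAP user that we deleted (hence the bare UID 14736 instead of a name), so the local prometheus user cannot open it. Chowning it to the local prometheus user fixed it: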
$ systemctl status email@example.com
● firstname.lastname@example.org - prometheus server (instance tools)
   Loaded: loaded (/email@example.com; static)
   Active: active (running) since Fri 2018-06-22 19:44:36 UTC; 7min ago
 Main PID: 12729 (prometheus)
   CGroup: /firstname.lastname@example.org
           └─12729 /usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/tools/metrics -web.listen-a.
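In case it helps with spotting leftovers like this on other hosts, below is a small Python sketch of the two checks that would have caught it: whether a file's owner UID still resolves to a user, and whether a given user could open the file read-write under the classic Unix mode bits. The helper names are hypothetical, and the permission check deliberately ignores root, ACLs and capabilities, so treat it as a first-pass hint only.

```python
import os
import pwd
import stat


def describe_owner(path):
    """Return (uid, user name or None). None means the UID no longer maps
    to any user, which `ls -l` shows as a bare number such as 14736."""
    uid = os.stat(path).st_uid
    try:
        name = pwd.getpwuid(uid).pw_name
    except KeyError:
        name = None
    return uid, name


def can_open_read_write(mode, owner_uid, owner_gid, uid, gids):
    """Classic Unix permission check for a read-write open.

    Simplification: ignores root, ACLs and capabilities. Exactly one
    permission class applies: owner, else group, else other.
    """
    if uid == owner_uid:
        need = stat.S_IRUSR | stat.S_IWUSR
    elif owner_gid in gids:
        need = stat.S_IRGRP | stat.S_IWGRP
    else:
        need = stat.S_IROTH | stat.S_IWOTH
    return mode & need == need
```

With mode 0644 and owner UID 14736, any other non-root user falls into the "other" class and gets read access only, so a read-write open is refused until the file is chowned.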