It seems https://tools-prometheus.wmflabs.org/tools is non-functional, responding with 503 errors for any request.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T53434 Establish an internal system or a recommended external system for monitoring user-created Toolforge web services
Resolved | | Bstorm | T197977 https://tools-prometheus.wmflabs.org/tools responds with 503
Event Timeline
Adding the observability tag to help with classification, and maybe to flag this for someone who knows whether we have a runbook for debugging Prometheus.

    $ systemctl status prometheus@tools.service
    ● prometheus@tools.service - prometheus server (instance tools)
       Loaded: loaded (/lib/systemd/system/prometheus@tools.service; static)
       Active: activating (auto-restart) (Result: exit-code) since Fri 2018-06-22 19:02:37 UTC; 1s ago
      Process: 30784 ExecStart=/usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/tools/metrics -web.listen-address 127.0.0.1:9902 -web.external-url https://tools-prometheus.wmflabs.org/tools -storage.local.retention 730h0m0s -config.file /srv/prometheus/tools/prometheus.yml -storage.local.chunk-encoding-version 2 (code=exited, status=1/FAILURE)
     Main PID: 30784 (code=exited, status=1/FAILURE)

The issue appears to be that it failed suddenly at some point, and a lockfile at

    $ sudo ls -l /srv/prometheus/tools/metrics/DIRTY
    -rw-r--r-- 1 14736 wikidev 0 Jan 31 15:49 /srv/prometheus/tools/metrics/DIRTY

apparently wasn't cleaned up. Now it is stuck in a restart loop trying to lock that file. I wonder if this died during the recent network maintenance, or is the problem that old? I'm not 100% sure of the safest way to ensure this comes back up in a consistent state. Removing the lock file should allow the service to start, but...
Apparently, the file's existence shouldn't matter (at least in current Prometheus). It should be able to lock the file, but it cannot: https://gitlab.cncf.ci/prometheus/prometheus/blob/934d86b936f3e6e35581c44e6ec22236f4c69de3/util/flock/flock.go#L31
Dude! The answer was right in front of me. On a *nix system, Prometheus has to open the file (read-write, as far as I can tell) before it can lock it: https://gitlab.cncf.ci/prometheus/prometheus/blob/934d86b936f3e6e35581c44e6ec22236f4c69de3/util/flock/flock_unix.go#L43
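To illustrate why a leftover lock file is harmless on its own, here is a minimal Python sketch of the general open-then-flock pattern. It assumes the lock is taken by opening the file read-write (creating it if missing) and then requesting an exclusive, non-blocking `flock(2)`; that is my reading of the linked code, not something confirmed in this task. The function name `acquire_lock` is hypothetical.

```python
import fcntl
import os


def acquire_lock(path):
    """Open `path` read-write (creating it if missing) and take an
    exclusive, non-blocking advisory lock on it.

    Returns the open file descriptor; closing it releases the lock.
    Raises PermissionError if the file cannot be opened for writing,
    and BlockingIOError if another open file already holds the lock.
    """
    # Assumption: the real lock helper opens the file in a similar mode.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Do not leak the descriptor if the lock is already held.
        os.close(fd)
        raise
    return fd
```

Under this pattern a stale, unlocked file can be locked again without trouble, so the failure has to come from the `open` step, for example when the process is not allowed to write to the file.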
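In the listing above, the file is owned by UID 14736, the LDAP user that we deleted, which is why ls prints a bare number instead of a name. With mode rw-r--r--, only that owner can write to it. Chowning it to the local prometheus user.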
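The ownership problem can be checked by hand with the classic Unix owner/group/other rule. The sketch below is a simplified model (it ignores root, ACLs and capabilities), and the numeric IDs used for the prometheus user and the wikidev group in the usage example are made up for illustration; only UID 14736 and the mode come from the listing above.

```python
import stat


def may_open_read_write(mode, file_uid, file_gid, proc_uid, proc_gids):
    """Return True if a process with `proc_uid` and `proc_gids` may open
    a file with the given mode and ownership for reading and writing.

    Simplified model: exactly one permission class applies (owner, then
    group, then other); root, ACLs and capabilities are ignored.
    """
    if proc_uid == file_uid:
        need = stat.S_IRUSR | stat.S_IWUSR
    elif file_gid in proc_gids:
        need = stat.S_IRGRP | stat.S_IWGRP
    else:
        need = stat.S_IROTH | stat.S_IWOTH
    return (mode & need) == need


# Hypothetical IDs: 112 for the local prometheus user, 500 for wikidev.
before_chown = may_open_read_write(0o644, 14736, 500, 112, {112})
after_chown = may_open_read_write(0o644, 112, 500, 112, {112})
```

With these inputs `before_chown` is `False` because the process falls into the "other" class, which only has read access, and `after_chown` is `True` because the process is now the owner.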

    $ systemctl status prometheus@tools.service
    ● prometheus@tools.service - prometheus server (instance tools)
       Loaded: loaded (/lib/systemd/system/prometheus@tools.service; static)
       Active: active (running) since Fri 2018-06-22 19:44:36 UTC; 7min ago
     Main PID: 12729 (prometheus)
       CGroup: /system.slice/system-prometheus.slice/prometheus@tools.service
               └─12729 /usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/tools/metrics -web.listen-a.