https://tools-prometheus.wmflabs.org/tools responds with 503
Closed, Resolved (Public)

Description

It seems https://tools-prometheus.wmflabs.org/tools is non-functional, responding with 503 errors for any request.

Event Timeline

bd808 subscribed.

Adding observability to help with classification and maybe to flag this for someone who can help figure out if we have a runbook for debugging Prometheus.

$ systemctl status prometheus@tools.service
● prometheus@tools.service - prometheus server (instance tools)
   Loaded: loaded (/lib/systemd/system/prometheus@tools.service; static)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2018-06-22 19:02:37 UTC; 1s ago
  Process: 30784 ExecStart=/usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/tools/metrics -web.listen-address 127.0.0.1:9902 -web.external-url https://tools-prometheus.wmflabs.org/tools -storage.local.retention 730h0m0s -config.file /srv/prometheus/tools/prometheus.yml -storage.local.chunk-encoding-version 2 (code=exited, status=1/FAILURE)
 Main PID: 30784 (code=exited, status=1/FAILURE)

The issue appears to be that it failed suddenly at some point, and the lock file shown below apparently wasn't cleaned up:

$ sudo ls -l /srv/prometheus/tools/metrics/DIRTY
-rw-r--r-- 1 14736 wikidev 0 Jan 31 15:49 /srv/prometheus/tools/metrics/DIRTY

Now it is stuck in a restart loop trying to lock that file. I wonder if this died during the recent network maintenance, or is the problem as old as that Jan 31 timestamp? I'm not 100% sure of the safest way to ensure this comes back up in a consistent state. Removing the lock file should allow the service to start, but...

Apparently, the file's existence shouldn't matter (at least in current Prometheus). It should be able to lock the file, but it cannot: https://gitlab.cncf.ci/prometheus/prometheus/blob/934d86b936f3e6e35581c44e6ec22236f4c69de3/util/flock/flock.go#L31

Dude! The answer was right in front of me. On a *nix system, prometheus tries to open the file for reading: https://gitlab.cncf.ci/prometheus/prometheus/blob/934d86b936f3e6e35581c44e6ec22236f4c69de3/util/flock/flock_unix.go#L43
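For anyone who wants to reproduce the failure by hand, here is a rough shell analogue of that open-then-lock sequence. It is a sketch, not Prometheus's code: the function name `try_lock` is made up, it relies on the `flock` utility from util-linux, and it opens the file read-write, which is an assumption here (the listing shows mode rw-r--r--, which would still let a non-owner open the file read-only).

```shell
# Hypothetical helper: try to open a file read-write and take a
# non-blocking exclusive lock on it, roughly what a lock-file check does.
# Returns 0 if both steps succeed, non-zero otherwise.
try_lock() {
    (
        # "<>" opens the file for reading and writing (creating it if
        # missing); this fails if the caller lacks write permission.
        exec 9<>"$1" &&
        # -n: fail immediately instead of waiting if the lock is held.
        flock -n 9
    ) 2>/dev/null
    # The lock is dropped when the subshell exits and fd 9 is closed.
}
```

Running it as the service user, for example `sudo -u prometheus sh -c '. ./try_lock.sh; try_lock /srv/prometheus/tools/metrics/DIRTY; echo $?'` (the script file name and the `prometheus` account name are assumptions), should show whether the open or the lock is what fails.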

As the listing above shows, the file is owned by the UID (14736) of an LDAP user that we deleted, so the local prometheus user cannot write to it. Chowning it to the local prometheus user.
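A quick check for this kind of orphaned ownership, as a minimal sketch: it assumes GNU coreutils `stat` and `getent`, and the helper name `owner_exists` is hypothetical. The account name `prometheus` in the commented usage line is also an assumption based on "the local prometheus user" above.

```shell
# Hypothetical helper: succeeds if the file's owning UID still maps to a
# known user, fails if the UID is orphaned (or the file cannot be stat'ed).
owner_exists() {
    # %u prints the numeric owner UID (GNU stat).
    uid=$(stat -c '%u' "$1") || return 2
    # getent looks the UID up in all configured sources (files, LDAP, ...).
    getent passwd "$uid" >/dev/null
}

# Example use (not run here; needs root and the real path):
# owner_exists /srv/prometheus/tools/metrics/DIRTY ||
#     sudo chown prometheus /srv/prometheus/tools/metrics/DIRTY
```

An orphaned owner also shows up in `ls -l` as a bare number in the owner column, as it does in the listing earlier in this task.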

$ systemctl status prometheus@tools.service
● prometheus@tools.service - prometheus server (instance tools)
   Loaded: loaded (/lib/systemd/system/prometheus@tools.service; static)
   Active: active (running) since Fri 2018-06-22 19:44:36 UTC; 7min ago
 Main PID: 12729 (prometheus)
   CGroup: /system.slice/system-prometheus.slice/prometheus@tools.service
           └─12729 /usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/tools/metrics -web.listen-a.
Bstorm claimed this task.
Vvjjkkii renamed this task from https://tools-prometheus.wmflabs.org/tools responds with 503 to yfaaaaaaaa. Jul 1 2018, 1:02 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Bstorm as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Matthewrbowker renamed this task from yfaaaaaaaa to https://tools-prometheus.wmflabs.org/tools responds with 503. Jul 1 2018, 1:31 AM
Matthewrbowker closed this task as Resolved.
Matthewrbowker assigned this task to Bstorm.
Matthewrbowker raised the priority of this task from High to Needs Triage.
Matthewrbowker updated the task description.
Matthewrbowker added a subscriber: Aklapper.