
prometheus-blazegraph-exporter failing to start after reboot
Closed, Resolved · Public

Description

While doing reboots for a kernel upgrade, I noticed that prometheus-blazegraph-exporter fails to start after a reboot. A manual restart once the other services are up works.

There isn't a dependency between the exporter and blazegraph itself, but that should not be an issue. The exporter should start even without blazegraph being present.

I'll take a deeper look once the restarts are done.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Extract from the logs:

Jan 08 14:20:08 wdqs2001 systemd[1]: Started Prometheus Blazegraph Exporter.
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: Traceback (most recent call last):
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/bin/prometheus-blazegraph-exporter", line 167, in <module>
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: sys.exit(main())
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/bin/prometheus-blazegraph-exporter", line 156, in main
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: REGISTRY.register(PrometheusBlazeGraphExporter())
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/lib/python2.7/dist-packages/prometheus_client/core.py", line 50, in register
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: names = self._get_names(collector)
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/lib/python2.7/dist-packages/prometheus_client/core.py", line 86, in _get_names
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: for metric in desc_func():
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/bin/prometheus-blazegraph-exporter", line 100, in collect
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: metric_value = self.get_counter(metric_name)
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/bin/prometheus-blazegraph-exporter", line 57, in get_counter
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: response = urllib2.urlopen(req)
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: return opener.open(url, data, timeout)
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/lib/python2.7/urllib2.py", line 431, in open
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: response = self._open(req, data)
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/lib/python2.7/urllib2.py", line 449, in _open
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: '_open', req)
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: result = func(*args)
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: return self.do_open(httplib.HTTPConnection, req)
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: File "/usr/lib/python2.7/urllib2.py", line 1197, in do_open
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: raise URLError(err)
Jan 08 14:20:08 wdqs2001 prometheus-blazegraph-exporter[2001]: urllib2.URLError: <urlopen error [Errno 111] Connection refused>
Jan 08 14:20:08 wdqs2001 systemd[1]: prometheus-blazegraph-exporter.service: main process exited, code=exited, status=1/FAILURE
Jan 08 14:20:08 wdqs2001 systemd[1]: Unit prometheus-blazegraph-exporter.service entered failed state.
Jan 08 14:20:08 wdqs2001 systemd[1]: prometheus-blazegraph-exporter.service holdoff time over, scheduling restart.
Jan 08 14:20:08 wdqs2001 systemd[1]: Stopping Prometheus Blazegraph Exporter...
Jan 08 14:20:08 wdqs2001 systemd[1]: Starting Prometheus Blazegraph Exporter...
Jan 08 14:20:08 wdqs2001 systemd[1]: prometheus-blazegraph-exporter.service start request repeated too quickly, refusing to start.
Jan 08 14:20:08 wdqs2001 systemd[1]: Failed to start Prometheus Blazegraph Exporter.
Jan 08 14:20:08 wdqs2001 systemd[1]: Unit prometheus-blazegraph-exporter.service entered failed state.

It looks like the exporter already queries Blazegraph at startup, when the collector is registered, instead of waiting for an incoming HTTP request to the exporter.
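To illustrate the call path in the traceback (register, then `_get_names`, then `collect`), here is a minimal stand-in sketch. `MiniRegistry` and `FailingCollector` are hypothetical names made up for this example; this is not the prometheus_client code, only a simplified model of the behaviour the logs suggest.

```python
class MiniRegistry(object):
    """Simplified stand-in for a metrics registry (hypothetical)."""

    def __init__(self):
        self.collectors = []

    def register(self, collector):
        # As in the traceback, registration calls collect() to learn the
        # metric names, so any I/O done in collect() happens at startup.
        names = [name for name, _value in collector.collect()]
        self.collectors.append(collector)
        return names


class FailingCollector(object):
    """Collector whose backend is unreachable, as right after a reboot."""

    def collect(self):
        # Simulates the HTTP request to the backend being refused.
        raise ConnectionRefusedError(111, "Connection refused")


if __name__ == "__main__":
    registry = MiniRegistry()
    try:
        registry.register(FailingCollector())
    except ConnectionRefusedError as exc:
        print("startup failed during register():", exc)
```

In this model the exception escapes from `register()`, so the process would exit before it ever serves a scrape request, which matches the `status=1/FAILURE` seen above.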

That's a bug in the systemd unit of prometheus-blazegraph-exporter: it needs to start after Blazegraph, but the current version doesn't declare that ordering, so systemd tries to start it when multi-user.target is reached. I can fix it some time this week.

Change 403915 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/debs/prometheus-blazegraph-exporter@master] debian: start after blazegraph

https://gerrit.wikimedia.org/r/403915

Change 403933 had a related patch set uploaded (by Gehel; owner: Guillaume Lederrey):
[operations/debs/prometheus-blazegraph-exporter@master] prometheus blazegraph exporter should not fail when blazegraph is down

https://gerrit.wikimedia.org/r/403933
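As a rough idea of what "should not fail when blazegraph is down" can look like, here is a minimal sketch. It is not the merged patch: it assumes Python 3 (`urllib.request` rather than the `urllib2` seen in the traceback), and the function name `fetch_counter` and its `timeout` parameter are made up for illustration.

```python
import logging
import urllib.request

log = logging.getLogger(__name__)


def fetch_counter(url, timeout=5.0):
    """Return the response body as text, or None if the backend is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except OSError as exc:
        # urllib.error.URLError subclasses OSError, so this covers
        # "Connection refused" as well as socket timeouts.
        log.warning("Backend unreachable at %s: %s", url, exc)
        return None
```

A caller would then skip the metric (or report it as missing) when the function returns `None`, so the exporter process stays up and recovers on the next scrape once Blazegraph is reachable.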

Change 403915 merged by Filippo Giunchedi:
[operations/debs/prometheus-blazegraph-exporter@master] debian: start after blazegraph

https://gerrit.wikimedia.org/r/403915

Change 403933 merged by Gehel:
[operations/debs/prometheus-blazegraph-exporter@master] prometheus blazegraph exporter should not fail when blazegraph is down

https://gerrit.wikimedia.org/r/403933

Change 404316 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/debs/prometheus-blazegraph-exporter@master] Don't depend on blazegraph

https://gerrit.wikimedia.org/r/404316

Change 404316 merged by Filippo Giunchedi:
[operations/debs/prometheus-blazegraph-exporter@master] Don't depend on blazegraph

https://gerrit.wikimedia.org/r/404316

fgiunchedi claimed this task.
fgiunchedi subscribed.

Done, fix deployed

238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".