Investigate/setup prometheus blackbox_exporter
Open, Low · Public

Description

While looking at how to get Smokeping data into Prometheus/Grafana, I found that the Prometheus blackbox_exporter could potentially replace Smokeping in a more efficient way:

  • Better event correlation (e.g. network latency/loss and application errors can be shown on the same dashboard)
  • Centralized data (removes the need for yet another tool)
  • Time series database (instead of RRD files)
  • Distributed (can run on any server in a P2P way, which is possible with Smokeping but more complex)
  • Easier configuration

On the points to be researched more:

Event Timeline

ayounsi created this task. · Jul 6 2017, 9:51 AM
Restricted Application added a subscriber: Aklapper. · Jul 6 2017, 9:51 AM
ema added a subscriber: ema. · Jul 6 2017, 9:54 AM

Sounds like a good idea! re: the research questions:

  • Alerting can be done through Grafana alerts directly or via Icinga and check_prometheus. IMHO the latter would be preferable, so as not to add yet another system to the alerting pipeline.
  • Frequency depends on the scrape interval we set in Prometheus for that particular job. I think we can do 15s for this use case, yes; we'll have to try!
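
To illustrate the Icinga option, here is a minimal Python sketch of what a check_prometheus-style check could do: build an instant query for `probe_success` against the Prometheus HTTP API and map the result to a Nagios/Icinga exit code. This is not the actual check_prometheus plugin; the server URL, the `blackbox` job name and the instance names are assumptions made up for the example.

```python
import json
from urllib.parse import urlencode

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def build_query_url(base_url, promql):
    """Return the instant-query URL (/api/v1/query) for a Prometheus server."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def evaluate(response_body):
    """Map an instant-query response body to (exit_code, message).

    Any series whose probe_success sample is not 1 makes the check CRITICAL.
    """
    try:
        payload = json.loads(response_body)
    except ValueError:
        return UNKNOWN, "response is not valid JSON"
    if not isinstance(payload, dict) or payload.get("status") != "success":
        return UNKNOWN, "query failed"
    results = payload.get("data", {}).get("result", [])
    if not results:
        return UNKNOWN, "no probe_success series returned"
    failed = sorted(
        r["metric"].get("instance", "?")
        for r in results
        if float(r["value"][1]) != 1.0
    )
    if failed:
        return CRITICAL, "probes failing: " + ", ".join(failed)
    return OK, f"{len(results)} probes succeeding"


if __name__ == "__main__":
    # Hypothetical server and job name, for illustration only.
    print(build_query_url("http://prometheus.example:9090", 'probe_success{job="blackbox"}'))
```

A real check would fetch that URL, pass the body to `evaluate`, print the message and exit with the returned code. How often the underlying data changes still depends on the job's scrape interval (e.g. the 15s mentioned above).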

We already have some scaffolding for blackbox_exporter in Puppet (for tools), but it will need some puppet-love. The Debian package is relatively straightforward.

Change 365239 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: use blackbox-exporter package from Debian

https://gerrit.wikimedia.org/r/365239

Change 365240 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: additional blackbox checks

https://gerrit.wikimedia.org/r/365240

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. · Jul 21 2017, 10:57 AM

Change 365239 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: use blackbox-exporter package from Debian

https://gerrit.wikimedia.org/r/365239

Change 365240 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: additional blackbox checks

https://gerrit.wikimedia.org/r/365240

faidon moved this task from Inbox to In progress on the observability board. · Aug 21 2017, 3:18 PM

Change 373062 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add blackbox configuration for prometheus::ops

https://gerrit.wikimedia.org/r/373062

Change 373062 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add blackbox configuration for prometheus::ops

https://gerrit.wikimedia.org/r/373062

Mentioned in SAL (#wikimedia-operations) [2017-08-23T09:08:03Z] <godog> upload prometheus-blackbox-exporter 0.7.0+ds1-1~wmf1 to jessie-wikimedia, backported - T169860

Change 373261 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: add ssh blackbox probes for bastions

https://gerrit.wikimedia.org/r/373261

Change 373261 merged by Filippo Giunchedi:
[operations/puppet@production] role: add ssh blackbox probes for bastions

https://gerrit.wikimedia.org/r/373261

Change 373266 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: use port 22 for ssh probing

https://gerrit.wikimedia.org/r/373266

Change 373266 merged by Filippo Giunchedi:
[operations/puppet@production] role: use port 22 for ssh probing

https://gerrit.wikimedia.org/r/373266

Change 373280 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: collect blackbox_exporter metrics in Prometheus global

https://gerrit.wikimedia.org/r/373280

Change 373280 merged by Filippo Giunchedi:
[operations/puppet@production] role: collect blackbox_exporter metrics in Prometheus global

https://gerrit.wikimedia.org/r/373280

fgiunchedi added a comment. · Edited · Aug 23 2017, 1:32 PM

I've put a sample dashboard at https://grafana.wikimedia.org/dashboard/db/network-probes showing for a given "target" (i.e. a bastion at the moment) its maximum latency from all sites and the number of times the probe has flapped.
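
As a rough illustration of the two numbers on that dashboard, the Python sketch below computes the maximum latency across sites and the flap count for one target. The site names and sample values are made up; the real panels are presumably built from PromQL over the exporter's `probe_duration_seconds` and `probe_success` metrics (something like `max(...)` and `changes(...[1h])`), but the actual dashboard queries are not shown in this task.

```python
def max_latency(latency_by_site):
    """Worst-case probe latency in seconds for one target across all sites."""
    return max(latency_by_site.values())


def count_flaps(success_samples):
    """Count how many times consecutive probe results differ (0 <-> 1)."""
    return sum(
        1
        for prev, cur in zip(success_samples, success_samples[1:])
        if prev != cur
    )


if __name__ == "__main__":
    # Hypothetical samples for a single target, for illustration only.
    latencies = {"site-a": 0.012, "site-b": 0.087, "site-c": 0.140}
    successes = [1, 1, 0, 0, 1, 1, 0]
    print(max_latency(latencies), count_flaps(successes))
```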

ATM a check for the SSH banner is performed, IOW a full TCP connection. ICMP probing requires CAP_NET_RAW, which isn't configurable in the package yet (reported as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=872997).
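
For reference, this is roughly what an SSH banner check over a full TCP connection amounts to, as a minimal Python sketch. It is not blackbox_exporter's implementation: the function name, the default timeout and the "first line must start with `SSH-`" rule are assumptions for illustration (a server may legitimately send other lines before its version string, which this sketch ignores).

```python
import socket
import time


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Connect over TCP, read the first line and check it looks like an SSH banner.

    Returns (success, duration_seconds, banner).
    """
    start = time.monotonic()
    try:
        # create_connection applies the timeout to the connect and to later reads.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with sock.makefile("rb") as stream:
                line = stream.readline()
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False, time.monotonic() - start, ""
    banner = line.decode("ascii", "replace").strip()
    return banner.startswith("SSH-"), time.monotonic() - start, banner
```

Unlike ICMP probing, this needs no special capabilities because it only uses an ordinary TCP socket; the measured duration covers the TCP handshake plus the time until the server sends its banner.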

The scaffolding for blackbox testing is now in place WRT software and Puppet; I'll kick the ball over to @ayounsi to tweak things further.

fgiunchedi moved this task from Doing to Radar on the User-fgiunchedi board. · Aug 24 2017, 3:41 PM
fgiunchedi added a comment. · Edited · Mar 5 2019, 9:12 AM

Potentially useful too: https://bitbucket.org/Svedrin/meshping and https://github.com/SuperQ/smokeping_prober (just came across them; haven't looked at the code or tried them)

Had a quick look: the main limitation compared to blackbox_exporter is that meshping only supports pings, while Smokeping and bb_exporter support DNS, TCP, etc.