
Replace smokeping with a Prometheus-based solution
Open, Medium, Public

Description

Looking at how to get smokeping data in prometheus/grafana I found that the Prometheus blackbox_exporter could potentially replace Smokeping in a more efficient way:

  • better event correlation (eg. network latency/loss and application errors on the same dashboard)
  • centralized data (removes the need for yet another tool)
  • time series database (instead of RRD files)
  • distributed (can run on any server in a P2P way, which is possible with Smokeping but more complex)
  • easier configuration

Some points still need to be researched more.

Experimental/PoC dashboard: https://grafana.wikimedia.org/d/CbNAwAXnk/filippo-blackbox-smoke-icmp
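For context, blackbox_exporter's ICMP probing is driven by a small module definition in its configuration file; a minimal sketch (module name and timeout are illustrative, not taken from the actual deployment):

```yaml
# blackbox.yml fragment -- module name "icmp" is illustrative
modules:
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4   # probe over IPv4; "ip6" for IPv6
```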

Details

Related patches (project, branch, lines +/-):

  • operations/puppet (production): +4 -0
  • operations/puppet (production): +0 -130
  • operations/puppet (production): +123 -2
  • operations/puppet (production): +64 -0
  • operations/puppet (production): +0 -14
  • operations/alerts (master): +34 -0
  • operations/puppet (production): +45 -0
  • operations/puppet (production): +18 -0
  • operations/puppet (production): +0 -49
  • operations/puppet (production): +21 -0
  • operations/puppet (production): +0 -87
  • operations/alerts (master): +43 -0
  • operations/puppet (production): +83 -3
  • operations/puppet (production): +513 -101
  • operations/puppet (production): +2 -0
  • operations/puppet (production): +3 -1
  • operations/puppet (production): +26 -1
  • operations/puppet (production): +93 -1
  • operations/puppet (production): +37 -0
  • operations/puppet (production): +6 -13

Event Timeline

There are a very large number of changes, so older changes are hidden.

Sounds like a good idea! re: the research questions:

  • Alerting we can do through Grafana alerts itself or via Icinga and check_prometheus; IMHO the latter would be preferable, to avoid adding yet another system to the alerting pipeline
  • Frequency depends on how often we set Prometheus to scrape that particular job; I think for this use case we can do 15s, yes, we'll have to try!

We already have some scaffolding for blackbox_exporter in puppet (for tools), but it will need some puppet-love; the Debian package is relatively straightforward.
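Scaffolding aside, the usual way to wire blackbox_exporter into a Prometheus job looks roughly like this (job name, target, and exporter address are placeholders, and the 15s interval is the frequency discussed above):

```yaml
# prometheus.yml fragment -- target and exporter address are placeholders
scrape_configs:
  - job_name: blackbox_icmp
    metrics_path: /probe
    scrape_interval: 15s          # the ~15s probe frequency discussed above
    params:
      module: [icmp]              # module defined in blackbox.yml
    static_configs:
      - targets:
          - example-host.wikimedia.org   # hypothetical probe target
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # pass the target as ?target=
      - source_labels: [__param_target]
        target_label: instance           # keep the probed host as "instance"
      - target_label: __address__
        replacement: 127.0.0.1:9115      # scrape the local blackbox_exporter
```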

Change 365239 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: use blackbox-exporter package from Debian

https://gerrit.wikimedia.org/r/365239

Change 365240 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: additional blackbox checks

https://gerrit.wikimedia.org/r/365240

Change 365239 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: use blackbox-exporter package from Debian

https://gerrit.wikimedia.org/r/365239

Change 365240 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: additional blackbox checks

https://gerrit.wikimedia.org/r/365240

Change 373062 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add blackbox configuration for prometheus::ops

https://gerrit.wikimedia.org/r/373062

Change 373062 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add blackbox configuration for prometheus::ops

https://gerrit.wikimedia.org/r/373062

Mentioned in SAL (#wikimedia-operations) [2017-08-23T09:08:03Z] <godog> upload prometheus-blackbox-exporter 0.7.0+ds1-1~wmf1 to jessie-wikimedia, backported - T169860

Change 373261 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: add ssh blackbox probes for bastions

https://gerrit.wikimedia.org/r/373261

Change 373261 merged by Filippo Giunchedi:
[operations/puppet@production] role: add ssh blackbox probes for bastions

https://gerrit.wikimedia.org/r/373261

Change 373266 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: use port 22 for ssh probing

https://gerrit.wikimedia.org/r/373266

Change 373266 merged by Filippo Giunchedi:
[operations/puppet@production] role: use port 22 for ssh probing

https://gerrit.wikimedia.org/r/373266

Change 373280 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: collect blackbox_exporter metrics in Prometheus global

https://gerrit.wikimedia.org/r/373280

Change 373280 merged by Filippo Giunchedi:
[operations/puppet@production] role: collect blackbox_exporter metrics in Prometheus global

https://gerrit.wikimedia.org/r/373280

I've put a sample dashboard at https://grafana.wikimedia.org/dashboard/db/network-probes showing for a given "target" (i.e. a bastion at the moment) its maximum latency from all sites and the number of times the probe has flapped.

ATM a check for the ssh banner is performed, IOW a full TCP connection. ICMP probing requires CAP_NET_RAW, which isn't configurable in the package yet (reported as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=872997)
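The ssh-banner check described above corresponds to blackbox_exporter's TCP prober; a sketch along the lines of the upstream examples (module name is illustrative):

```yaml
# blackbox.yml fragment -- module name "ssh_banner" is illustrative
modules:
  ssh_banner:
    prober: tcp
    timeout: 10s
    tcp:
      query_response:
        - expect: "^SSH-2.0-"   # fail unless an SSH banner is received
```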

The scaffolding for blackbox testing is now in place WRT software and puppet, I'll kick the ball over to @ayounsi to tweak things further.

Potentially useful too: https://bitbucket.org/Svedrin/meshping and https://github.com/SuperQ/smokeping_prober (just came across it, haven't looked at the code and/or tried)

Had a quick look; the main limitation compared to blackbox_exporter is that meshping only supports pings, while Smokeping and blackbox_exporter support DNS, TCP, etc.

fgiunchedi renamed this task from Investigate/setup prometheus blackbox_exporter to Replace smokeping with a Prometheus-based solution.Dec 15 2021, 9:22 AM

Following up on your IRC request on what a Smokeping replacement MVP would look like.

Smokeping currently sends pings to a manually defined list of devices from a single location (netmon1002 by default).
It also tests DNS latency to ns0/1/2 (which I won't cover here, focusing on the network aspect).
It then reports latency, jitter and loss on a dashboard and sends emails when those metrics are outside of thresholds (silencing those requires a Puppet commit).

This host list is curated to have:

  • all routers
  • some management interfaces
  • one host per failure domain (best effort)
    • one per row in eqiad/codfw
    • one per site for POPs
    • one per rack for drmrs

An MVP using Blackbox Exporter would ideally have a roughly similar set of features, without painting ourselves into a corner when it comes to reaching feature parity later.
Only ping is needed.
It would need to test latency to routers; one possibility is to re-use the devices defined in netops::monitoring.

For hosts, I don't think there is a way (until T229397 is done) to get one per row or per rack; picking a host automatically also risks triggering alerts if maintenance is performed on that host.

A good middle ground would be to probe well-known hosts, for example bastion hosts or hosts already running blackbox_exporter.
As a safety net, the MVP would need a way to manually define target hosts.

A dashboard with latency/jitter/loss for each target, and matching customizable alerts (especially on loss).

If I understand the config correctly, step = 300 and pings = 20 mean smokeping currently sends 20 pings over 300s (5min), so ~1 every 15s.

Probably out of scope for the MVP but to be considered:

  • Have the routers or bastion hosts pinged from multiple vantage points (eg. full POP mesh). That would allow us to quickly pinpoint a problematic link or device.
  • v4 and v6 need to be separated (Smokeping doesn't make the distinction)
  • Dashboard: being able to filter down on sites (eg. show all targets for a given site), or aggregate by source/destination/address family/etc.
  • Higher layer tests (eg. TCP)
  • Explicit payload size (this would require bumping the blackbox_exporter MTU, but would allow checking MTU misconfigurations)

Thank you @ayounsi for taking the time to do this

Following up on your IRC request on what a Smokeping replacement MVP would look like.

Smokeping currently sends pings to a manually defined list of devices from a single location (netmon1002 by default).
It also tests DNS latency to ns0/1/2 (which I won't cover here, focusing on the network aspect).
It then reports latency, jitter and loss on a dashboard and sends emails when those metrics are outside of thresholds (silencing those requires a Puppet commit).

This host list is curated to have:

  • all routers
  • some management interfaces
  • one host per failure domain (best effort)
    • one per row in eqiad/codfw
    • one per site for POPs
    • one per rack for drmrs

An MVP using Blackbox Exporter would ideally have a roughly similar set of features, without painting ourselves into a corner when it comes to reaching feature parity later.
Only ping is needed.

+1

It would need to test latency to routers; one possibility is to re-use the devices defined in netops::monitoring.

For hosts, I don't think there is a way (until T229397 is done) to get one per row or per rack; picking a host automatically also risks triggering alerts if maintenance is performed on that host.

A good middle ground would be to probe well-known hosts, for example bastion hosts or hosts already running blackbox_exporter.

That's a possibility for sure; I was also imagining driving the list of target hosts and devices from Netbox. I don't know what the preferred mechanics are (e.g. running queries against Netbox, vs consuming data exported from Netbox, vs sth else), but I think we have all the information needed to express the semantics you outlined above.

For the MVP though we could/should re-use the same data source(s) as smokeping for now I think.

As a safety net, the MVP would need a way to manually define target hosts.

+1 on being able to augment the list of targets manually

A dashboard with latency/jitter/loss for each target, and matching customizable alerts (especially on loss).

If I understand the config correctly, step = 300 and pings = 20 mean smokeping currently sends 20 pings over 300s (5min), so ~1 every 15s.

Yes, a ping every 15s should be no problem. When going full mesh, in practice target hosts will see one ping every <period> × <number of Prometheus hosts> (8 as of today).

Probably out of scope for the MVP but to be considered:

  • Have the routers or bastion hosts pinged from multiple vantage points (eg. full POP mesh). That would allow us to quickly pinpoint a problematic link or device.

Due to the way Prometheus is deployed this should be reasonably easy to have in the MVP (i.e. there's little difference in practice between deploying in eqiad only or all sites).

  • v4 and v6 need to be separated (Smokeping doesn't make the distinction)

Absolutely; depending on the data source for the list of targets, the MVP could do v4 and be easily extended to support v6 too (or both from the get-go, based on my experience while developing T291946)

  • Dashboard: being able to filter down on sites (eg. show all targets for a given site), or aggregate by source/destination/address family/etc.

+1, I'm thinking this should be part of MVP

  • Higher layer tests (eg. TCP)

Easy to add post MVP

  • Explicit payload size (this would require bumping the blackbox_exporter MTU, but would allow checking MTU misconfigurations)

Nice idea, and good to know, should be easy enough to extend post-MVP. How many payload sizes did you have in mind?

It would need to test latency to routers; one possibility is to re-use the devices defined in netops::monitoring.

For hosts, I don't think there is a way (until T229397 is done) to get one per row or per rack; picking a host automatically also risks triggering alerts if maintenance is performed on that host.

A good middle ground would be to probe well-known hosts, for example bastion hosts or hosts already running blackbox_exporter.

That's a possibility for sure; I was also imagining driving the list of target hosts and devices from Netbox. I don't know what the preferred mechanics are (e.g. running queries against Netbox, vs consuming data exported from Netbox, vs sth else), but I think we have all the information needed to express the semantics you outlined above.

In the long run, getting that data from Netbox makes sense, but we're still far from having the tooling. netops::monitoring is already there, and maintaining a static YAML list seems good enough for an MVP.

If I understand the config correctly, step = 300 and pings = 20 mean smokeping currently sends 20 pings over 300s (5min), so ~1 every 15s.

Yes, a ping every 15s should be no problem. When going full mesh, in practice target hosts will see one ping every <period> × <number of Prometheus hosts> (8 as of today).

Overall that's fine; even more frequent probing than every 8×15s would give us better data.

  • Explicit payload size (this would require bumping the blackbox_exporter MTU, but would allow checking MTU misconfigurations)

Nice idea, and good to know, should be easy enough to extend post-MVP. How many payload sizes did you have in mind?

I'd say 9192 minus headers, as that's what we have configured everywhere, except for the external (last-resort) transport GRE tunnels, which depend on the transit providers.

Great idea overall. I fear Grafana won't be able to visualise the data as nicely as Smokeping did, but I'm sure we can get something functional.

Explicit payload size (this would require bumping the blackbox_exporter MTU, but would allow checking MTU misconfigurations)

Nice idea, and good to know, should be easy enough to extend post-MVP. How many payload sizes did you have in mind?

We'd probably want an option to set the "don't fragment" flag on the pings, or do that by default, if we go down this road.

Otherwise there could be an issue where we get a successful result, due to fragmentation, in spite of having an incorrect MTU set somewhere.
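For reference, the blackbox_exporter ICMP prober has knobs for both ideas; a sketch (module name and payload size are illustrative; 1472 bytes fills a 1500-byte IPv4 MTU once the 20-byte IP and 8-byte ICMP headers are added):

```yaml
# blackbox.yml fragment -- module name and payload size are illustrative
modules:
  icmp_mtu1500:
    prober: icmp
    icmp:
      payload_size: 1472     # 1500 - 20 (IPv4 header) - 8 (ICMP header)
      dont_fragment: true    # fail instead of silently fragmenting
```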

Change 777330 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] WIP test replacing smokeping with blackbox exporter

https://gerrit.wikimedia.org/r/777330

I took a stab at this in https://gerrit.wikimedia.org/r/c/operations/puppet/+/777330 and it seems doable by re-using most of the work I did for service::catalog probes.

The Prometheus bits are fine, though we need to move $routers (and other "data sources") somewhere for easier sharing, most likely Hiera, and re-arrange the data structures so they are grouped explicitly by site.

Change 777347 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] WIP move core routers definitions to hiera

https://gerrit.wikimedia.org/r/777347

Change 777347 merged by Filippo Giunchedi:

[operations/puppet@production] netops: move network routers/devices definitions to hiera

https://gerrit.wikimedia.org/r/777347

Change 777330 merged by Filippo Giunchedi:

[operations/puppet@production] netops: ping core routers from Prometheus

https://gerrit.wikimedia.org/r/777330

Pinging all core routers from all sites is now live, i.e. the metrics are there!

fgiunchedi raised the priority of this task from Low to Medium.May 19 2022, 11:54 AM
fgiunchedi updated the task description.

I want to get a matrix with sites and target sites as rows and columns, with cells being e.g. latency. I _almost_ got it in Grafana but not quite (https://grafana.wikimedia.org/d/D5-lIJX7z/filippo-blackbox-smoke-icmp-matrix) and I posted a question upstream: https://community.grafana.com/t/bi-dimensional-matrix-table-display-from-prometheus-query/65972

I spent a bit of time thinking how to best visualize that data.
The matrix is a great idea and would need to be completed with some kind of color coding to know if something is out of the ordinary.

I created a "global" graph: https://grafana.wikimedia.org/d/CbNAwAXnk/filippo-blackbox-smoke-icmp?orgId=1&var-site=All&var-target_site=All&forceLogin&from=now-30m&to=now&viewPanel=9
It is messy "as is", but allows one to quickly drill down into the data when responding to an incident.
Eg. first filter on the site showing an issue.

Then I think it will be better to have each site's graphs grouped by target site instead of source, as well as include all the devices present in that site.

Adding a filter for address family (IPv4/IPv6) would be useful as well.

Longer term, it could be useful to reproduce our topology with something like:
https://grafana.com/docs/grafana/latest/visualizations/node-graph/ or even an overlay on a world map (I don't think that's even possible now)
This would be quite beneficial to more efficiently figure out where an issue is happening.

Thank you for the feedback! Super useful

I spent a bit of time thinking how to best visualize that data.
The matrix is a great idea and would need to be completed with some kind of color coding to know if something is out of the ordinary.

I created a "global" graph: https://grafana.wikimedia.org/d/CbNAwAXnk/filippo-blackbox-smoke-icmp?orgId=1&var-site=All&var-target_site=All&forceLogin&from=now-30m&to=now&viewPanel=9
It is messy "as is", but allows one to quickly drill down into the data when responding to an incident.
Eg. first filter on the site showing an issue.

Then I think it will be better to have each site's graphs grouped by target site instead of source, as well as include all the devices present in that site.

Agreed, I've changed the dashboard to group by target site and device (instance)

Adding a filter for address family (IPv4/IPv6) would be useful as well.

Done (defaults to average across families)

Longer term, it could be useful to reproduce our topology with something like:
https://grafana.com/docs/grafana/latest/visualizations/node-graph/ or even an overlay on a world map (I don't think that's even possible now)
This would be quite beneficial to more efficiently figure out where an issue is happening.

Agreed that'd be very nice!

That's nice! I think we're at a point where we can make this dashboard official, add alerting (and doc on how to silence a device) and remove the matching devices from Smokeping.

That's nice! I think we're at a point where we can make this dashboard official, add alerting (and doc on how to silence a device) and remove the matching devices from Smokeping.

SGTM! I've added an (un)availability panel and renamed the dashboard to https://grafana-rw.wikimedia.org/d/m1LYjVjnz/network-icmp-probes . I'll move on to alerting, docs, etc.

Change 804304 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] netops: add PingUnavailable alert

https://gerrit.wikimedia.org/r/804304

Change 804304 merged by Filippo Giunchedi:

[operations/alerts@master] netops: add PingUnreachable alert

https://gerrit.wikimedia.org/r/804304
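The merged rule isn't reproduced here, but an alert on blackbox's probe_success metric could be sketched along these lines (expression, threshold, and labels are assumptions, not the actual operations/alerts rule):

```yaml
# Prometheus alerting-rule sketch -- threshold and labels are assumptions
groups:
  - name: netops
    rules:
      - alert: PingUnreachable
        # probe_success is 1 on success, 0 on failure; fire if a target
        # answered fewer than half of its probes over the last 5 minutes
        expr: avg_over_time(probe_success{module="icmp"}[5m]) < 0.5
        for: 5m
        labels:
          severity: critical        # assumed severity
        annotations:
          summary: "ICMP probes to {{ $labels.instance }} are failing"
```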

Change 807100 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: stop targetting cr devices, moved to Prometheus

https://gerrit.wikimedia.org/r/807100

Change 807179 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: ping access switches and FR firewalls

https://gerrit.wikimedia.org/r/807179

Change 807100 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: stop targetting cr devices, moved to Prometheus

https://gerrit.wikimedia.org/r/807100

Change 807179 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: ping access switches and FR firewalls

https://gerrit.wikimedia.org/r/807179

Change 808914 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: remove asw/pfw, moved to Prometheus

https://gerrit.wikimedia.org/r/808914

Change 808914 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: remove pfw some asw, moved to Prometheus

https://gerrit.wikimedia.org/r/808914

Status update:

  1. all cr devices are pinged full-mesh
  2. all l3sw and pfw devices are pinged from within their site

We have left in smokeping:

  1. "canaries" for the mgmt network, namely a mix of msw and asw
  2. a few hosts
    1. frack: frbast-eqiad and frpig1001
    2. a manually-maintained, per-row (more or less) list of well-known hosts
  3. DNS latency towards ns*

The same general issue of "blackbox probes reaching out to the mgmt network" is being tackled as part of T310266: Move mgmt SSH checks from Icinga to Prometheus/Alertmanager and will likely result in running blackbox exporter from hosts that do have access to the mgmt network (e.g. cumin)

The "pick a couple of hosts per row to be probed" problem could be tackled either via netbox-exported hiera data, or a puppetdb query (I think), or even more simply by targeting e.g. all bastions (either site-local or full-mesh). Of course it depends on what we'd like to be observing!

For frack I'm not quite sure what the right answer is, though the answer is likely of bigger scope and ties into what we'll be doing wrt frack and icinga and all that. At the moment, for example, frpig1001 is mentioned in puppet both in the manually-maintained list of frack hosts in icinga (modules/icinga/templates/nsca_frack.cfg.erb) and in smokeping (modules/smokeping/templates/config.d/Targets.erb); frbast-eqiad shows up in smokeping only.

DNS checks should be ported over to blackbox exporter too, and while we're at it we should make sure we're performing the right checks.
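For reference, blackbox_exporter's DNS prober can express such checks; a sketch (module name, query, and answer validation are illustrative, not the merged configuration):

```yaml
# blackbox.yml fragment -- module name, query and validation are illustrative
modules:
  dns_wikipedia:
    prober: dns
    dns:
      query_name: "www.wikipedia.org"
      query_type: "A"
      valid_rcodes: [NOERROR]
      validate_answer_rrs:
        fail_if_not_matches_regexp:
          - ".+\\tIN\\tA\\t.+"   # expect at least one A record in the answer
```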

Change 809535 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add initial blackbox dns probes for wikipedia

https://gerrit.wikimedia.org/r/809535

Change 809536 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: probe DNS for (www).wikipedia.org

https://gerrit.wikimedia.org/r/809536

Change 809535 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add initial blackbox dns probes for wikipedia

https://gerrit.wikimedia.org/r/809535

Change 809536 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: probe DNS for (www).wikipedia.org

https://gerrit.wikimedia.org/r/809536

Change 811207 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] netops: add DNS probes alerts

https://gerrit.wikimedia.org/r/811207

Change 811207 merged by Filippo Giunchedi:

[operations/alerts@master] netops: add DNS probes alerts

https://gerrit.wikimedia.org/r/811207

Change 812329 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: remove DNS targets, moved to Prometheus

https://gerrit.wikimedia.org/r/812329

Change 812330 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add support to blackbox icmp probe hosts

https://gerrit.wikimedia.org/r/812330

Change 812331 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: blackbox icmp probes for hosts

https://gerrit.wikimedia.org/r/812331

Change 812329 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: remove DNS targets, moved to Prometheus

https://gerrit.wikimedia.org/r/812329

Change 812330 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add support to blackbox icmp probe hosts

https://gerrit.wikimedia.org/r/812330

Change 812331 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: blackbox icmp probes for hosts

https://gerrit.wikimedia.org/r/812331

Change 814792 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: remove sampled hosts, probed by Prometheus

https://gerrit.wikimedia.org/r/814792

Change 814792 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: remove sampled hosts, probed by Prometheus

https://gerrit.wikimedia.org/r/814792

Change 814849 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: fix targets configuration for drmrs

https://gerrit.wikimedia.org/r/814849

Change 814849 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: fix targets configuration for drmrs

https://gerrit.wikimedia.org/r/814849