
Replace smokeping with a Prometheus-based solution
Open, Medium, Public

Description

Looking at how to get smokeping data in prometheus/grafana I found that the Prometheus blackbox_exporter could potentially replace Smokeping in a more efficient way:

  • better event correlation (eg. network latency/loss and application errors on the same dashboard)
  • centralized data (removes the need for yet another tool)
  • time series database (instead of RRD files)
  • distributed (can run on any server in a P2P way, which is possible with Smokeping but more complex)
  • easier configuration

Some points still need to be researched more.

Experimental/PoC dashboard: https://grafana.wikimedia.org/d/CbNAwAXnk/filippo-blackbox-smoke-icmp
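For context, blackbox_exporter's ICMP probing is driven by a small module definition in its configuration file; a minimal sketch (module name and timeout are illustrative, not taken from the actual deployment):

```yaml
# blackbox.yml fragment -- module name "icmp" is illustrative
modules:
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4   # probe over IPv4; "ip6" for IPv6
```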

Details

Related patches (project, branch, lines +/-):

  • operations/puppet (production): +4 -0
  • operations/puppet (production): +0 -130
  • operations/puppet (production): +123 -2
  • operations/puppet (production): +64 -0
  • operations/puppet (production): +0 -14
  • operations/alerts (master): +34 -0
  • operations/puppet (production): +45 -0
  • operations/puppet (production): +18 -0
  • operations/puppet (production): +0 -49
  • operations/puppet (production): +21 -0
  • operations/puppet (production): +0 -87
  • operations/alerts (master): +43 -0
  • operations/puppet (production): +83 -3
  • operations/puppet (production): +513 -101
  • operations/puppet (production): +2 -0
  • operations/puppet (production): +3 -1
  • operations/puppet (production): +26 -1
  • operations/puppet (production): +93 -1
  • operations/puppet (production): +37 -0
  • operations/puppet (production): +6 -13

Event Timeline

There are a very large number of changes, so older changes are hidden.

Sounds like a good idea! re: the research questions:

  • Alerting we can do through Grafana alerts itself or via Icinga and check_prometheus; IMHO the latter would be preferable, to avoid adding yet another system to the alerting pipeline
  • Frequency depends on how often we set Prometheus to scrape that particular job; I think for this use case we can do 15s, yes, we'll have to try!

We already have some scaffolding for blackbox_exporter in puppet (for tools), but it will need some puppet-love; the Debian package is relatively straightforward.
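Scaffolding aside, the usual way to wire blackbox_exporter into a Prometheus job looks roughly like this (job name, target, and exporter address are placeholders, and the 15s interval is the frequency discussed above):

```yaml
# prometheus.yml fragment -- target and exporter address are placeholders
scrape_configs:
  - job_name: blackbox_icmp
    metrics_path: /probe
    scrape_interval: 15s          # the ~15s probe frequency discussed above
    params:
      module: [icmp]              # module defined in blackbox.yml
    static_configs:
      - targets:
          - example-host.wikimedia.org   # hypothetical probe target
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # pass the target as ?target=
      - source_labels: [__param_target]
        target_label: instance           # keep the probed host as "instance"
      - target_label: __address__
        replacement: 127.0.0.1:9115      # scrape the local blackbox_exporter
```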

Change 365239 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: use blackbox-exporter package from Debian

https://gerrit.wikimedia.org/r/365239

Change 365240 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: additional blackbox checks

https://gerrit.wikimedia.org/r/365240

Change 365239 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: use blackbox-exporter package from Debian

https://gerrit.wikimedia.org/r/365239

Change 365240 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: additional blackbox checks

https://gerrit.wikimedia.org/r/365240

Change 373062 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add blackbox configuration for prometheus::ops

https://gerrit.wikimedia.org/r/373062

Change 373062 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add blackbox configuration for prometheus::ops

https://gerrit.wikimedia.org/r/373062

Mentioned in SAL (#wikimedia-operations) [2017-08-23T09:08:03Z] <godog> upload prometheus-blackbox-exporter 0.7.0+ds1-1~wmf1 to jessie-wikimedia, backported - T169860

Change 373261 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: add ssh blackbox probes for bastions

https://gerrit.wikimedia.org/r/373261

Change 373261 merged by Filippo Giunchedi:
[operations/puppet@production] role: add ssh blackbox probes for bastions

https://gerrit.wikimedia.org/r/373261

Change 373266 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: use port 22 for ssh probing

https://gerrit.wikimedia.org/r/373266

Change 373266 merged by Filippo Giunchedi:
[operations/puppet@production] role: use port 22 for ssh probing

https://gerrit.wikimedia.org/r/373266

Change 373280 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: collect blackbox_exporter metrics in Prometheus global

https://gerrit.wikimedia.org/r/373280

Change 373280 merged by Filippo Giunchedi:
[operations/puppet@production] role: collect blackbox_exporter metrics in Prometheus global

https://gerrit.wikimedia.org/r/373280

I've put a sample dashboard at https://grafana.wikimedia.org/dashboard/db/network-probes showing for a given "target" (i.e. a bastion at the moment) its maximum latency from all sites and the number of times the probe has flapped.

ATM a check for the ssh banner is performed, IOW a full TCP connection. ICMP probing requires CAP_NET_RAW, which isn't configurable in the package yet (reported as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=872997)
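The ssh-banner check described above corresponds to blackbox_exporter's TCP prober; a sketch along the lines of the upstream examples (module name is illustrative):

```yaml
# blackbox.yml fragment -- module name "ssh_banner" is illustrative
modules:
  ssh_banner:
    prober: tcp
    timeout: 10s
    tcp:
      query_response:
        - expect: "^SSH-2.0-"   # fail unless an SSH banner is received
```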

The scaffolding for blackbox testing is now in place WRT software and puppet, I'll kick the ball over to @ayounsi to tweak things further.

Potentially useful too: https://bitbucket.org/Svedrin/meshping and https://github.com/SuperQ/smokeping_prober (just came across it, haven't looked at the code and/or tried)

Had a quick look; the main limitation compared to blackbox_exporter is that meshping only supports pings, while Smokeping and blackbox_exporter support DNS, TCP, etc.

fgiunchedi renamed this task from Investigate/setup prometheus blackbox_exporter to Replace smokeping with a Prometheus-based solution.Dec 15 2021, 9:22 AM

Following up on your IRC request on what a Smokeping replacement MVP would look like.

Smokeping currently sends pings to a manually defined list of devices from a single location (netmon1002 by default).
It also tests DNS latency to ns0/1/2 (which I won't cover here, focusing on the network aspect).
It then reports latency, jitter and loss on a dashboard and sends emails when those metrics are outside of thresholds (silencing those requires a Puppet commit).

This host list is curated to have:

  • all routers
  • some management interfaces
  • one host per failure domain (best effort)
    • one per row in eqiad/codfw
    • one per site for POPs
    • one per rack for drmrs

An MVP using Blackbox Exporter would ideally have a roughly similar set of features, without painting ourselves into a corner when it comes to reaching feature parity later.
Only ping is needed.
It would need to test latency to routers; one possibility is to re-use the devices defined in netops::monitoring.

For hosts, I don't think there is a way (until T229397 is done) to get one per row or per rack; picking a host automatically also risks triggering alerts if maintenance is performed on that host.

A good middle ground would be to probe well-known hosts, for example bastion hosts or hosts already running blackbox_exporter.
As a safety net, the MVP would need a way to manually define target hosts.

A dashboard with latency/jitter/loss for each target, and matching customizable alerts (especially on loss).

If I understand the config correctly, step = 300 and pings = 20 mean smokeping currently sends 20 pings over 300s (5min), so ~1 every 15s.

Probably out of scope for the MVP but to be considered:

  • Have the routers or bastion hosts pinged from multiple vantage points (eg. full POP mesh). That would allow us to quickly pinpoint a problematic link or device.
  • v4 and v6 need to be separated (Smokeping doesn't make the distinction)
  • Dashboard: being able to filter down on sites (eg. show all targets for a given site), or aggregate by source/destination/address family/etc.
  • Higher layer tests (eg. TCP)
  • Explicit payload size (this would require bumping the blackbox_exporter MTU, but would allow checking MTU misconfigurations)

Thank you @ayounsi for taking the time to do this

Following up on your IRC request on what a Smokeping replacement MVP would look like.

Smokeping currently sends pings to a manually defined list of devices from a single location (netmon1002 by default).
It also tests DNS latency to ns0/1/2 (which I won't cover here, focusing on the network aspect).
It then reports latency, jitter and loss on a dashboard and sends emails when those metrics are outside of thresholds (silencing those requires a Puppet commit).

This host list is curated to have:

  • all routers
  • some management interfaces
  • one host per failure domain (best effort)
    • one per row in eqiad/codfw
    • one per site for POPs
    • one per rack for drmrs

An MVP using Blackbox Exporter would ideally have a roughly similar set of features, without painting ourselves into a corner when it comes to reaching feature parity later.
Only ping is needed.

+1

It would need to test latency to routers; one possibility is to re-use the devices defined in netops::monitoring.

For hosts, I don't think there is a way (until T229397 is done) to get one per row or per rack; picking a host automatically also risks triggering alerts if maintenance is performed on that host.

A good middle ground would be to probe well-known hosts, for example bastion hosts or hosts already running blackbox_exporter.

That's a possibility for sure; I was also imagining driving the list of target hosts and devices from Netbox. I don't know what the preferred mechanics are (e.g. running queries against Netbox, vs consuming data exported from Netbox, vs sth else), but I think we have all the information needed to express the semantics you outlined above.

For the MVP though we could/should re-use the same data source(s) as smokeping for now I think.

As a safety net, the MVP would need a way to manually define target hosts.

+1 on being able to augment the list of targets manually

A dashboard with latency/jitter/loss for each target, and matching customizable alerts (especially on loss).

If I understand the config correctly, step = 300 and pings = 20 mean smokeping currently sends 20 pings over 300s (5min), so ~1 every 15s.

Yes, a ping every 15s should be no problem. When going full mesh, in practice target hosts will see one ping every <period> × <number of Prometheus hosts> (8 as of today).

Probably out of scope for the MVP but to be considered:

  • Have the routers or bastion hosts pinged from multiple vantage points (eg. full POP mesh). That would allow us to quickly pinpoint a problematic link or device.

Due to the way Prometheus is deployed this should be reasonably easy to have in the MVP (i.e. there's little difference in practice between deploying in eqiad only or all sites).

  • v4 and v6 need to be separated (Smokeping doesn't make the distinction)

Absolutely; depending on the data source for the list of targets, the MVP could do v4 and be easily extended to support v6 too (or both from the get-go, based on my experience while developing T291946)

  • Dashboard: being able to filter down on sites (eg. show all targets for a given site), or aggregate by source/destination/address family/etc.

+1, I'm thinking this should be part of MVP

  • Higher layer tests (eg. TCP)

Easy to add post MVP

  • Explicit payload size (this would require bumping the blackbox_exporter MTU, but would allow checking MTU misconfigurations)

Nice idea, and good to know, should be easy enough to extend post-MVP. How many payload sizes did you have in mind?

It would need to test latency to routers; one possibility is to re-use the devices defined in netops::monitoring.

For hosts, I don't think there is a way (until T229397 is done) to get one per row or per rack; picking a host automatically also risks triggering alerts if maintenance is performed on that host.

A good middle ground would be to probe well-known hosts, for example bastion hosts or hosts already running blackbox_exporter.

That's a possibility for sure; I was also imagining driving the list of target hosts and devices from Netbox. I don't know what the preferred mechanics are (e.g. running queries against Netbox, vs consuming data exported from Netbox, vs sth else), but I think we have all the information needed to express the semantics you outlined above.

In the long run, getting that data from Netbox makes sense, but we're still far from having the tooling. netops::monitoring is already there, and maintaining a static YAML list seems good enough for an MVP.

If I understand the config correctly, step = 300 and pings = 20 mean smokeping currently sends 20 pings over 300s (5min), so ~1 every 15s.

Yes, a ping every 15s should be no problem. When going full mesh, in practice target hosts will see one ping every <period> × <number of Prometheus hosts> (8 as of today).

Overall that's fine; even more frequent probing than every 8×15s would give us better data.

  • Explicit payload size (this would require bumping the blackbox_exporter MTU, but would allow checking MTU misconfigurations)

Nice idea, and good to know, should be easy enough to extend post-MVP. How many payload sizes did you have in mind?

I'd say 9192 minus headers, as that's what we have configured everywhere, except for the external (last-resort) transport GRE tunnels, which depend on the transit providers.

Great idea overall. I fear Grafana won't be able to visualise the data as nicely as Smokeping did, but I'm sure we can get something functional.

Explicit payload size (this would require bumping the blackbox_exporter MTU, but would allow checking MTU misconfigurations)

Nice idea, and good to know, should be easy enough to extend post-MVP. How many payload sizes did you have in mind?

We'd probably want an option to set the "don't fragment" flag on the pings, or do that by default, if we go down this road.

Otherwise there could be an issue where we get a successful result, due to fragmentation, in spite of having an incorrect MTU set somewhere.
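For reference, the blackbox_exporter ICMP prober has knobs for both ideas; a sketch (module name and payload size are illustrative; 1472 bytes fills a 1500-byte IPv4 MTU once the 20-byte IP and 8-byte ICMP headers are added):

```yaml
# blackbox.yml fragment -- module name and payload size are illustrative
modules:
  icmp_mtu1500:
    prober: icmp
    icmp:
      payload_size: 1472     # 1500 - 20 (IPv4 header) - 8 (ICMP header)
      dont_fragment: true    # fail instead of silently fragmenting
```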

Change 777330 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] WIP test replacing smokeping with blackbox exporter

https://gerrit.wikimedia.org/r/777330

I took a stab at this in https://gerrit.wikimedia.org/r/c/operations/puppet/+/777330 and it seems doable by re-using most of the work I did for service::catalog probes.

The Prometheus bits are fine, though we need to move $routers (and other "data sources") somewhere for easier sharing, most likely Hiera, and re-arrange the data structures so they are grouped explicitly by site.

Change 777347 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] WIP move core routers definitions to hiera

https://gerrit.wikimedia.org/r/777347

Change 777347 merged by Filippo Giunchedi:

[operations/puppet@production] netops: move network routers/devices definitions to hiera

https://gerrit.wikimedia.org/r/777347

Change 777330 merged by Filippo Giunchedi:

[operations/puppet@production] netops: ping core routers from Prometheus

https://gerrit.wikimedia.org/r/777330

Pinging all core routers from all sites is now live, i.e. the metrics are there!

fgiunchedi raised the priority of this task from Low to Medium.May 19 2022, 11:54 AM
fgiunchedi updated the task description.

I want to get a matrix with sites and target sites as rows and columns, with cells being e.g. latency. I _almost_ got it in Grafana but not quite (https://grafana.wikimedia.org/d/D5-lIJX7z/filippo-blackbox-smoke-icmp-matrix) and I posted a question upstream: https://community.grafana.com/t/bi-dimensional-matrix-table-display-from-prometheus-query/65972

I spent a bit of time thinking how to best visualize that data.
The matrix is a great idea and would need to be completed with some kind of color coding to know if something is out of the ordinary.

I created a "global" graph: https://grafana.wikimedia.org/d/CbNAwAXnk/filippo-blackbox-smoke-icmp?orgId=1&var-site=All&var-target_site=All&forceLogin&from=now-30m&to=now&viewPanel=9
It is messy "as is", but allows one to quickly drill down into the data when responding to an incident.
Eg. first filter on the site showing an issue.

Then I think it will be better to have each site's graphs grouped by target site instead of source, as well as include all the devices present in that site.

Adding a filter for address family (IPv4/IPv6) would be useful as well.

Longer term, it could be useful to reproduce our topology with something like:
https://grafana.com/docs/grafana/latest/visualizations/node-graph/ or even an overlay on a world map (I don't think that's even possible now)
This would be quite beneficial to more efficiently figure out where an issue is happening.

Thank you for the feedback! Super useful

I spent a bit of time thinking how to best visualize that data.
The matrix is a great idea and would need to be completed with some kind of color coding to know if something is out of the ordinary.

I created a "global" graph: https://grafana.wikimedia.org/d/CbNAwAXnk/filippo-blackbox-smoke-icmp?orgId=1&var-site=All&var-target_site=All&forceLogin&from=now-30m&to=now&viewPanel=9
It is messy "as is", but allows one to quickly drill down into the data when responding to an incident.
Eg. first filter on the site showing an issue.

Then I think it will be better to have each site's graphs grouped by target site instead of source, as well as include all the devices present in that site.

Agreed, I've changed the dashboard to group by target site and device (instance)

Adding a filter for address family (IPv4/IPv6) would be useful as well.

Done (defaults to average across families)

Longer term, it could be useful to reproduce our topology with something like:
https://grafana.com/docs/grafana/latest/visualizations/node-graph/ or even an overlay on a world map (I don't think that's even possible now)
This would be quite beneficial to more efficiently figure out where an issue is happening.

Agreed that'd be very nice!

That's nice! I think we're at a point where we can make this dashboard official, add alerting (and doc on how to silence a device) and remove the matching devices from Smokeping.

That's nice! I think we're at a point where we can make this dashboard official, add alerting (and doc on how to silence a device) and remove the matching devices from Smokeping.

SGTM! I've added an (un)availability panel and renamed the dashboard to https://grafana-rw.wikimedia.org/d/m1LYjVjnz/network-icmp-probes . I'll move on to alerting, docs, etc.

Change 804304 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] netops: add PingUnavailable alert

https://gerrit.wikimedia.org/r/804304

Change 804304 merged by Filippo Giunchedi:

[operations/alerts@master] netops: add PingUnreachable alert

https://gerrit.wikimedia.org/r/804304
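The merged rule isn't reproduced here, but an alert on blackbox's probe_success metric could be sketched along these lines (expression, threshold, and labels are assumptions, not the actual operations/alerts rule):

```yaml
# Prometheus alerting-rule sketch -- threshold and labels are assumptions
groups:
  - name: netops
    rules:
      - alert: PingUnreachable
        # probe_success is 1 on success, 0 on failure; fire if a target
        # answered fewer than half of its probes over the last 5 minutes
        expr: avg_over_time(probe_success{module="icmp"}[5m]) < 0.5
        for: 5m
        labels:
          severity: critical        # assumed severity
        annotations:
          summary: "ICMP probes to {{ $labels.instance }} are failing"
```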

Change 807100 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: stop targetting cr devices, moved to Prometheus

https://gerrit.wikimedia.org/r/807100

Change 807179 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: ping access switches and FR firewalls

https://gerrit.wikimedia.org/r/807179

Change 807100 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: stop targetting cr devices, moved to Prometheus

https://gerrit.wikimedia.org/r/807100

Change 807179 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: ping access switches and FR firewalls

https://gerrit.wikimedia.org/r/807179

Change 808914 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: remove asw/pfw, moved to Prometheus

https://gerrit.wikimedia.org/r/808914

Change 808914 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: remove pfw some asw, moved to Prometheus

https://gerrit.wikimedia.org/r/808914

Status update:

  1. all cr devices are pinged full-mesh
  2. all l3sw and pfw devices are pinged from within their site

We have left in smokeping:

  1. "canaries" for the mgmt network, namely a mix of msw and asw
  2. a few hosts
    1. frack: frbast-eqiad and frpig1001
    2. a manually-maintained, per-row (more or less) list of well-known hosts
  3. DNS latency towards ns*

The same general issue of "blackbox probes reaching out to the mgmt network" is being tackled as part of T310266: Move mgmt SSH checks from Icinga to Prometheus/Alertmanager and will likely result in running blackbox exporter from hosts that do have access to the mgmt network (e.g. cumin)

The "pick a couple of hosts per row to be probed" problem could be tackled either via netbox-exported hiera data, or a puppetdb query (I think), or even more simply by targeting e.g. all bastions (either site-local or full-mesh). Of course it depends on what we'd like to be observing!

For frack I'm not quite sure what the right answer is, though the answer is likely of bigger scope and ties into what we'll be doing wrt frack and icinga and all that. At the moment, for example, frpig1001 is mentioned in puppet both in the manually-maintained list of frack hosts in icinga (modules/icinga/templates/nsca_frack.cfg.erb) and in smokeping (modules/smokeping/templates/config.d/Targets.erb); frbast-eqiad shows up in smokeping only.

DNS checks should be ported over to blackbox exporter too, and while we're at it we should make sure we're performing the right checks.
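For reference, blackbox_exporter's DNS prober can express such checks; a sketch (module name, query, and answer validation are illustrative, not the merged configuration):

```yaml
# blackbox.yml fragment -- module name, query and validation are illustrative
modules:
  dns_wikipedia:
    prober: dns
    dns:
      query_name: "www.wikipedia.org"
      query_type: "A"
      valid_rcodes: [NOERROR]
      validate_answer_rrs:
        fail_if_not_matches_regexp:
          - ".+\\tIN\\tA\\t.+"   # expect at least one A record in the answer
```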

Change 809535 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add initial blackbox dns probes for wikipedia

https://gerrit.wikimedia.org/r/809535

Change 809536 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: probe DNS for (www).wikipedia.org

https://gerrit.wikimedia.org/r/809536

Change 809535 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add initial blackbox dns probes for wikipedia

https://gerrit.wikimedia.org/r/809535

Change 809536 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: probe DNS for (www).wikipedia.org

https://gerrit.wikimedia.org/r/809536

Change 811207 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] netops: add DNS probes alerts

https://gerrit.wikimedia.org/r/811207

Change 811207 merged by Filippo Giunchedi:

[operations/alerts@master] netops: add DNS probes alerts

https://gerrit.wikimedia.org/r/811207

Change 812329 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: remove DNS targets, moved to Prometheus

https://gerrit.wikimedia.org/r/812329

Change 812330 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add support to blackbox icmp probe hosts

https://gerrit.wikimedia.org/r/812330

Change 812331 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: blackbox icmp probes for hosts

https://gerrit.wikimedia.org/r/812331

Change 812329 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: remove DNS targets, moved to Prometheus

https://gerrit.wikimedia.org/r/812329

Change 812330 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add support to blackbox icmp probe hosts

https://gerrit.wikimedia.org/r/812330

Change 812331 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: blackbox icmp probes for hosts

https://gerrit.wikimedia.org/r/812331

Change 814792 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: remove sampled hosts, probed by Prometheus

https://gerrit.wikimedia.org/r/814792

Change 814792 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: remove sampled hosts, probed by Prometheus

https://gerrit.wikimedia.org/r/814792

Change 814849 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: fix targets configuration for drmrs

https://gerrit.wikimedia.org/r/814849

Change 814849 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: fix targets configuration for drmrs

https://gerrit.wikimedia.org/r/814849