Lately we have had lots of alerts like:
23:31 < icinga-wm> RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 20 probes of 317 (alerts on 25) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts 23:34 < icinga-wm> RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 0 probes of 342 (alerts on 25) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts 22:45 < icinga-wm> PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 26 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts 22:50 < icinga-wm> RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 16 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
According to the docs:
If a high number of probes fail (eg. >75%) or if both IPv4 and IPv6 are failing simultaneously, and no quick recovery (~5min) it is less likely of a false positive, ping Netops If flapping with a failing number of probe close to the threshold, its possibly a false positive, monitor/downtime and open a high priority Netops task If it matches an (un)scheduled provider maintenance, it is possibly a side effect, if no quick recovery, page Netops to potentially drain that specific link
There are some vendor's emails, but I am not sure if it is really related or not.
I assume it is not critical or not too critical as it has been triggered a lot during the weekend and nothing has screamed, so if they are false positives, should we increase the threshold?