
IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert
Closed, Resolved · Public

Description

Lately we have had lots of alerts like:

23:31 < icinga-wm> RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 20 probes of 317 (alerts on 25) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
23:34 < icinga-wm> RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 0 probes of 342 (alerts on 25) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
22:45 < icinga-wm> PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 26 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
22:50 < icinga-wm> RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 16 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts

According to the docs:

If a high number of probes fail (e.g. >75%), or if both IPv4 and IPv6 are failing simultaneously, and there is no quick recovery (~5 min), it is less likely a false positive: ping Netops
If flapping with a number of failing probes close to the threshold, it's possibly a false positive: monitor/downtime and open a high-priority Netops task
If it matches an (un)scheduled provider maintenance, it is possibly a side effect; if no quick recovery, page Netops to potentially drain that specific link

There are some vendor emails, but I am not sure whether they are really related.
I assume it is not critical, or not too critical, as it has been triggered a lot during the weekend and nothing has screamed; if these are false positives, should we increase the threshold?
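
For reference, the numbers in those alerts come from counting probes whose pings all failed against a fixed per-measurement threshold. Below is a minimal sketch of that logic against the public RIPE Atlas API; it is not the actual Icinga plugin, and the measurement ID and threshold are simply the ones from the alert text above.

```python
#!/usr/bin/env python3
"""Sketch of how the RIPE Atlas ping alert numbers could be derived.

Not the real check_ripe_atlas plugin; a minimal illustration only.
"""
import requests

MEASUREMENT_ID = 11645088   # IPv6 ping to the eqsin anchor (from the alert above)
FAILED_THRESHOLD = 25       # "alerts on 25" in the Icinga output

# /latest/ returns the most recent ping result for every probe in the mesh.
url = f"https://atlas.ripe.net/api/v2/measurements/{MEASUREMENT_ID}/latest/"
results = requests.get(url, timeout=30).json()

# Count a probe as "failed" when none of its ping packets came back.
failed = sum(1 for r in results if r.get("rcvd", 0) == 0)
total = len(results)

state = "CRITICAL" if failed > FAILED_THRESHOLD else "OK"
print(f"{state} - failed {failed} probes of {total} (alerts on {FAILED_THRESHOLD})")
```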

Event Timeline

herron triaged this task as High priority. Oct 2 2018, 5:26 PM

I took the 16 probes unable to reach the eqsin anchor over IPv6 during the last measurement (https://atlas.ripe.net/measurements/11645088/) and ran traceroutes from them to the eqsin anchor (https://atlas.ripe.net/measurements/16451446/#!tracemon), as well as from bast5001 to some of them.

Some have a clear routing loop in the path, others have packet loss or high latency in the path, but nothing that seems to indicate an issue on our side.
Increasing the threshold is the only viable option I see.
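
As an aside, a one-off traceroute measurement like the one linked above can be scheduled from specific probes via the RIPE Atlas v2 API. A rough sketch follows; the API key, probe IDs and target hostname are placeholders, not the values actually used here.

```python
#!/usr/bin/env python3
"""Sketch: schedule a one-off traceroute from specific RIPE Atlas probes."""
import requests

ATLAS_API_KEY = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder key
FAILING_PROBE_IDS = [1234, 5678]                         # placeholder probe IDs

payload = {
    "definitions": [{
        "type": "traceroute",
        "af": 6,
        "target": "anchor.example.net",   # placeholder target, e.g. the eqsin anchor
        "description": "Debug IPv6 loss towards eqsin anchor",
        "protocol": "ICMP",
    }],
    "probes": [{
        "type": "probes",
        "value": ",".join(str(p) for p in FAILING_PROBE_IDS),
        "requested": len(FAILING_PROBE_IDS),
    }],
    "is_oneoff": True,
}

resp = requests.post(
    "https://atlas.ripe.net/api/v2/measurements/",
    json=payload,
    headers={"Authorization": f"Key {ATLAS_API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print("Created measurement(s):", resp.json().get("measurements"))
```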

Change 465476 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Icinga: increase ripe atlas alerting threshold to 35

https://gerrit.wikimedia.org/r/465476

Change 465476 merged by Ayounsi:
[operations/puppet@production] Icinga: increase ripe atlas alerting threshold to 35

https://gerrit.wikimedia.org/r/465476

ayounsi claimed this task.

That should be good enough to make the alerts useful by removing the false positives.

Please reopen if still too noisy.

The IPv6 ping to eqiad alert keeps flapping; I downtimed it for 2 days and emailed RIPE.

Reply from RIPE:

I am currently running some comparisons between your anchor and a few others. Hopefully that will shed some more light on this. I will let you know in a few days what I have discovered.

Reply from RIPE:

I see that you have found the problem as my graphs are looking normal now. From what I can gather, it was packet loss on IPv6 causing the flaky connections. It's now in line with your other anchors.
For your information: if you're using the status check URL for the mesh ping measurements, then seeing an error rate between 0 and 1% is normal for IPv4, but for IPv6 we see a baseline error rate of around 5-6%.
Setting alerting thresholds at 5% for IPv4 and 10-15% for IPv6 therefore seems reasonable.
Hopefully in the new year, we will have some more built-in alerting and monitoring setup. We're working hard on this feature and if you have any ideas about it, we'd love to hear them :)
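
Applying those suggested percentages to the probe counts seen in the alerts above (a quick back-of-the-envelope check, not part of the original exchange):

```python
# Back-of-the-envelope: RIPE's suggested thresholds vs. the probe counts above.
ipv4_probes = 342   # from the IPv4 ulsfo alert
ipv6_probes = 315   # from the IPv6 eqsin alert

print("IPv4 5% threshold:    ", round(ipv4_probes * 0.05))    # ~17 probes
print("IPv6 10-15% threshold:", round(ipv6_probes * 0.10),
      "to", round(ipv6_probes * 0.15))                         # ~32 to ~47 probes
```

The threshold of 35 from the merged patch falls within the suggested 10-15% range for IPv6 (roughly 32-47 probes out of 315).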

This has been quiet since. No root cause identified though.