Some more details on https://wikitech.wikimedia.org/wiki/Anycast#Limitations
Because of how Anycast routing works, monitoring a VIP from a central location only verifies the anycast server closest to that location.
For example, 10.3.0.1 (our recursive DNS service) is advertised from all the sites, but when pinging that VIP, the eqiad Icinga server only talks to the eqiad DNS server.
So if the ulsfo instance is misbehaving, we don't see it.
With Blackbox exporter, we can have every site ping the VIP, which gives us more visibility.
In that setup, if the ulsfo instance dies, the ulsfo servers will fall back to the codfw servers, resulting in increased latency.
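The difference between a central probe and per-site probes can be illustrated with a toy model. This is only a sketch: the RTT values are made up, and it assumes each client reaches the healthy instance with the lowest RTT, which is a simplification of how BGP actually selects the anycast path.

```python
# Illustrative round-trip times in ms between sites (made-up numbers).
RTT_MS = {
    "eqiad": {"eqiad": 1, "codfw": 32, "ulsfo": 72},
    "codfw": {"eqiad": 32, "codfw": 1, "ulsfo": 40},
    "ulsfo": {"eqiad": 72, "codfw": 40, "ulsfo": 1},
}


def anycast_target(client_site, healthy_sites):
    """Return (serving_site, rtt_ms) for the nearest healthy instance.

    Assumption: "nearest" means lowest RTT in the table above.
    """
    candidates = {site: RTT_MS[client_site][site] for site in healthy_sites}
    serving = min(candidates, key=candidates.get)
    return serving, candidates[serving]


def probe_from(sites, healthy_sites):
    """Simulate one probe of the VIP from each of the given sites."""
    return {site: anycast_target(site, healthy_sites) for site in sites}


if __name__ == "__main__":
    healthy = {"eqiad", "codfw"}  # the ulsfo instance is down
    # A central probe from eqiad still sees its local instance: no signal.
    print("central probe:  ", probe_from(["eqiad"], healthy))
    # Per-site probes show ulsfo being served by codfw at a higher RTT.
    print("per-site probes:", probe_from(list(RTT_MS), healthy))
```

In this model the eqiad probe looks identical whether or not ulsfo is healthy, while the ulsfo probe shifts to codfw with a visibly higher RTT.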
For alerting there are multiple options:
- Alert if the latency increases
- Set a low TTL on the pings (so they fail if the reply is not local)
- Graph which server replies to the queries (e.g. with an equivalent of dig @10.3.0.1 CHAOS TXT id.server. +short)
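The last two options could be combined in a small probe. The following is a minimal sketch using only the Python standard library, not an existing tool: the function names are hypothetical, it assumes the resolver answers `id.server.` TXT queries in the CHAOS class (as the dig command above implies), and the `ip_ttl` hop limit is an assumed, site-specific value. Note that it limits the TTL of a UDP DNS query rather than of an ICMP ping.

```python
import socket
import struct

QTYPE_TXT = 16
QCLASS_CHAOS = 3


def build_id_server_query(txid):
    """Build a DNS query for `id.server. CHAOS TXT` (one question, no flags)."""
    header = struct.pack("!HHHHHH", txid, 0, 1, 0, 0, 0)
    qname = b"\x02id\x06server\x00"
    return header + qname + struct.pack("!HH", QTYPE_TXT, QCLASS_CHAOS)


def _skip_name(data, offset):
    """Return the offset just past a (possibly compressed) DNS name."""
    while True:
        length = data[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # compression pointer is 2 bytes long
            return offset + 2
        offset += 1 + length


def parse_first_txt(data):
    """Return the text of the first answer if it is a TXT record, else None."""
    _txid, _flags, qdcount, ancount, _ns, _ar = struct.unpack("!HHHHHH", data[:12])
    offset = 12
    for _ in range(qdcount):
        offset = _skip_name(data, offset) + 4  # skip QTYPE and QCLASS
    if ancount == 0:
        return None
    offset = _skip_name(data, offset)
    rtype, _rclass, _ttl, rdlength = struct.unpack("!HHIH", data[offset:offset + 10])
    offset += 10
    if rtype != QTYPE_TXT:
        return None
    end = offset + rdlength
    parts = []
    while offset < end:  # TXT rdata is a sequence of length-prefixed strings
        n = data[offset]
        parts.append(data[offset + 1:offset + 1 + n])
        offset += 1 + n
    return b"".join(parts).decode("ascii", "replace")


def query_id_server(vip, ip_ttl=None, timeout=2.0):
    """Ask the VIP which server answers; raises socket.timeout on no reply.

    ip_ttl (assumption): if set, limits how many hops the query may travel,
    so a non-local instance makes the probe time out instead of answering.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        if ip_ttl is not None:
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ip_ttl)
        sock.sendto(build_id_server_query(0x1234), (vip, 53))  # fixed ID for brevity
        data, _addr = sock.recvfrom(512)
    return parse_first_txt(data)
```

Calling `query_id_server("10.3.0.1")` from each site would return the identity of the answering server, which can be graphed. Passing a small `ip_ttl` turns the same probe into a locality check that times out when the local instance is gone.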