
Use blackbox exporter for anycast monitoring
Open, Low, Public

Description

Some more details on https://wikitech.wikimedia.org/wiki/Anycast#Limitations

Because of the nature of anycast, monitoring a VIP from a central location only ensures that the closest anycast server works.
For example, 10.3.0.1 (our recursive DNS service) is advertised from all the sites, but when pinging that VIP, the eqiad Icinga server only talks to the eqiad DNS server.
So if the ulsfo instance is misbehaving, we don't see it.

With Blackbox exporter, we can have all the sites ping the VIP and thus get more visibility.
In that setup, if the ulsfo instance dies, the ulsfo servers will fall back to the codfw servers, which results in an increase in latency.

For alerting there are multiple options:

  • Alert if the latency increases
  • Set a low TTL on the pings (so they fail if the reply is not local)
  • Graph which server replies to the queries (eg. with an equivalent of dig @10.3.0.1 CHAOS TXT id.server. +short)
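
As a rough illustration of the first option, a check could compare each site's probe latency towards the VIP against that site's usual value. The sketch below is in Python; the function name, the factor of 3 and the sample numbers are hypothetical and are not taken from any existing alert or measurement.

```python
def sites_with_latency_jump(observed_ms, baseline_ms, factor=3.0):
    """Return the sites whose latency to the VIP looks non-local.

    observed_ms / baseline_ms map a site name to a round-trip time in ms.
    A site is flagged when its probe is missing or when the observed
    latency exceeds `factor` times its baseline (hypothetical threshold).
    """
    suspects = []
    for site, baseline in baseline_ms.items():
        observed = observed_ms.get(site)
        if observed is None or observed > baseline * factor:
            suspects.append(site)
    return sorted(suspects)


if __name__ == "__main__":
    # Made-up numbers for illustration only: ulsfo is suddenly answered
    # from another site, so its round-trip time jumps.
    baseline = {"eqiad": 0.3, "codfw": 0.4, "ulsfo": 0.3}
    observed = {"eqiad": 0.3, "codfw": 0.5, "ulsfo": 40.0}
    print(sites_with_latency_jump(observed, baseline))  # ['ulsfo']
```

A fixed multiplier is the simplest choice here; a real alert would probably need tuning per site, since the distance to the next-closest anycast server differs from site to site.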

Event Timeline

ayounsi created this task.

Graph which server replies to the queries (eg. with an equivalent of dig @10.3.0.1 CHAOS TXT id.server. +short)

Depending on how much we can control the query sent, we should just set NSID. This ensures that the identifier comes back in the same answer: e.g. with dig +nsid @10.3.0.1 api-ro.discovery.wmnet. we get the answer and the identity of the answering server in the same response. However, if we do dig @10.3.0.1 api-ro.discovery.wmnet. and then dig @10.3.0.1 CHAOS TXT id.server. +short, it is possible (although arguably not too likely) that the second query will hit a different server from the first.
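
To make the single-query approach concrete, here is a minimal Python sketch of what dig +nsid does on the wire: it attaches an empty NSID option (EDNS option code 3, RFC 5001) to the query and reads the identifier back from the OPT record of the same response. It uses only the standard library. The function names are made up for this example, and it assumes the answering server has NSID enabled.

```python
import socket
import struct

NSID_OPTION = 3   # EDNS option code for NSID (RFC 5001)
OPT_RRTYPE = 41   # EDNS OPT pseudo-record type (RFC 6891)


def encode_name(name):
    """Encode a domain name as length-prefixed DNS labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"


def build_nsid_query(qname, qtype=1, query_id=0x1234):
    """Build an A query (by default) carrying an empty NSID option."""
    # Flags 0x0100 = recursion desired; 1 question, 1 additional (OPT).
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    question = encode_name(qname) + struct.pack("!HH", qtype, 1)
    nsid_opt = struct.pack("!HH", NSID_OPTION, 0)
    # OPT RR: root name, type 41, class = UDP payload size, TTL = 0.
    opt_rr = b"\x00" + struct.pack("!HHIH", OPT_RRTYPE, 1232, 0, len(nsid_opt)) + nsid_opt
    return header + question + opt_rr


def skip_name(msg, pos):
    """Return the offset just past the (possibly compressed) name at pos."""
    while True:
        length = msg[pos]
        if length == 0:
            return pos + 1
        if length & 0xC0 == 0xC0:   # compression pointer ends the name
            return pos + 2
        pos += 1 + length


def extract_nsid(msg):
    """Return the NSID bytes from a DNS message, or None if absent."""
    qd, an, ns, ar = struct.unpack("!HHHH", msg[4:12])
    pos = 12
    for _ in range(qd):
        pos = skip_name(msg, pos) + 4          # skip QTYPE + QCLASS
    for _ in range(an + ns + ar):
        pos = skip_name(msg, pos)
        rrtype, _cls, _ttl, rdlen = struct.unpack("!HHIH", msg[pos:pos + 10])
        pos += 10
        rdata = msg[pos:pos + rdlen]
        pos += rdlen
        if rrtype != OPT_RRTYPE:
            continue
        i = 0
        while i + 4 <= len(rdata):             # walk the EDNS options
            code, olen = struct.unpack("!HH", rdata[i:i + 4])
            if code == NSID_OPTION:
                return rdata[i + 4:i + 4 + olen]
            i += 4 + olen
    return None


def query_with_nsid(server, qname, timeout=2.0):
    """Send one UDP query and return the NSID of whichever server replied."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_nsid_query(qname), (server, 53))
        reply, _ = sock.recvfrom(4096)
    return extract_nsid(reply)
```

Calling query_with_nsid("10.3.0.1", "api-ro.discovery.wmnet.") from each site would return the identifier of the instance that actually answered that query, so the answer and the server identity cannot come from two different anycast nodes.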