Use blackbox exporter for anycast monitoring
Open, Low, Public

Assigned To

None

Authored By

	ayounsi
	Jun 29 2022, 12:27 PM

Description

Some more details on https://wikitech.wikimedia.org/wiki/Anycast#Limitations

Because of the nature of Anycast, monitoring a VIP from a central location only ensures that the closest anycast server works.
For example, 10.3.0.1 (our recursive DNS service) is advertised from all the sites, but when pinging that VIP, the eqiad Icinga server only talks to the eqiad DNS server.
So if the ulsfo instance is misbehaving, we don't see it.

With Blackbox exporter, we can have all the sites ping the VIP and thus gain more visibility.
In that setup, if the ulsfo instance dies, the ulsfo servers will fall back to the codfw servers, resulting in an increase in latency.

For alerting there are multiple options:

Alert if the latency increases
Set a low TTL on the pings (so they fail if the reply is not local)
Graph which server replies to the queries (eg. with an equivalent of dig @10.3.0.1 CHAOS TXT id.server. +short)
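
The three options above can be combined into a single per-site check. The following is only a rough sketch in Python, not a proposed implementation: the `ProbeResult` fields, the `max_ratio` threshold, and the assumption that the server identifier contains the site name (as in the hypothetical `resolver1.ulsfo.example`) are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    probe_site: str             # site the probe ran from, e.g. "ulsfo"
    success: bool               # did the VIP answer at all (e.g. within a low TTL)?
    rtt_ms: Optional[float]     # round-trip time, None if the probe failed
    answered_by: Optional[str]  # server identifier from the reply, None if unknown


def evaluate(result: ProbeResult, baseline_rtt_ms: float,
             max_ratio: float = 3.0) -> List[str]:
    """Return the reasons to alert; an empty list means the site looks healthy."""
    if not result.success:
        # Low-TTL option: a reply from a remote site never arrives.
        return ["probe failed"]
    reasons = []
    # Latency option: a fallback to another site shows up as a jump in RTT.
    if result.rtt_ms is not None and result.rtt_ms > baseline_rtt_ms * max_ratio:
        reasons.append("latency increased")
    # Identity option: assumes the identifier embeds the site name (hypothetical).
    if result.answered_by is not None and result.probe_site not in result.answered_by:
        reasons.append("answered by non-local server")
    return reasons
```

For instance, a ulsfo probe answered by a codfw server at many times the usual latency would trip both the latency and the identity checks, while a failed probe short-circuits to a single reason.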

Event Timeline

ayounsi triaged this task as Low priority. Jun 29 2022, 12:27 PM

ayounsi created this task.

Restricted Application added a subscriber: Aklapper. Jun 29 2022, 12:27 PM

lmata edited projects, added Observability-Metrics; removed SRE Observability. Aug 1 2022, 8:18 PM

lmata subscribed.

lmata moved this task from Inbox to Backlog on the Observability-Metrics board. Sep 6 2022, 7:47 PM

Graph which server replies to the queries (eg. with an equivalent of dig @10.3.0.1 CHAOS TXT id.server. +short)

Depending on how much we can control the query sent, we should just set NSID. This ensures that the identifier comes in the same answer, e.g. with dig +nsid @10.3.0.1 api-ro.discovery.wmnet. we get the answer and the answering server in the same response. However, if we do dig @10.3.0.1 api-ro.discovery.wmnet. and then dig @10.3.0.1 CHAOS TXT id.server. +short, it is possible (although arguably not too likely) that the second query will hit a different server from the first.
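
To illustrate what a single NSID-carrying query looks like on the wire, here is a minimal standard-library Python sketch. It builds an ordinary query with an empty EDNS NSID option (option code 3, RFC 5001) and pulls the identifier out of the OPT record of the reply. The function names, the 1232-byte UDP size, and the timeout are illustrative assumptions, and there is no handling of truncated or malformed replies.

```python
import socket
import struct

NSID_OPTION_CODE = 3  # EDNS option code for NSID (RFC 5001)
OPT_RR_TYPE = 41      # EDNS OPT pseudo-record type (RFC 6891)


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_nsid_query(qname: str, query_id: int, qtype: int = 1) -> bytes:
    """Build a recursive query (default type A) carrying an empty NSID option."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    question = encode_name(qname) + struct.pack("!HH", qtype, 1)
    nsid_option = struct.pack("!HH", NSID_OPTION_CODE, 0)
    opt_rr = (b"\x00" + struct.pack("!HHIH", OPT_RR_TYPE, 1232, 0, len(nsid_option))
              + nsid_option)
    return header + question + opt_rr


def skip_name(msg: bytes, offset: int) -> int:
    """Return the offset just past a (possibly compressed) name."""
    while True:
        length = msg[offset]
        if length & 0xC0 == 0xC0:  # compression pointer: two bytes, ends the name
            return offset + 2
        offset += 1
        if length == 0:
            return offset
        offset += length


def extract_nsid(msg: bytes):
    """Return the NSID payload from a DNS message, or None if absent."""
    _id, _flags, qd, an, ns, ar = struct.unpack_from("!HHHHHH", msg, 0)
    offset = 12
    for _ in range(qd):
        offset = skip_name(msg, offset) + 4  # skip QTYPE and QCLASS
    for _ in range(an + ns + ar):
        offset = skip_name(msg, offset)
        rtype, _cls, _ttl, rdlen = struct.unpack_from("!HHIH", msg, offset)
        offset += 10
        rdata_end = offset + rdlen
        if rtype == OPT_RR_TYPE:
            pos = offset
            while pos + 4 <= rdata_end:
                code, length = struct.unpack_from("!HH", msg, pos)
                pos += 4
                if code == NSID_OPTION_CODE:
                    return msg[pos:pos + length]
                pos += length
        offset = rdata_end
    return None


def ask(server: str, qname: str, timeout: float = 2.0):
    """Send one UDP query; return the raw reply and the NSID found in it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_nsid_query(qname, query_id=0x1234), (server, 53))
        reply, _ = sock.recvfrom(4096)
    return reply, extract_nsid(reply)
```

Calling `ask("10.3.0.1", "api-ro.discovery.wmnet.")` would then return the answer and the responder's identifier from the same packet, so both are guaranteed to come from the same anycast instance.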

KOfori subscribed. Apr 5 2023, 6:38 PM

BCornwall subscribed. Sep 12 2023, 3:42 PM
