We have RIPE Atlas measurements of the reachability and RTT of the anchors in our datacenters, but none against our authoritative DNS servers. Adding those would be a nice, easy, open way to get some infrastructure-level external monitoring.
- Decide on the shape of measurements
There's something of a design discussion to be had here:
- Are simple query success and RTT enough? If so, atlas_exporter's built-in functionality is sufficient. But I could see us deciding we also want to validate the response payload in some way, which would require a fair bit more work.
- What do we want to be querying? Is an A record for something like en.wikipedia.org enough?
- Which probes do we want to measure from? I think at least a hundred, distributed globally, with the "IPv4/v6 stable 30d/90d" tags.
For an initial version, I think what I've proposed above is sufficient, but I'm open to discussion.
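For the initial version proposed above (A record for en.wikipedia.org, roughly a hundred globally distributed stable probes), the request body for one recurring DNS measurement could look roughly like this minimal Python sketch. It assumes the field names of the RIPE Atlas v2 measurement-creation API (`definitions`, `probes`, `query_argument`, `use_probe_resolver`, `set_rd_bit`, `is_oneoff`) and the `system-ipv4-stable-30d` / `system-ipv6-stable-30d` probe tags as we understand them; both should be checked against the API docs. The 300-second interval and the target address are placeholders, not decisions.

```python
import json

# Hypothetical defaults for the initial version; tune after the design discussion.
DEFAULT_QUERY = "en.wikipedia.org"
DEFAULT_PROBES = 100


def dns_measurement_request(target: str, af: int,
                            query: str = DEFAULT_QUERY,
                            probes: int = DEFAULT_PROBES,
                            interval: int = 300,
                            stability_days: int = 30) -> dict:
    """Build a request body for a recurring RIPE Atlas DNS measurement.

    Field names reflect our understanding of the v2 measurement-creation API;
    verify them against the API documentation before use.
    """
    if af not in (4, 6):
        raise ValueError("af must be 4 or 6")
    return {
        "definitions": [{
            "type": "dns",
            "af": af,
            "target": target,             # the authoritative NS IP, queried directly
            "query_class": "IN",
            "query_type": "A",
            "query_argument": query,
            "use_probe_resolver": False,  # bypass the probe's local resolver
            "set_rd_bit": False,          # authoritative servers shouldn't recurse
            "protocol": "UDP",
            "interval": interval,         # seconds between runs
            "description": f"DNS {query} A via {target}",
        }],
        "probes": [{
            "type": "area",
            "value": "WW",                # worldwide probe selection
            "requested": probes,
            "tags": {"include": [f"system-ipv{af}-stable-{stability_days}d"]},
        }],
        "is_oneoff": False,               # recurring, not a one-off measurement
    }


if __name__ == "__main__":
    # 192.0.2.53 is a documentation address standing in for a real NS IP.
    print(json.dumps(dns_measurement_request("192.0.2.53", 4), indent=2))
```

Keeping the family-specific tag derived from `af` means the IPv4 and IPv6 measurements for a site only differ in their target and `af`, which keeps the two sets of results comparable.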
Since we'll probably be monitoring an IPv4 and an IPv6 address for each site, plus one IPv4 anycast address, we also have to make sure we don't run out of credits or consume too many of them (we want to leave plenty of headroom for other measurements). There's a tradeoff here between the number of probes and the interval between measurements.
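The credit tradeoff is simple arithmetic: with 2 × sites + 1 targets, results per day = targets × probes × (86400 / interval), and credits per day = results × cost per result. A minimal Python sketch; the site count, the per-result DNS cost and the daily budget are all hypothetical inputs that would need to come from our actual site list, the current RIPE Atlas cost table and our account's limits.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class MeasurementPlan:
    sites: int               # sites with public authoritative NS IPs (hypothetical)
    probes: int              # probes requested per measurement
    interval_s: int          # seconds between runs of each measurement
    credits_per_result: int  # assumed per-result DNS cost; check the current cost table

    @property
    def targets(self) -> int:
        # one IPv4 + one IPv6 address per site, plus the single IPv4 anycast address
        return 2 * self.sites + 1

    @property
    def credits_per_day(self) -> int:
        runs_per_day = SECONDS_PER_DAY // self.interval_s
        return self.targets * self.probes * runs_per_day * self.credits_per_result


def min_interval_for_budget(sites: int, probes: int,
                            credits_per_result: int, daily_budget: int) -> int:
    """Smallest whole-second interval that keeps daily spend within daily_budget."""
    credits_per_run = (2 * sites + 1) * probes * credits_per_result
    max_runs_per_day = daily_budget // credits_per_run
    if max_runs_per_day == 0:
        raise ValueError("budget too small for even one run per day")
    # Ceiling division, so that 86400 // interval never exceeds max_runs_per_day.
    return (SECONDS_PER_DAY + max_runs_per_day - 1) // max_runs_per_day


if __name__ == "__main__":
    plan = MeasurementPlan(sites=6, probes=100, interval_s=300, credits_per_result=10)
    print(plan.targets, plan.credits_per_day)  # 13 3744000
    print(min_interval_for_budget(6, 100, 10, daily_budget=1_000_000))  # 1137
```

With these made-up inputs, 100 probes at a 5-minute interval would cost 3,744,000 credits per day, and a 1,000,000-credit budget would need an interval of at least 1137 seconds. Both knobs scale cost roughly linearly, so halving the probe count saves about as much as doubling the interval; which to give up depends on whether we value geographic coverage or detection latency more.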
- Create RIPE Atlas measurements against each prod public NS IP
- Create measurements against each WMCS public NS IP
- Add these measurements to our atlas_exporter configuration
- Tweak the Grafana dashboard as necessary
- Add alerting on the measurement results
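The rollout steps can be sketched in Python as below, under several assumptions. The NS IPs are placeholders from documentation ranges, and the measurement IDs are made up, since RIPE Atlas assigns them on creation. The exporter config shape (a top-level `measurements:` list of `id` entries) is our assumption about atlas_exporter's YAML format and should be checked against its README. The alert thresholds (more than 30% of results failing, at least 20 results) are hypothetical; in practice the alert would more likely be a Prometheus rule over the exporter's metrics, and `should_alert` only illustrates the condition.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, List, Optional


def measurement_targets(ns_ips: Iterable[str], label: str) -> List[dict]:
    """One target spec per unique NS IP; the address family comes from the IP itself."""
    specs, seen = [], set()
    for raw in ns_ips:
        addr = ipaddress.ip_address(raw)  # raises ValueError on a malformed address
        if addr in seen:
            continue  # skip duplicates so we don't pay for the same measurement twice
        seen.add(addr)
        specs.append({
            "target": str(addr),          # normalized form, e.g. lowercase IPv6
            "af": addr.version,           # 4 or 6
            "description": f"{label} authoritative DNS {addr}",
        })
    return specs


def atlas_exporter_config(measurement_ids: Iterable[int]) -> str:
    """Render a minimal exporter config listing each measurement ID once (assumed format)."""
    lines = ["measurements:"]
    lines += [f"  - id: {msm_id}" for msm_id in sorted(set(measurement_ids))]
    return "\n".join(lines) + "\n"


@dataclass
class ProbeResult:
    probe_id: int
    success: bool
    rtt_ms: Optional[float] = None


def should_alert(results: List[ProbeResult],
                 max_failure_ratio: float = 0.3,
                 min_results: int = 20) -> bool:
    """Fire only when enough probes reported and a large share of them failed."""
    if len(results) < min_results:
        return False  # too few results to tell an outage from individual probe flakiness
    failures = sum(1 for r in results if not r.success)
    return failures / len(results) > max_failure_ratio


if __name__ == "__main__":
    prod = measurement_targets(["192.0.2.53", "2001:db8::53"], "prod")
    wmcs = measurement_targets(["198.51.100.53"], "WMCS")
    for spec in prod + wmcs:
        print(spec)
    print(atlas_exporter_config([1000002, 1000001]))
```

Alerting on the share of failing probes, rather than on any single probe, is meant to keep individual flaky probes (home connections, local outages) from paging anyone.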