
Create RIPE Atlas measurements against our authoritative DNS servers; alert on them
Open, Medium, Public

Description

We have RIPE Atlas measurements of the reachability and RTT of the Atlas anchors in our datacenters, but we don't have any against our authoritative DNS servers. Adding those would be a nice, easy, open way to get some infrastructure-level external monitoring.

  • Decide on the shape of measurements

There's something of a design discussion to be had here:

  • Is simple query success + rtt enough? If so, the built-in functionality of atlas_exporter is sufficient. But I could see us deciding we want to also validate the payload of the response in some way, which would require a fair bit more work.
  • What do we want to be querying? Is an A record for something like en.wikipedia.org enough?
  • What selection of probes do we want to monitor from? I think at least a hundred, distributed globally, with the "IPv4/v6 stable 30d/90d" tag.

For an initial version, I think what I've proposed above is sufficient, but am open to discussion.

Since we're probably going to be monitoring an IPv4 and an IPv6 address for each site, plus one IPv4 anycast address, we also have to make sure we're not going to run out of credits or consume too many of them (we want to leave excess headroom for other measurements). There's a tradeoff here between the number of probes and the interval between measurements.
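To make the headroom question concrete, here is a back-of-the-envelope sketch of daily credit consumption. The cost-per-result figure, the probe count, the site count, and the interval are all illustrative assumptions; check the current Atlas credit fee schedule before relying on any of these numbers.

```python
# Back-of-the-envelope RIPE Atlas credit budget.
# CREDITS_PER_DNS_RESULT is an assumption, not the official fee.
CREDITS_PER_DNS_RESULT = 10

def daily_credits(probes, targets, interval_s, cost=CREDITS_PER_DNS_RESULT):
    """Credits consumed per day by one recurring DNS measurement set."""
    results_per_probe_per_day = 86400 // interval_s
    return probes * targets * results_per_probe_per_day * cost

# e.g. 100 probes, v4+v6 per site across (assumed) five sites, plus one
# anycast IP = 11 target IPs, measured every 15 minutes:
spend = daily_credits(probes=100, targets=11, interval_s=900)
print(spend)  # 1056000 credits/day under these assumptions
```

Halving the probe count or doubling the interval each halves the spend, which is the tradeoff mentioned above.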

  • Create RIPE Atlas measurements against each prod public NS IP
  • Create measurements against each WMCS public NS IP
  • Add said measurements to our atlas_exporter configuration
  • Tweak grafana dashboard as necessary
  • Add some alerting on measurement results
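For the alerting item, a sketch of what a Prometheus rule over atlas_exporter metrics could look like. The metric name `atlas_dns_success` and the threshold are assumptions; verify against what our atlas_exporter actually exposes for DNS measurements before using anything like this.

```yaml
groups:
  - name: atlas_authdns
    rules:
      - alert: AuthDNSAtlasFailures
        # atlas_dns_success is an assumed metric name; check the
        # exporter's actual DNS measurement metrics.
        expr: avg by (measurement) (atlas_dns_success) < 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: ">10% of Atlas probes failing DNS queries against authdns"
```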

Event Timeline

CDanis renamed this task from Create RIPE Atlas probes against our public DNS servers; alert on them to Create RIPE Atlas measurements against our public DNS servers; alert on them. (May 21 2021, 2:34 PM)
CDanis renamed this task from Create RIPE Atlas measurements against our public DNS servers; alert on them to Create RIPE Atlas measurements against our authoritative DNS servers; alert on them.
Marostegui triaged this task as Medium priority. (May 24 2021, 7:15 AM)
Marostegui moved this task from Backlog to Acknowledged on the SRE board.

Is simple query success + rtt enough? If so, the built-in functionality of atlas_exporter is sufficient.

Definitely, for the first iteration.

But I could see us deciding we want to also validate the payload of the response in some way, which would require a fair bit more work.

I'm not sure we would want to do alerting on the payload; however, doing some analysis on it would be interesting, for instance to make sure users in the US are getting the US IP addresses. I wonder if it's better to pull the measurement data into logstash and do analytics queries like this there?

What do we want to be querying? Is an A record for something like en.wikipedia.org enough?

For a high-level "is DNS working" check I would do a SOA query for wikipedia.org; I would also set the NSID flag when configuring, as this will add some nice metadata. For content checks, ultimately it would be nice to have A/AAAA for each wiki. In relation to TCP vs UDP, both would be nice, but UDP is the one that's important; also be aware (unless things have changed) that the failure rate is generally higher for TCP, as middleboxes still often block TCP port 53.
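To make the suggestion concrete: the query described here is a SOA lookup carrying the EDNS0 NSID option (RFC 5001). A stdlib-only sketch of what that packet looks like on the wire; the transaction ID and the 4096-byte UDP payload size are arbitrary illustrative choices:

```python
import struct

def build_soa_query(qname, want_nsid=True, txid=0x1234):
    """Build a raw DNS SOA query (UDP payload), optionally with an
    EDNS0 OPT record carrying the NSID option (RFC 5001)."""
    # Header: id, flags (RD set), QDCOUNT=1, ARCOUNT=1 when OPT is added
    header = struct.pack("!6H", txid, 0x0100, 1, 0, 0, 1 if want_nsid else 0)
    # Question section: length-prefixed labels, QTYPE=6 (SOA), QCLASS=1 (IN)
    question = b"".join(
        bytes([len(label)]) + label.encode()
        for label in qname.rstrip(".").split(".")
    ) + b"\x00" + struct.pack("!HH", 6, 1)
    packet = header + question
    if want_nsid:
        # OPT RR: root name, TYPE=41, CLASS=UDP payload size (4096),
        # TTL=0, RDATA = NSID option (option-code 3, empty payload)
        nsid_rdata = struct.pack("!HH", 3, 0)
        packet += b"\x00" + struct.pack("!HHIH", 41, 4096, 0, len(nsid_rdata))
        packet += nsid_rdata
    return packet

pkt = build_soa_query("wikipedia.org")
```

With `dig`, the equivalent one-off check is `dig +nsid SOA wikipedia.org @ns0.wikimedia.org` over UDP, plus `+tcp` for the TCP variant.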

What selection of probes do we want to monitor from? I think at least a hundred, distributed globally, with the "IPv4/v6 stable 30d/90d" tag.

Sounds good; I would also say confine it to the 'Anchor' tag. I think there should be well more than 100 stable anchors.

we also have to make sure we're not going to run out of credits or consume too many of them

We can probably ask for more if we need them; from my experience RIPE is very open to that, especially for services which offer a public benefit.

Was looking at the NSID stuff based on John's suggestion, indeed it is returned by our authdns at the moment:

root@nyc2:~# for i in {0..2}; do echo -ne "ns$i:\t"; dig +nsid A en.wikipedia.org @ns$i.wikimedia.org | grep NSID; done
ns0:	; NSID: 61 75 74 68 64 6e 73 31 30 30 31 ("authdns1001")
ns1:	; NSID: 61 75 74 68 64 6e 73 32 30 30 31 ("authdns2001")
ns2:	; NSID: 64 6e 73 33 30 30 31 ("dns3001")

Might be of interest down the road if we go the Anycast route.
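The hex bytes in the NSID field are just the ASCII server identity, which dig already decodes in the parentheses; for completeness, a tiny helper that does the same decoding on the raw hex string:

```python
def decode_nsid(hex_field):
    """Decode a dig-style NSID hex dump ('61 75 ...') to its ASCII form."""
    return bytes(int(b, 16) for b in hex_field.split()).decode("ascii")

print(decode_nsid("61 75 74 68 64 6e 73 31 30 30 31"))  # authdns1001
```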

Thanks very much for all the helpful comments!

  • I'll definitely enable the NSID bit, it will be very useful especially for mapping the anycast IP
  • Ingestion into logstash or some other analysis system is a good "some day" idea. For now it's probably not too bad to use the RIPE Atlas API / CLI tools to fetch raw data if needed.
  • SOA query for wikipedia.org sounds like a good first pass.

Re: anchors vs probes I had hoped to get at least some probes on actual residential ISPs, not just points in the 'core' of the Internet, although I don't feel super strongly about this.

SGTM

Re: anchors vs probes I had hoped to get at least some probes on actual residential ISPs, not just points in the 'core' of the Internet, although I don't feel super strongly about this.

Yes, that's a good point (although depending on how Atlas classifies "stable", we may just end up with probes in datacenters instead of probes in end-user networks).

Some very initial results running against the anycast NS at https://atlas.ripe.net/frames/measurements/30366093#!probes

Thanks again for the tip about Atlas's NSID support @jbond -- it is indeed quite nice! And from a cursory glance we can spot a number of interesting things that I am holding myself back from getting nerdsniped on: a probe on Reunion Island that took 340ms to reach eqsin; a probe in Colombia being anycasted to eqsin; a probe in Houston being anycasted to eqsin; a probe in Kenya being anycasted to eqiad; multiple probes in Portugal and also Germany being anycasted to eqsin...

I used the "System: IPv4 stable 30d" tag to select 200 probes worldwide, randomly, and this does seem to select end-user probes. However because of the nature of the Atlas network, there's quite a bias towards western Europe and the US:

image.png (226×828 px, 30 KB)

In the future I'll experiment with picking a sample of probes continent by continent, instead of a random sample of 200 worldwide.
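A sketch of how the per-continent selection could be expressed in the `probes` list of a measurement-creation request. The area names, the `tags_include` key, and the tag spelling are all assumptions to verify against the RIPE Atlas API documentation; the point is just requesting a fixed quota per region rather than one worldwide pool of 200.

```python
# Hypothetical probe-selection payload for a RIPE Atlas measurement
# request. Region names and the tag/key spelling are assumptions.
REGIONS = ["West", "North-Central", "South-Central", "North-East", "South-East"]

probes = [
    {
        "type": "area",
        "value": region,
        "requested": 40,  # 5 regions x 40 = 200 probes total
        "tags_include": "system-ipv4-stable-30d",
    }
    for region in REGIONS
]
print(sum(p["requested"] for p in probes))  # 200
```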

Indeed, the nerd-sniping gravitational pull is strong on that one, and it deserves its own task.
In the short term, understanding and controlling those deviations will be useful for @ssingh's DoH project (T252132), as it will be anycast.
Having access to the probes will make investigating this easier.
The main reason traffic will be sent to site X instead of site Y is AS-PATH length. Some example causes of sub-optimal latency:

  • Peering with a remote provider in an IXP
  • Same-length AS-PATHs towards two sites, where the winner is picked by an effectively arbitrary tiebreaker (e.g. highest BGP session uptime)

To fight that, given that we have little control over which path a remote provider prefers, we can:

  • Not advertise the prefix to that peer
  • Prepend our AS to that peer
  • Use outbound BGP communities to remotely steer traffic like we do on our eqsin transits (only some providers support it)

Which can quickly increase configuration complexity.
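To illustrate the AS-PATH mechanics behind the bullets above: a remote network prefers the shorter AS-PATH, so prepending our ASN on the announcement towards one site makes that path longer and steers traffic to the other. A toy best-path selection, with made-up neighbor ASNs (14907 is Wikimedia's ASN):

```python
# Toy BGP best-path selection by AS-PATH length only (real BGP has
# many earlier tiebreakers). Neighbor ASNs 64500-64502 are made up.
def best_path(paths):
    """Pick the route with the shortest AS-PATH; ties broken by site name."""
    return min(paths, key=lambda p: (len(p["as_path"]), p["site"]))

paths = [
    {"site": "eqsin", "as_path": [64500, 14907]},         # via a distant IXP peer
    {"site": "eqiad", "as_path": [64501, 64502, 14907]},  # via transit
]
print(best_path(paths)["site"])  # eqsin wins on path length

# Prepending our ASN on the announcement to the eqsin peer lengthens it:
paths[0]["as_path"] = [64500, 14907, 14907, 14907]
print(best_path(paths)["site"])  # now eqiad
```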

Indeed, some fascinating results there alright; thanks for sharing, Chris.

While I agree with Arzhel, these results do serve to remind me that every network gets to send its egress traffic wherever it wants, regardless of any BGP attributes, with the monetarily cheapest path often being the deciding factor. I'd also agree that while some config on our side may be worthwhile, it's more important that we don't create a hard-to-maintain, overly complex policy.

In terms of the DoH project, I wonder if we should add similar Atlas measurements towards its anycast IP (185.71.138.138) once it's live? I believe it is just an experiment, so presumably we'd not want to alert on those, and we'd need to factor in how many Atlas credits we have.

Great, I'm working with Sukhbir to get that online in the next day or two, so we should be able to progress this any time after that.

And from a cursory glance we can spot a number of interesting things that I am holding myself back from getting nerdsniped on:

This is a very, very deep rabbit hole, and I think with our limited peering presence, and only being able to run the test on a limited number of probes, it will be difficult to really get a good picture, let alone fix it (see responses above). However, if you really want to dive in, I would say the RIPEstat looking-glass API is one of the better resources for digging into the why (although, be warned, I'd say you'd be lucky to answer 50% of the whys).

As to fixing these issues, from my experience it takes a massive amount of resources reaching out to peers, and also deploying hardware in more locations than I suspect we ever will. That said, NTT and I think some other peers do support a flexible set of communities, so if we see some really bad ASes we may be able to influence things.

lmata added a subscriber: lmata.