
RIPE Atlas monitoring of reachability & latency towards anycasted Wikidough IP
Open, Medium, Public

Description

As (re-)discovered in T283359 ("Create RIPE Atlas measurements against our authoritative DNS servers; alert on them"), anycast routing can in practice offer many surprises.

It's possible we'll need to balance user latency vs traffic engineering configuration complexity when doing anycasting for real services.

Either way, we'll still want to monitor latency and reachability of Wikidough from many vantage points on the Internet.

This is [for now] a placeholder task to configure RIPE Atlas for this.

Event Timeline

Marostegui triaged this task as Medium priority. May 26 2021, 4:27 AM
Marostegui removed a project: SRE.

This service is now live in codfw, answering DoH, DoT and plain-old DNS.

I believe dnsdist is terminating the DoH/DoT, but regular queries on UDP/TCP 53 go directly to PowerDNS Recursor. No problem with that, but it's worth noting that we only get something back for NSID when we talk to dnsdist, and I assume that with the Atlas probes we can only make regular queries.

root@debiantest:~# dig +nsid +https www.ietf.org @wikimedia-dns.org

; <<>> DiG 9.17.13-2+0~20210520.56+debian11~1.gbp96c80e-Debian <<>> +nsid +https www.ietf.org @wikimedia-dns.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39806
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; NSID: 6d 61 6c 6d 6f 6b ("malmok")
;; QUESTION SECTION:
;www.ietf.org.			IN	A

;; ANSWER SECTION:
www.ietf.org.		944	IN	CNAME	www.ietf.org.cdn.cloudflare.net.
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.44.99
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.45.99

;; Query time: 159 msec
;; SERVER: 185.71.138.138#443(wikimedia-dns.org) (HTTPS)
;; WHEN: Wed May 26 18:11:11 BST 2021
;; MSG SIZE  rcvd: 128
root@debiantest:~# dig +nsid +tls www.ietf.org @wikimedia-dns.org

; <<>> DiG 9.17.13-2+0~20210520.56+debian11~1.gbp96c80e-Debian <<>> +nsid +tls www.ietf.org @wikimedia-dns.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50986
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; NSID: 6d 61 6c 6d 6f 6b ("malmok")
;; QUESTION SECTION:
;www.ietf.org.			IN	A

;; ANSWER SECTION:
www.ietf.org.		275	IN	CNAME	www.ietf.org.cdn.cloudflare.net.
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.44.99
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.45.99

;; Query time: 348 msec
;; SERVER: 185.71.138.138#853(wikimedia-dns.org) (TLS)
;; WHEN: Wed May 26 18:22:20 BST 2021
;; MSG SIZE  rcvd: 128
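
The NSID option in the two responses above is printed by dig as raw hex bytes followed by a text rendering. As a minimal sketch of how that hex maps to the name, the snippet below decodes it; the helper name `decode_nsid` is hypothetical and not part of dig or Wikidough, and it assumes the NSID payload is printable ASCII (the option itself is an opaque byte string).

```python
def decode_nsid(hex_bytes: str) -> str:
    """Decode a space-separated hex NSID string, as printed by dig, to text."""
    # bytes.fromhex() skips the spaces between the byte pairs.
    raw = bytes.fromhex(hex_bytes)
    # Assumption: the payload is ASCII; anything else is replaced, not raised.
    return raw.decode("ascii", errors="replace")


print(decode_nsid("6d 61 6c 6d 6f 6b"))  # malmok
```

Comparing the decoded value across vantage points is one way to see which server an anycast query landed on.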

EDIT: Removed incorrect info on regular DNS queries, managed to confuse myself.

root@debiantest:~# dig +nsid www.ietf.org @wikimedia-dns.org

; <<>> DiG 9.17.13-2+0~20210520.56+debian11~1.gbp96c80e-Debian <<>> +nsid www.ietf.org @wikimedia-dns.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35293
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
; COOKIE: bd8627f07957ae320100000060ae81eb55bd3b21e9c23314 (good)
;; QUESTION SECTION:
;www.ietf.org.			IN	A

;; ANSWER SECTION:
www.ietf.org.		1800	IN	CNAME	www.ietf.org.cdn.cloudflare.net.
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.44.99
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.45.99

;; Query time: 2467 msec
;; SERVER: 185.71.138.138#53(wikimedia-dns.org) (UDP)
;; WHEN: Wed May 26 18:14:19 BST 2021
;; MSG SIZE  rcvd: 146

Thanks for all the help with this task!

I just wanted to clarify,

I believe dnsdist is terminating the DoH/DoT, but regular queries on UDP/TCP 53 go directly to PowerDNS Recursor.

Wikidough -- by design -- only supports DoH and DoT, so we listen on ports TCP/443 and TCP/853. We have no plans to support UDP/53 or TCP/53. PowerDNS Recursor listens on localhost/53 for queries from dnsdist but does not answer queries directly. (More at https://wikitech.wikimedia.org/wiki/Wikidough#Design.)
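
Since only TCP/443 and TCP/853 are served, a check has to speak DoH or DoT rather than plain DNS. The following is a rough sketch of a DoT query using only the Python standard library: it builds a bare A query, adds the 2-byte length prefix used for DNS over TCP/TLS, and sends it to port 853. The function names (`build_query`, `frame_for_tcp`, `query_dot`) and the fixed query ID are illustrative assumptions, not anything Wikidough provides, and the response is returned unparsed.

```python
import socket
import ssl
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query (no EDNS) for `name`; qtype 1 is A."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE followed by QCLASS=IN (1).
    return header + qname + struct.pack("!HH", qtype, 1)


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix the message with its 2-byte length, as DNS over TCP/TLS requires."""
    return struct.pack("!H", len(message)) + message


def _read_exact(conn, count: int) -> bytes:
    """Read exactly `count` bytes or raise if the peer closes early."""
    data = b""
    while len(data) < count:
        chunk = conn.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed before full response")
        data += chunk
    return data


def query_dot(server: str, name: str, timeout: float = 5.0) -> bytes:
    """Send one query over DoT (TCP/853) and return the raw DNS response."""
    context = ssl.create_default_context()
    with socket.create_connection((server, 853), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=server) as tls:
            tls.sendall(frame_for_tcp(build_query(name)))
            (length,) = struct.unpack("!H", _read_exact(tls, 2))
            return _read_exact(tls, length)
```

Calling `query_dot("wikimedia-dns.org", "en.wikipedia.org")` from a host with network access should return the raw response bytes, or raise on a timeout or TLS failure; timing that call gives a crude latency figure from that one vantage point.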

OK, thanks. That actually makes sense, and is what I had originally expected.

I have fallen into a trap I've hit before, in that my home network here is secretly redirecting all plain text DNS to my own resolver. And thus when I checked if Wikidough would answer on UDP/53 it looked like it was. I'm a muppet.

Testing from elsewhere it does indeed time out:

root@nyc2:~# dig A en.wikipedia.org @wikimedia-dns.org

; <<>> DiG 9.16.16-Ubuntu <<>> A en.wikipedia.org @wikimedia-dns.org
;; global options: +cmd
;; connection timed out; no servers could be reached

Not sure what that means for the Atlas probes but we will discuss in SRE. Thanks!

RIPE Atlas probes support sending DoT queries; however, the option is not exposed anywhere in the measurement creation web UI, nor in the official ripe-atlas-tools distribution. The unofficial blaeu tools for creating one-off measurements support sending the option, and there is also a pull request against ripe-atlas-tools (lingering almost two years now).
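
Because the web UI does not expose the option, a DoT measurement would have to be requested through the measurement-creation API directly. The sketch below only builds a request body for a one-off DNS measurement with DoT enabled; it does not send anything. The field names, in particular `tls`, are written from memory of the RIPE Atlas API and should be treated as assumptions to verify against the API reference. The function name `dot_measurement_payload` and the worldwide probe selection are illustrative choices.

```python
import json


def dot_measurement_payload(target: str, query_name: str, probes: int = 10) -> dict:
    """Build a one-off DNS-over-TLS measurement request body (sketch only)."""
    definition = {
        "type": "dns",
        "af": 4,
        "target": target,
        "description": f"DoT reachability and latency towards {target}",
        "query_class": "IN",
        "query_type": "A",
        "query_argument": query_name,
        "protocol": "TCP",            # DoT runs over TCP
        "tls": True,                  # assumed field name for enabling DoT
        "set_nsid_bit": True,         # ask the server to identify itself
        "use_probe_resolver": False,  # query the target, not the probe's resolver
    }
    return {
        "definitions": [definition],
        # Assumed probe selector: `probes` probes picked worldwide.
        "probes": [{"requested": probes, "type": "area", "value": "WW"}],
        "is_oneoff": True,
    }


print(json.dumps(dot_measurement_payload("wikimedia-dns.org", "en.wikipedia.org"), indent=2))
```

Submitting the body would additionally need an API key and an HTTP POST to the measurements endpoint, which is left out here.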

The RIPE Atlas network does not support DoH by policy.