
Implement DNS-over-TLS for AuthDNS
Closed, Resolved, Public

Description

As the title says. We already have a lot of the easy parts of this done. What remains is correctly configuring an haproxy instance to accept TLSv1.3 and forward raw TCP connections, with PROXYv2 metadata, to the gdnsd port that's already waiting for such traffic, and setting up an LE cert with SANs matching our official nameserver hostnames. Technically this will be the "opportunistic" profile at that point, but the fact that the certs will validate in the usual browser sense against our fixed set of NS-record hostnames goes a long way as well. We can tackle better profile support at a later date.
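
For readers unfamiliar with the PROXYv2 metadata mentioned above, here is a minimal Python sketch of the binary header a proxy prepends to a backend TCP connection for an IPv4 client. It follows the published PROXY protocol v2 layout. It is not taken from the haproxy or gdnsd code, and the function name and the documentation-range addresses are assumptions made for illustration.

```python
import ipaddress
import struct

# Fixed 12-byte signature that opens every PROXY protocol v2 header.
PROXY_V2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"


def proxy_v2_header_ipv4(src_ip: str, src_port: int,
                         dst_ip: str, dst_port: int) -> bytes:
    """Build a PROXY v2 header for a TCP-over-IPv4 connection (hypothetical helper)."""
    # Address block: source address, destination address, source port, destination port.
    addresses = (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )
    ver_cmd = 0x21    # high nibble: version 2, low nibble: PROXY command
    fam_proto = 0x11  # high nibble: AF_INET, low nibble: STREAM (TCP)
    return (
        PROXY_V2_SIGNATURE
        + struct.pack("!BBH", ver_cmd, fam_proto, len(addresses))
        + addresses
    )


if __name__ == "__main__":
    # Example values only (documentation address ranges, not real hosts).
    header = proxy_v2_header_ipv4("192.0.2.10", 54321, "198.51.100.1", 853)
    print(len(header), header.hex())
```

The backend reads this header before any DNS bytes, which is how it can see the original client address even though the TCP connection itself comes from the proxy.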

Event Timeline

BBlack triaged this task as Medium priority. Dec 6 2019, 2:27 PM
BBlack created this task.
Restricted Application added a subscriber: Aklapper.

Change 556738 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dotls: define acme cert

https://gerrit.wikimedia.org/r/556738

Change 556739 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] [WIP] dotls: main implementation

https://gerrit.wikimedia.org/r/556739

Change 556738 merged by BBlack:
[operations/puppet@production] dotls: define acme cert

https://gerrit.wikimedia.org/r/556738

Change 556809 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dotls: test on dns4002

https://gerrit.wikimedia.org/r/556809

Change 556739 merged by BBlack:
[operations/puppet@production] dotls: main implementation

https://gerrit.wikimedia.org/r/556739

Change 556809 merged by BBlack:
[operations/puppet@production] dotls: test on dns4002

https://gerrit.wikimedia.org/r/556809

Change 556814 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dotls: fix listen specs

https://gerrit.wikimedia.org/r/556814

Change 556814 merged by BBlack:
[operations/puppet@production] dotls: fix listen specs

https://gerrit.wikimedia.org/r/556814

P9867 <- First internal test query on a prod dns box :)

Change 556821 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dotls: simpler and clearer listen config

https://gerrit.wikimedia.org/r/556821

Change 556821 merged by BBlack:
[operations/puppet@production] dotls: simpler and clearer listen config

https://gerrit.wikimedia.org/r/556821

Change 556827 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dotls: add ferm and NRPE monitoring via kdig

https://gerrit.wikimedia.org/r/556827

Change 556827 merged by BBlack:
[operations/puppet@production] dotls: add ferm and NRPE monitoring via kdig

https://gerrit.wikimedia.org/r/556827

Change 556831 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dotls: haproxy gdnsd dep and smooth reloads

https://gerrit.wikimedia.org/r/556831

Change 556831 merged by BBlack:
[operations/puppet@production] dotls: haproxy gdnsd dep and smooth reloads

https://gerrit.wikimedia.org/r/556831

Change 556833 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dotls: glue haproxy to gdnsd in systemd

https://gerrit.wikimedia.org/r/556833

Change 556833 merged by BBlack:
[operations/puppet@production] dotls: glue haproxy to gdnsd in systemd

https://gerrit.wikimedia.org/r/556833

This is now mostly working, with a hiera flag controlling test deployment (currently only on dns4002, which doesn't have any public authserver IPs routed into it at this time).

Reminders on the next bits to remember to work on (besides just pushing it to the rest of the fleet):

  1. Global monitoring (icinga hitting the official public IPs, which means this is just a singular-POV monitor rather than all-machines, but as with existing authdns global checks, better-than-nothing).
  2. TLS Perf Tuning, especially a secure, shared ticket key rotation system (may as well make a generic one, as we'll likely want a similar one for ats-tls, especially with TLSv1.3 just around the corner there as well).

Refactoring the dependencies a little here: the sub-point of (2) above about shared ticket key rotation won't really matter until we're anycasting, so I've made a separate task (+subtask) in T240863 to go look at that stuff later, blocking the anycast work.

What's left here really, given testing has gone great so far, is some relatively-minor config tweaks and the global monitoring.

Change 558234 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dotls: ssl tweaks

https://gerrit.wikimedia.org/r/558234

Change 558235 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dotls: enable on all servers

https://gerrit.wikimedia.org/r/558235

Change 558234 merged by BBlack:
[operations/puppet@production] dotls: ssl tweaks

https://gerrit.wikimedia.org/r/558234

Change 558235 merged by BBlack:
[operations/puppet@production] dotls: enable on all servers

https://gerrit.wikimedia.org/r/558235

Actually, we can't realistically do global monitoring from icinga either: icinga isn't on Buster, so it doesn't have the library/tool access needed to check a TLSv1.3-only service. We'll have to settle for the per-server NRPE checks for now.
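
To illustrate what a check against a TLSv1.3-only DoT service needs from its TLS library, here is a rough Python sketch. This is not the NRPE check used here (that one is built on kdig). The function names are hypothetical, and the code assumes a Python `ssl` module linked against an OpenSSL with TLSv1.3 support, which is exactly the kind of dependency an older distro release may lack.

```python
import socket
import ssl
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query: one question, class IN, no EDNS."""
    # Header: id, flags (all zero), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def frame(message: bytes) -> bytes:
    """DNS over TCP/TLS prefixes each message with a 2-byte big-endian length."""
    return struct.pack("!H", len(message)) + message


def tls13_context() -> ssl.SSLContext:
    """Certificate-validating client context that refuses anything below TLSv1.3."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


def _recv_exact(sock: ssl.SSLSocket, count: int) -> bytes:
    """Read exactly `count` bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before full response")
        buf += chunk
    return buf


def dot_query(server: str, name: str, timeout: float = 5.0) -> bytes:
    """Send one query over DoT (port 853) and return the raw DNS response."""
    with socket.create_connection((server, 853), timeout=timeout) as raw:
        with tls13_context().wrap_socket(raw, server_hostname=server) as tls:
            tls.sendall(frame(build_query(name)))
            (length,) = struct.unpack("!H", _recv_exact(tls, 2))
            return _recv_exact(tls, length)
```

A call such as `dot_query("ns0.wikimedia.org", "wikipedia.org")` would need network access. If the local TLS library cannot negotiate TLSv1.3, the handshake fails before any DNS data is exchanged.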

External queries are now working (note they all return a codfw IP without edns-client-subnet in play, because codfw is closest to my laptop and PROXYv2 is working for sending the "real" client IP from haproxy to gdnsd).

bblack@haliax:~$ kdig +nsid +tls-ca @ns0.wikimedia.org wikipedia.org A
;; TLS session (TLS1.3)-(ECDHE-SECP256R1)-(ECDSA-SECP256R1-SHA256)-(CHACHA20-POLY1305)
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 60466
;; Flags: qr aa rd; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 1

;; EDNS PSEUDOSECTION:
;; Version: 0; flags: ; UDP size: 1024 B; ext-rcode: NOERROR
;; Option (11): 0172
;; NSID: 61757468646E7331303031 "authdns1001"
;; PADDING: 385 B

;; QUESTION SECTION:
;; wikipedia.org.      		IN	A

;; ANSWER SECTION:
wikipedia.org.      	600	IN	A	208.80.153.224

;; Received 468 B
;; Time 2019-12-16 23:34:58 UTC
;; From 208.80.154.238@853(TCP) in 151.4 ms
bblack@haliax:~$ kdig +nsid +tls-ca @ns1.wikimedia.org wikipedia.org A
;; TLS session (TLS1.3)-(ECDHE-SECP256R1)-(ECDSA-SECP256R1-SHA256)-(CHACHA20-POLY1305)
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 53240
;; Flags: qr aa rd; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 1

;; EDNS PSEUDOSECTION:
;; Version: 0; flags: ; UDP size: 1024 B; ext-rcode: NOERROR
;; Option (11): 0172
;; NSID: 61757468646E7332303031 "authdns2001"
;; PADDING: 385 B

;; QUESTION SECTION:
;; wikipedia.org.      		IN	A

;; ANSWER SECTION:
wikipedia.org.      	600	IN	A	208.80.153.224

;; Received 468 B
;; Time 2019-12-16 23:35:01 UTC
;; From 208.80.153.231@853(TCP) in 97.7 ms
bblack@haliax:~$ kdig +nsid +tls-ca @ns2.wikimedia.org wikipedia.org A
;; TLS session (TLS1.3)-(ECDHE-SECP256R1)-(ECDSA-SECP256R1-SHA256)-(CHACHA20-POLY1305)
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 17804
;; Flags: qr aa rd; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 1

;; EDNS PSEUDOSECTION:
;; Version: 0; flags: ; UDP size: 1024 B; ext-rcode: NOERROR
;; Option (11): 0172
;; NSID: 646E7333303031 "dns3001"
;; PADDING: 389 B

;; QUESTION SECTION:
;; wikipedia.org.      		IN	A

;; ANSWER SECTION:
wikipedia.org.      	600	IN	A	208.80.153.224

;; Received 468 B
;; Time 2019-12-16 23:35:17 UTC
;; From 91.198.174.239@853(TCP) in 379.8 ms
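
As a reading aid for the EDNS lines in the kdig output above, here is a small Python sketch that decodes the hex payloads. The NSID strings match what kdig already prints in quotes. Reading "Option (11)" as edns-tcp-keepalive, with a timeout in units of 100 ms, is an assumption based on RFC 7828 rather than something stated in this task, and both helper names are hypothetical.

```python
def decode_nsid(hex_payload: str) -> str:
    """Decode an NSID payload that the server filled with an ASCII hostname."""
    return bytes.fromhex(hex_payload).decode("ascii")


def decode_tcp_keepalive(hex_payload: str) -> float:
    """Interpret a 2-byte edns-tcp-keepalive payload as seconds (units of 100 ms)."""
    units = int.from_bytes(bytes.fromhex(hex_payload), "big")
    return units / 10.0


if __name__ == "__main__":
    # Payloads copied from the three responses above.
    for payload in ("61757468646E7331303031",
                    "61757468646E7332303031",
                    "646E7333303031"):
        print(decode_nsid(payload))
    print(decode_tcp_keepalive("0172"), "seconds")
```

Under that assumption, `0172` is 370 units, meaning the server advertises an idle timeout of 37 seconds.
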
BBlack claimed this task.

Change 558522 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dotls: use haproxy exporter profile

https://gerrit.wikimedia.org/r/558522

Change 558522 merged by BBlack:
[operations/puppet@production] dotls: use haproxy exporter profile

https://gerrit.wikimedia.org/r/558522