_etcd-client SRV record missing for conftool cluster
Closed, Resolved (Public)

Description

I'm currently developing the etcd client for the new L4LB. I wanted to avoid hardcoding conftool hostnames by leveraging the DNS discovery features included in the etcd client, but I'm getting the following error:

vgutierrez@lvs6001:~$ ./l4lb etcd --domain eqiad.wmnet
2022/10/10 11:16:02 dns lookup errors: lookup _etcd-client-ssl-conftool._tcp.eqiad.wmnet on 10.3.0.1:53: no such host and lookup _etcd-client-conftool._tcp.eqiad.wmnet on 10.3.0.1:53: no such host

Per https://etcd.io/docs/v3.3/op-guide/clustering/#dns-discovery it seems like _etcd-client and _etcd-client-ssl SRV records should be created:

To help clients discover the etcd cluster, the following DNS SRV records are looked up in the listed order:

_etcd-client._tcp.example.com
_etcd-client-ssl._tcp.example.com
If _etcd-client-ssl._tcp.example.com is found, clients will attempt to communicate with the etcd cluster over SSL/TLS.
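For illustration, here is a minimal Go sketch of how the record names in the error above appear to be built. It is not the etcd library's code: `srvNames` is a hypothetical helper, the optional service-name suffix ("conftool") is inferred from the names in the error message, and the TLS-first ordering simply mirrors that message.

```go
package main

import "fmt"

// srvNames returns the two SRV record names a v3-style discoverer appears to
// query for a domain, TLS variant first (the order seen in the error message).
// serviceName is optional; when set it is appended to the service label.
// Hypothetical helper for illustration only, not part of the etcd client API.
func srvNames(domain, serviceName string) []string {
	names := make([]string, 0, 2)
	for _, service := range []string{"etcd-client-ssl", "etcd-client"} {
		if serviceName != "" {
			service += "-" + serviceName
		}
		names = append(names, fmt.Sprintf("_%s._tcp.%s", service, domain))
	}
	return names
}

func main() {
	// Names matching the failing lookup reported in the description.
	for _, name := range srvNames("eqiad.wmnet", "conftool") {
		fmt.Println(name)
	}
	// Names for a dedicated discovery domain without a service-name suffix.
	for _, name := range srvNames("conftool.eqiad.wmnet", "") {
		fmt.Println(name)
	}
}
```

Either form needs matching SRV records in DNS; the sketch only shows which names would have to exist.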

Event Timeline

Checking the client implementation for go.etcd.io/etcd/client/v2 v2.305.4, it looks like the SRV discoverer shares code with v3: https://github.com/etcd-io/etcd/blob/client/v2.305.4/client/v2/discover.go and that's probably why a recent v2 client requires v3-style SRV records.

you're right in that regard:

vgutierrez@lvs6001:~$ ./l4lb etcd --domain conftool.eqiad.wmnet
2022/10/10 12:55:44 dns lookup errors: lookup _etcd-client-ssl._tcp.conftool.eqiad.wmnet on 10.3.0.1:53: no such host and lookup _etcd-client._tcp.conftool.eqiad.wmnet on 10.3.0.1:53: no such host

but the etcd client is still expecting an SRV record of the form _etcd-client-ssl._tcp or _etcd-client._tcp

Yeah, this changed with v3. The problem is that, AIUI, confd uses an older version of the library and expects the simpler form we have now.

We can either add a new set of records or see if we can update confd's dependency.

The correct domain to test for read-only clients is conftool.eqiad.wmnet, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/wmnet#107

Did you mean read-write?

Hmm, from the documentation mentioned in the task description:

If etcd is using TLS, the discovery SRV record (e.g. example.com) must be included in the SSL certificate DNS SAN along with the hostname, or clustering will fail with log messages like the following:

[...] rejected connection from "10.0.1.11:53162" (error "remote error: tls: bad certificate", ServerName "example.com")

but:

$ openssl s_client -connect conf1009.eqiad.wmnet:4001 2>/dev/null </dev/null | openssl x509 -noout -ext subjectAltName -subject
X509v3 Subject Alternative Name: 
    DNS:conf1007.eqiad.wmnet, DNS:conf1006.eqiad.wmnet, DNS:conf1009.eqiad.wmnet, DNS:conf1008, DNS:conf1004, DNS:conf1009, DNS:etcd-v3.eqiad.wmnet, DNS:conf1008.eqiad.wmnet, DNS:conf1005, DNS:etcd.eqiad.wmnet, DNS:conf1006, DNS:conf1007, DNS:conf1004.eqiad.wmnet, DNS:conf1005.eqiad.wmnet
subject=CN = etcd-v3.eqiad.wmnet

so potentially we would also need to add conftool.eqiad.wmnet to the SAN list of the TLS certificate
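The same check can be sketched in Go. The sketch below assumes hostname verification follows the standard library's `x509.Certificate.VerifyHostname` rules; `sanCovers` is a hypothetical helper, and the SAN list is copied from the openssl output above rather than fetched from the server.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// sanCovers reports whether a certificate carrying the given DNS SANs would
// pass Go's standard hostname verification for name.
// Hypothetical helper for illustration; it builds a bare certificate struct
// and does not validate a real certificate chain.
func sanCovers(dnsSANs []string, name string) bool {
	cert := &x509.Certificate{DNSNames: dnsSANs}
	return cert.VerifyHostname(name) == nil
}

func main() {
	// SAN list as printed by the openssl command above.
	sans := []string{
		"conf1007.eqiad.wmnet", "conf1006.eqiad.wmnet", "conf1009.eqiad.wmnet",
		"conf1008", "conf1004", "conf1009", "etcd-v3.eqiad.wmnet",
		"conf1008.eqiad.wmnet", "conf1005", "etcd.eqiad.wmnet", "conf1006",
		"conf1007", "conf1004.eqiad.wmnet", "conf1005.eqiad.wmnet",
	}
	for _, name := range []string{
		"conf1009.eqiad.wmnet", // host name, present in the SAN list
		"conftool.eqiad.wmnet", // discovery domain, absent from the SAN list
	} {
		fmt.Printf("%s covered=%v\n", name, sanCovers(sans, name))
	}
}
```

Whether the client actually verifies against the discovery domain, rather than the host name from the SRV target, depends on the client version and its TLS configuration, so treat this only as a way to check the SAN list.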

Change 841138 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/dns@master] etcd: add records compatible with the v3 etcd library

https://gerrit.wikimedia.org/r/841138

Change 841139 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/mediawiki-config@master] etcd: use the v3-style SRV record

https://gerrit.wikimedia.org/r/841139

Change 843873 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] confd: use the v3 style srv records

https://gerrit.wikimedia.org/r/843873

Change 841138 merged by Giuseppe Lavagetto:

[operations/dns@master] etcd: add records compatible with the v3 etcd library

https://gerrit.wikimedia.org/r/841138

Change 843873 merged by Giuseppe Lavagetto:

[operations/puppet@production] confd: use the v3 style srv records

https://gerrit.wikimedia.org/r/843873

Vgutierrez assigned this task to Joe.

vgutierrez@lvs6001:~$ ./liberica etcd --config /home/vgutierrez/config.yaml
Using config file: /home/vgutierrez/config.yaml
2022/11/22 11:52:15 Spawning Watchers...
2022/11/22 11:52:15 Watching /conftool/v1/pools/drmrs/cache_text/ats-tls
2022/11/22 11:52:15 Watching /conftool/v1/pools/drmrs/ncredir/nginx
2022/11/22 11:52:15 etcd endpoints discovered: [https://conf1009.eqiad.wmnet.:4001 https://conf1007.eqiad.wmnet.:4001 https://conf1008.eqiad.wmnet.:4001]

endpoints are now being discovered as expected. Thanks @Joe

Change 841139 merged by Giuseppe Lavagetto:

[operations/mediawiki-config@master] etcd: use the v3-style SRV record

https://gerrit.wikimedia.org/r/841139

Mentioned in SAL (#wikimedia-operations) [2023-01-03T14:05:36Z] <oblivian@deploy1002> Started scap: Backport for [[gerrit:841139|etcd: use the v3-style SRV record (T320397)]]

Mentioned in SAL (#wikimedia-operations) [2023-01-03T14:07:30Z] <oblivian@deploy1002> oblivian and oblivian: Backport for [[gerrit:841139|etcd: use the v3-style SRV record (T320397)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-01-03T14:13:34Z] <oblivian@deploy1002> Finished scap: Backport for [[gerrit:841139|etcd: use the v3-style SRV record (T320397)]] (duration: 07m 58s)