Switch wikikube-staging (codfw and eqiad) etcd clusters to use PKI
Closed, Resolved (Public)

Description

In light of T329556 (K8s etcd on bullseye show TLS errors in logs), we should configure the wikikube-staging etcd clusters to use PKI certificates instead of cergen certs.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/889082/

  • staging codfw
  • staging eqiad
  • clean up cergen certs in private puppet

Event Timeline

I have deleted some logs on kubestagetcd100[5,6] because the root partition was almost full: etcd keeps logging TLS errors (error "remote error: tls: bad certificate", ServerName "k8s3-staging.eqiad.wmnet").

Mentioned in SAL (#wikimedia-operations) [2023-02-24T07:52:18Z] <elukey> rm /var/log/{syslog,messages,user.log}.1 on kubetcd1006 to free up space - T329717

Change 891749 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::etcd::v3::kubernetes::staging: move certs to PKI

https://gerrit.wikimedia.org/r/891749

Change 891749 merged by Elukey:

[operations/puppet@production] role::etcd::v3::kubernetes::staging: move certs to PKI

https://gerrit.wikimedia.org/r/891749

Mentioned in SAL (#wikimedia-operations) [2023-02-24T09:08:44Z] <elukey> rm /var/log/{syslog,messages,user.log}.1 on kubetcd1005 to free up space - T329717

elukey@kubestagetcd1004:~$ etcdctl -C https://$(hostname -f):2379 cluster-health
member 2e98c8b51153156c is healthy: got healthy result from https://kubestagetcd1006.eqiad.wmnet:2379
member a29a9e00247eef21 is healthy: got healthy result from https://kubestagetcd1005.eqiad.wmnet:2379
member c450621e7916ca97 is healthy: got healthy result from https://kubestagetcd1004.eqiad.wmnet:2379
cluster is healthy
elukey@kubestagetcd2001:~$ etcdctl -C https://$(hostname -f):2379 cluster-health
member 98ab3a19cacdf63b is healthy: got healthy result from https://kubestagetcd2002.codfw.wmnet:2379
member cec6617bc5da0995 is healthy: got healthy result from https://kubestagetcd2003.codfw.wmnet:2379
member d29bf1642e768eed is healthy: got healthy result from https://kubestagetcd2001.codfw.wmnet:2379
cluster is healthy
elukey@kubestagetcd1004:~$ echo y | openssl s_client -connect $(hostname -f):2380 | openssl x509 -text | grep "Subject Alternative Name" -A 1
[..]
            X509v3 Subject Alternative Name: 
                DNS:kubestagetcd1004.eqiad.wmnet, DNS:k8s3-staging.eqiad.wmnet, DNS:_etcd-server-ssl._tcp.k8s3-staging.eqiad.wmnet
elukey@kubestagetcd2001:~$ echo y | openssl s_client -connect $(hostname -f):2380 | openssl x509 -text | grep "Subject Alternative Name" -A 1
[..]
            X509v3 Subject Alternative Name: 
                DNS:kubestagetcd2001.codfw.wmnet, DNS:k8s3-staging.codfw.wmnet, DNS:_etcd-server-ssl._tcp.k8s3-staging.codfw.wmnet

The clusters are up and healthy, and I verified that the new SAN has been added to the certificates in both eqiad and codfw.
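
For anyone repeating this check on other clusters, here is a minimal sketch of how the two manual verifications could be scripted. It only parses text in the shape pasted in this task (the `etcdctl cluster-health` output and the `openssl x509 -text | grep -A 1` SAN lines). The function names `parse_cluster_health`, `parse_san_dns` and `missing_sans` are hypothetical helpers made up for illustration, not part of any existing tooling, and the parsing assumes the output format does not differ from the samples here.

```python
import re
from typing import Dict, List, Tuple

# Sample text copied from the eqiad session in this task.
HEALTH_SAMPLE = """\
member 2e98c8b51153156c is healthy: got healthy result from https://kubestagetcd1006.eqiad.wmnet:2379
member a29a9e00247eef21 is healthy: got healthy result from https://kubestagetcd1005.eqiad.wmnet:2379
member c450621e7916ca97 is healthy: got healthy result from https://kubestagetcd1004.eqiad.wmnet:2379
cluster is healthy
"""

SAN_SAMPLE = """\
            X509v3 Subject Alternative Name: 
                DNS:kubestagetcd1004.eqiad.wmnet, DNS:k8s3-staging.eqiad.wmnet, DNS:_etcd-server-ssl._tcp.k8s3-staging.eqiad.wmnet
"""


def parse_cluster_health(output: str) -> Tuple[bool, Dict[str, str]]:
    """Return (cluster_is_healthy, {member_id: status_word})."""
    members: Dict[str, str] = {}
    cluster_ok = False
    for raw in output.splitlines():
        line = raw.strip()
        match = re.match(r"member (\S+) is (\w+)", line)
        if match:
            members[match.group(1)] = match.group(2)
        elif line == "cluster is healthy":
            cluster_ok = True
    return cluster_ok, members


def parse_san_dns(cert_text: str) -> List[str]:
    """Return the DNS names listed on the line after the SAN header."""
    lines = cert_text.splitlines()
    for index, line in enumerate(lines):
        if "X509v3 Subject Alternative Name" in line and index + 1 < len(lines):
            entries = [entry.strip() for entry in lines[index + 1].split(",")]
            return [e[len("DNS:"):] for e in entries if e.startswith("DNS:")]
    return []


def missing_sans(cert_text: str, expected: List[str]) -> List[str]:
    """Return the expected names that are absent from the certificate SANs."""
    present = set(parse_san_dns(cert_text))
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    print(parse_cluster_health(HEALTH_SAMPLE))
    print(missing_sans(SAN_SAMPLE, ["k8s3-staging.eqiad.wmnet"]))
```

In practice the two sample strings would be replaced by the captured output of the commands above, and a non-empty result from `missing_sans` would indicate that a certificate still lacks one of the expected names.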

Change 895237 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] secrets/ssl: Remove keys for kubernetes etcd clusters

https://gerrit.wikimedia.org/r/895237

Change 895237 merged by JMeybohm:

[labs/private@master] secrets/ssl: Remove keys for kubernetes etcd clusters

https://gerrit.wikimedia.org/r/895237

JMeybohm updated the task description.

I've removed all cergen certs and config for the wikikube and ml clusters from private puppet. Subsequent Puppet runs on the etcd nodes were fine.