
Toolforge: modernize deployment for etcd in k8s
Closed, Resolved, Public

Description

This task tracks all the work related to modernizing the Toolforge puppet code for etcd in k8s.
It turns out that a proper etcd deployment is required in order to run kubernetes, so we should do this first.

I will be developing the code in toolsbeta before an actual deployment in the tools project.

Event Timeline

aborrero triaged this task as Medium priority. Jun 19 2019, 12:07 PM
aborrero created this task.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Change 517858 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly

https://gerrit.wikimedia.org/r/517858

Change 517858 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: etcd: don't wrap profile::etcd, and use base etcd v3 directly

https://gerrit.wikimedia.org/r/517858

Change 517896 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: specify more certificates

https://gerrit.wikimedia.org/r/517896

Change 517896 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: etcd: specify more certificates

https://gerrit.wikimedia.org/r/517896

Change 517905 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: add etcd user to the puppet group

https://gerrit.wikimedia.org/r/517905

Change 517905 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: etcd: add etcd user to the puppet group

https://gerrit.wikimedia.org/r/517905

Change 517906 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: use complete fqdn in node name

https://gerrit.wikimedia.org/r/517906

Change 517906 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: etcd: use complete fqdn in node name

https://gerrit.wikimedia.org/r/517906

Change 518020 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: restart etcd service when certs change

https://gerrit.wikimedia.org/r/518020

Using puppet certs for etcd has some other tricky parts, like:

Jun 20 12:06:12 toolsbeta-arturo-k8s-etcd-1 etcd[28005]: health check for peer 5323d67b4ea7da68 could not connect: x509: cannot validate certificate for 172.16.0.243 because it doesn't contain any IP SANs
Jun 20 12:06:12 toolsbeta-arturo-k8s-etcd-1 etcd[28005]: health check for peer 7e025ec0fe50d8f2 could not connect: x509: cannot validate certificate for 172.16.0.240 because it doesn't contain any IP SANs

My current config would need some additional tweaks that I'm still investigating.
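
For context on the log lines above: Go's TLS verification, which etcd relies on, checks an IP-literal endpoint only against IP SANs in the peer certificate, and a hostname endpoint against DNS SANs. The puppet certs evidently carry no IP SANs, so peer URLs given as IP addresses cannot validate. A minimal Python sketch of that distinction follows; `required_san_type` is a hypothetical helper name for illustration, not part of etcd or puppet.

```python
import ipaddress
from urllib.parse import urlsplit


def required_san_type(endpoint: str) -> str:
    """Return "IP" if the endpoint host is an IP literal, else "DNS".

    Hypothetical helper: it only classifies the host part of a URL,
    it does not inspect any certificate.
    """
    host = urlsplit(endpoint).hostname
    if host is None:
        raise ValueError(f"no host in endpoint: {endpoint!r}")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so the verifier would look at DNS SANs.
        return "DNS"
    return "IP"


if __name__ == "__main__":
    for url in (
        "https://172.16.0.243:2380",
        "https://toolsbeta-arturo-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2380",
    ):
        print(url, "->", required_san_type(url))
```

An endpoint that classifies as "IP" needs a certificate with a matching IP SAN; one that classifies as "DNS" can be validated against the host FQDN that the certificate already contains.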

Change 518075 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: use domain names instead of IP addresses

https://gerrit.wikimedia.org/r/518075

I still can't figure out why etcd would do that.

I can connect by hand using the same certs:

root@toolsbeta-arturo-k8s-etcd-1:~# openssl s_client -CAfile /var/lib/puppet/ssl/certs/ca.pem -connect 172.16.0.240:2380 \
    -cert /var/lib/puppet/ssl/certs/toolsbeta-arturo-k8s-etcd-1.toolsbeta.eqiad.wmflabs.pem  \
    -key /var/lib/puppet/ssl/private_keys/toolsbeta-arturo-k8s-etcd-1.toolsbeta.eqiad.wmflabs.pem 
[... ok ...]

I'm using etcd 3.2.12-1 from our internal repo. It seems newer etcd versions contain an additional config flag to allow arbitrary SANs in the peer certificate: https://etcd.io/docs/v3.3.12/op-guide/configuration/#peer-cert-allowed-cn

Change 518075 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: etcd: use domain names instead of IP addresses

https://gerrit.wikimedia.org/r/518075

Change 518235 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: enable 2379/tcp for peers as well

https://gerrit.wikimedia.org/r/518235

Change 518235 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: etcd: enable 2379/tcp for peers as well

https://gerrit.wikimedia.org/r/518235

OK, after some help from @Joe and @Vgutierrez I got the cluster working:

root@toolsbeta-arturo-k8s-etcd-1:~# etcdctl --endpoints https://toolsbeta-arturo-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 cluster-health
member 4c91c386e446da71 is healthy: got healthy result from https://toolsbeta-arturo-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379
member 591dee308ea139ea is healthy: got healthy result from https://toolsbeta-arturo-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2379
member b0e3c89a1c97e359 is healthy: got healthy result from https://toolsbeta-arturo-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2379
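
A check like the one above could be automated along these lines. This is only a sketch in Python: it assumes every member line has the `member <id> is <state>` shape shown in the output above, and `unhealthy_members` is a hypothetical helper name, not an etcdctl feature.

```python
import re
from typing import List

# Matches lines such as "member 4c91c386e446da71 is healthy: got healthy result from ..."
MEMBER_RE = re.compile(r"^member (?P<id>[0-9a-f]+) is (?P<state>\w+)")


def unhealthy_members(output: str) -> List[str]:
    """Return the IDs of members whose reported state is not "healthy"."""
    bad = []
    for line in output.splitlines():
        match = MEMBER_RE.match(line.strip())
        if match and match.group("state") != "healthy":
            bad.append(match.group("id"))
    return bad


if __name__ == "__main__":
    import sys

    # Usage (assumed): etcdctl --endpoints <url> cluster-health | python3 this_script.py
    bad = unhealthy_members(sys.stdin.read())
    if bad:
        print("unhealthy members:", ", ".join(bad))
        sys.exit(1)
    print("all listed members are healthy")
```

Lines that do not match the assumed shape are ignored, so an empty result means only that no listed member reported a non-healthy state.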

Change 518020 abandoned by Arturo Borrero Gonzalez:
toolforge: k8s: etcd: restart etcd service when certs change

Reason:
Using https://gerrit.wikimedia.org/r/c/operations/puppet/+/518238 instead

https://gerrit.wikimedia.org/r/518020

Mentioned in SAL (#wikimedia-cloud) [2019-06-21T11:42:49Z] <arturo> re-create 3 VMs toolsbeta-arturo-k8s-etcd-[1-3] to test latest puppet code in T226098

Change 518247 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: also create /etc/etcd

https://gerrit.wikimedia.org/r/518247

Change 518247 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: etcd: also create /etc/etcd

https://gerrit.wikimedia.org/r/518247

I'm pretty happy now with how etcd looks in the puppet tree and the resulting state with freshly installed VMs. I will probably leave it as is and move on to kubernetes itself and see if anything else is required once k8s is actually using etcd.

For the record, I only used these hiera keys in the prefix for the basic bootstrap:

profile::etcd::cluster_bootstrap: true
profile::ldap::client::labs::client_stack: sssd
profile::toolforge::k8s::etcd_hosts:
- toolsbeta-arturo-k8s-etcd-1.toolsbeta.eqiad.wmflabs
- toolsbeta-arturo-k8s-etcd-2.toolsbeta.eqiad.wmflabs
- toolsbeta-arturo-k8s-etcd-3.toolsbeta.eqiad.wmflabs
sudo_flavor: sudo
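
To illustrate how a host list like `profile::toolforge::k8s::etcd_hosts` relates to etcd's bootstrap configuration, here is a hedged Python sketch that builds an `--initial-cluster` value in etcd's `name=peer-url` comma-separated format. It assumes that each member name equals the host FQDN (as the "use complete fqdn in node name" change suggests) and that peers listen on 2380/tcp over https. The function name `initial_cluster` is hypothetical, and this is not necessarily how the puppet code derives the value.

```python
from typing import Iterable


def initial_cluster(hosts: Iterable[str], peer_port: int = 2380) -> str:
    """Build an etcd --initial-cluster string from a list of FQDNs.

    Assumption: member name == FQDN, peer URL == https://<fqdn>:<peer_port>.
    """
    return ",".join(f"{host}=https://{host}:{peer_port}" for host in hosts)


if __name__ == "__main__":
    etcd_hosts = [
        "toolsbeta-arturo-k8s-etcd-1.toolsbeta.eqiad.wmflabs",
        "toolsbeta-arturo-k8s-etcd-2.toolsbeta.eqiad.wmflabs",
        "toolsbeta-arturo-k8s-etcd-3.toolsbeta.eqiad.wmflabs",
    ]
    print(initial_cluster(etcd_hosts))
```

Using FQDNs for both the member name and the peer URL keeps the endpoints consistent with the DNS names present in the puppet certificates.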

Mentioned in SAL (#wikimedia-cloud) [2019-07-18T12:47:13Z] <arturo> create toolsbeta-test-k8s-etcd-2 as buster to check status of latest puppet code (T226098)