
etcd config depends on puppet certs, but puppet doesn't know
Closed, ResolvedPublic

Description

We had a bit of a mess in Tools today, that went like this:

  1. I switched tools nodes to a new puppetmaster
  2. puppet certs were updated on tools VMs
  3. we restarted the k8s components on the k8s-master; they tried to validate the local puppet cert and check whether it matched the certs etcd presented
  4. the etcd cluster was still using the old certs, which caused kube-apiserver to fail with "etcd misconfigured" errors
  5. we restarted etcd on all the etcd clients; they picked up the new certs and then things got better

The puppet config for etcd should depend on cert changes so that the etcd service restarts immediately.
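The desired fix could be sketched in Puppet roughly as follows. This is illustrative only: the resource titles, paths, and the $puppet_cert location are assumptions, not the actual module code.

```puppet
# Sketch only: paths and titles are assumed, not the real modules/etcd code.
# $puppet_cert is assumed to point at the puppet agent's own certificate.
$puppet_cert = "/var/lib/puppet/ssl/certs/${facts['networking']['fqdn']}.pem"

file { '/var/lib/etcd/ssl/certs/cert.pem':
  ensure => file,
  source => $puppet_cert,
  notify => Service['etcd'],  # restart etcd whenever the copied cert changes
}

service { 'etcd':
  ensure => running,
}
```

With the notify edge in place, a puppet run that updates the copied cert would also refresh the etcd service in the same run, instead of leaving etcd serving the stale cert until someone restarts it by hand.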


Event Timeline

Andrew created this task.Jun 30 2017, 2:16 AM
Restricted Application added a subscriber: Aklapper. Jun 30 2017, 2:16 AM

While checking an email from Shinken about Puppet failing, I noticed the following error. Adding it here in case it's related.

Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: Starting etcd...
Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: Started etcd.
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://tools-k8s-master-01.tools.eqiad.wmflabs:2379
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_CERT_FILE=/var/lib/etcd/ssl/certs/cert.pem
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd/tools-k8s
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_INITIAL_CLUSTER=tools-k8s-master-01=http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_KEY_FILE=/var/lib/etcd/ssl/private_keys/server.key
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://tools-k8s-master-01.tools.eqiad.wmflabs:2379
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_NAME=tools-k8s-master-01
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: setting maximum number of CPUs to 2, total number of available CPUs is 2
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: the server is already initialized as member before, starting as etcd member...
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: listening for peers on http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: clientTLS: cert = /var/lib/etcd/ssl/certs/cert.pem, key = /var/lib/etcd/ssl/private_keys/server.key, ca = , trusted-ca = , client-cert-auth = false
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: stopping listening for peers on http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: open /var/lib/etcd/ssl/certs/cert.pem: no such file or directory
Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: Unit etcd.service entered failed state.
Bstorm added a subscriber: Bstorm.Nov 21 2018, 11:14 PM

The error I mentioned above is unrelated to this issue, please ignore. It's been fixed in T211202.

GTirloni removed a subscriber: GTirloni.Dec 19 2018, 3:58 PM
Bstorm added a comment.Feb 6 2019, 7:17 PM

I think I see what's up here. In modules/etcd/manifests/ssl.pp, the etcd certs are literally copied from the puppet ones. That's a pretty strong dependency. If there's a mismatch, that would break things, and etcd wouldn't pick up new certs until a restart, which would kind of need to be timed along with any clients that use the same sort of cert scheme.

I wonder if puppet subscribe => and/or notify => mechanisms would work in this case.
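Both mechanisms express the same dependency edge, just declared from different ends. A hedged sketch of the subscribe direction (resource titles and the cert path are assumed):

```puppet
# Same dependency declared from the service side (titles/paths assumed):
service { 'etcd':
  ensure    => running,
  subscribe => File['/var/lib/etcd/ssl/certs/cert.pem'],  # refresh on cert change
}
```

subscribe on the service is equivalent to notify on the file; which one to use is mostly a question of which manifest is easier to modify.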

We could link the etcd service to the certificate file, so that when puppet modifies it, the service is restarted.

But puppet doesn't run the agent like normal when it modifies a cert. It waits for signature, etc.?
However, now that you mention it, I think I was thinking about it wrong. Why not just subscribe to: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/etcd/manifests/ssl.pp#32 ?

Certs for puppet may be core to puppet and affect evaluation, but a File resource that happens to use the puppet cert as the source is something that can be watched and acted on when it changes (the puppet cert itself may be as well...but it somehow makes my head hurt).

Change 512338 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] [RFC] etcd::ssl: restart etcd service when the SSL cert changes

https://gerrit.wikimedia.org/r/512338

aborrero added a comment.EditedMay 24 2019, 9:37 AM

But puppet doesn't run the agent like normal when it modifies a cert. It waits for signature, etc.?
However, now that you mention it, I think I was thinking about it wrong. Why not just subscribe to: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/etcd/manifests/ssl.pp#32 ?
Certs for puppet may be core to puppet and affect evaluation, but a File resource that happens to use the puppet cert as the source is something that can be watched and acted on when it changes (the puppet cert itself may be as well...but it somehow makes my head hurt).

By the time the agent has fetched the new certificate, the puppet agent itself is already working properly, so perhaps in the same agent run we could copy the cert and restart the service by means of the subscribe mechanism.

See patch for example of my idea: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/512338

Edit: perhaps another option would be a symlink instead of a hard copy/duplication of the cert file? But we would require the restart anyway for the etcd daemon to pick up the new cert file.
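The symlink option could look like this in Puppet (paths assumed, illustration only). Note that with a link, puppet no longer sees the cert content change on the managed file, so the restart would still need to be triggered some other way:

```puppet
# Symlink instead of copying (paths assumed). etcd only reads the cert at
# startup, so a restart is still required after the puppet cert rotates,
# and a plain file-content subscribe no longer fires on a link.
file { '/var/lib/etcd/ssl/certs/cert.pem':
  ensure => link,
  target => "/var/lib/puppet/ssl/certs/${facts['networking']['fqdn']}.pem",
}
```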

ArielGlenn triaged this task as Medium priority.Jun 10 2019, 6:56 AM

Change 518020 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: restart etcd service when certs change

https://gerrit.wikimedia.org/r/518020

Change 512338 abandoned by Arturo Borrero Gonzalez:
etcd::ssl: restart etcd service when the SSL cert changes

Reason:
We no longer require this in T169287

https://gerrit.wikimedia.org/r/512338

Change 518238 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: refresh server with certs changes

https://gerrit.wikimedia.org/r/518238

Change 518020 abandoned by Arturo Borrero Gonzalez:
toolforge: k8s: etcd: restart etcd service when certs change

Reason:
Using https://gerrit.wikimedia.org/r/c/operations/puppet/+/518238 instead

https://gerrit.wikimedia.org/r/518020

Change 518238 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: etcd: refresh server with certs changes

https://gerrit.wikimedia.org/r/518238

For the next toolforge kubernetes cluster we have 2 approaches for etcd:

Either way, the next version of Toolforge kubernetes should contain a fix for this.

Bstorm closed this task as Resolved.Aug 14 2019, 5:39 PM
Bstorm claimed this task.

At this point, we have ended up puppetizing the copying of puppet certs to act as etcd client certs as well as server certs in T215531: Deploy upgraded Kubernetes to toolsbeta with an "unstacked" control plane (separate etcd servers), because we found the process of dealing with node failure with a stacked control plane to be kind of awful.

Therefore, cert changes matter again! However, the fix of restarting the service if the file changes is already baked into our puppet, and I believe that @aborrero's patch above fixes this for the old systems. It *should* even fix itself eventually if the puppet CA certs change. Therefore, closing this!