Page MenuHomePhabricator

etcd config depends on puppet certs, but puppet doesn't know
Open, NormalPublic

Description

We had a bit of a mess in Tools today, that went like this:

  1. I switched tools nodes to a new puppetmaster
  2. puppet certs were updated on tools VMs
  3. we restarted the k8s components on the k8s-master, and they tried to validate the local puppet cert and check if it matched with the certs etcd presented
  4. the etcd cluster was still using the old certs, which caused kube-apiserver to fail with etcd misconfigured errors
  5. we restarted etcd on all the etcd clients, they picked up the new certs and then thinks got better

The puppet config for etcd should depend on cert changes so that the etcd service restarts immediately.

Event Timeline

Andrew created this task.Jun 30 2017, 2:16 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 30 2017, 2:16 AM

While checking an email from Shinken about Puppet failing, I've noticed the following error. Adding it here in case it's related.

Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: Starting etcd...
Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: Started etcd.
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://tools-k8s-master-01.tools.eqiad.wmflabs:2379
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_CERT_FILE=/var/lib/etcd/ssl/certs/cert.pem
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd/tools-k8s
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_INITIAL_CLUSTER=tools-k8s-master-01=http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_KEY_FILE=/var/lib/etcd/ssl/private_keys/server.key
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://tools-k8s-master-01.tools.eqiad.wmflabs:2379
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_NAME=tools-k8s-master-01
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: setting maximum number of CPUs to 2, total number of available CPUs is 2
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: the server is already initialized as member before, starting as etcd member...
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: listening for peers on http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: clientTLS: cert = /var/lib/etcd/ssl/certs/cert.pem, key = /var/lib/etcd/ssl/private_keys/server.key, ca = , trusted-ca = , client-cert-auth = false
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: stopping listening for peers on http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: open /var/lib/etcd/ssl/certs/cert.pem: no such file or directory
Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: Unit etcd.service entered failed state.
Bstorm added a subscriber: Bstorm.Nov 21 2018, 11:14 PM

The error I mentioned above is unrelated to this issue, please ignore. It's been fixed in T211202.

GTirloni removed a subscriber: GTirloni.Dec 19 2018, 3:58 PM
Bstorm added a comment.Feb 6 2019, 7:17 PM

I think I see what's up here. In modules/etcd/manifests/ssl.pp , the etcd certs are literally copied from the puppet ones. That's a pretty strong dependency. If there's a mismatch, that would break things, and etcd wouldn't pick up new certs until a restart, which would kind of need to be timed along with any clients that use the same sort of cert scheme.

I wonder if puppet subscribe => and/or notify => mechanisms would work in this case.

We could link the etcd service to the certificate file, so when puppet modify it, the service is restarted.

But puppet doesn't run the agent like normal when it modifies a cert. It waits for signature, etc.?
However, now that you mention it, I think I was thinking about it wrong. Why not just subscribe to: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/etcd/manifests/ssl.pp#32 ?

Certs for puppet may be core to puppet and affect evaluation, but a File resource that happens to use the puppet cert as the source is something that can be watched and acted on when it changes (the puppet cert itself may be as well...but it somehow makes my head hurt).

Change 512338 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] [RFC] etcd::ssl: restart etcd service when the SSL cert changes

https://gerrit.wikimedia.org/r/512338

aborrero added a comment.EditedFri, May 24, 9:37 AM

But puppet doesn't run the agent like normal when it modifies a cert. It waits for signature, etc.?
However, now that you mention it, I think I was thinking about it wrong. Why not just subscribe to: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/etcd/manifests/ssl.pp#32 ?

Certs for puppet may be core to puppet and affect evaluation, but a File resource that happens to use the puppet cert as the source is something that can be watched and acted on when it changes (the puppet cert itself may be as well...but it somehow makes my head hurt).

By the time the agent has fetched the new certificate, the puppet agent itself is already working properly and perhaps in the same agent run we could copy the cert and restart the service by means of the subscribe mechanism.

See patch for example of my idea: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/512338

Edit: perhaps other option would be a link instead of hard cert file copy/duplication? But we would require the restart anyway for the etcd daemon to pick up the new cert file.

ArielGlenn triaged this task as Normal priority.Mon, Jun 10, 6:56 AM