
etcd config depends on puppet certs, but puppet doesn't know
Closed, ResolvedPublic

Description

We had a bit of a mess in Tools today, that went like this:

  1. I switched tools nodes to a new puppetmaster
  2. puppet certs were updated on tools VMs
  3. we restarted the k8s components on the k8s-master; they tried to validate the local puppet cert and check whether it matched the certs etcd presented
  4. the etcd cluster was still using the old certs, which caused kube-apiserver to fail with "etcd misconfigured" errors
  5. we restarted etcd on all the etcd clients; they picked up the new certs and then things got better

The puppet config for etcd should depend on cert changes so that the etcd service restarts immediately.
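The desired fix could be sketched in Puppet roughly as follows. This is illustrative only: the resource titles, paths, and the $puppet_cert location are assumptions, not the actual module code.

```puppet
# Sketch only: paths and titles are assumed, not the real modules/etcd code.
# $puppet_cert is assumed to point at the puppet agent's own certificate.
$puppet_cert = "/var/lib/puppet/ssl/certs/${facts['networking']['fqdn']}.pem"

file { '/var/lib/etcd/ssl/certs/cert.pem':
  ensure => file,
  source => $puppet_cert,
  notify => Service['etcd'],  # restart etcd whenever the copied cert changes
}

service { 'etcd':
  ensure => running,
}
```

With the notify edge in place, a puppet run that updates the copied cert would also refresh the etcd service in the same run, instead of leaving etcd serving the stale cert until someone restarts it by hand.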


Event Timeline

Andrew created this task.Jun 30 2017, 2:16 AM
Restricted Application added a subscriber: Aklapper. Jun 30 2017, 2:16 AM

While checking an email from Shinken about Puppet failing, I noticed the following error. Adding it here in case it's related.

Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: Starting etcd...
Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: Started etcd.
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://tools-k8s-master-01.tools.eqiad.wmflabs:2379
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_CERT_FILE=/var/lib/etcd/ssl/certs/cert.pem
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd/tools-k8s
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_INITIAL_CLUSTER=tools-k8s-master-01=http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_KEY_FILE=/var/lib/etcd/ssl/private_keys/server.key
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://tools-k8s-master-01.tools.eqiad.wmflabs:2379
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: recognized and used environment variable ETCD_NAME=tools-k8s-master-01
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: setting maximum number of CPUs to 2, total number of available CPUs is 2
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: the server is already initialized as member before, starting as etcd member...
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: listening for peers on http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: clientTLS: cert = /var/lib/etcd/ssl/certs/cert.pem, key = /var/lib/etcd/ssl/private_keys/server.key, ca = , trusted-ca = , client-cert-auth = false
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: stopping listening for peers on http://tools-k8s-master-01.tools.eqiad.wmflabs:2380
Sep 17 21:10:19 tools-k8s-master-01 etcd[23177]: open /var/lib/etcd/ssl/certs/cert.pem: no such file or directory
Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Sep 17 21:10:19 tools-k8s-master-01 systemd[1]: Unit etcd.service entered failed state.
Bstorm added a subscriber: Bstorm.Nov 21 2018, 11:14 PM

The error I mentioned above is unrelated to this issue, please ignore. It's been fixed in T211202.

GTirloni removed a subscriber: GTirloni.Dec 19 2018, 3:58 PM
Bstorm added a comment.Feb 6 2019, 7:17 PM

I think I see what's up here. In modules/etcd/manifests/ssl.pp, the etcd certs are literally copied from the puppet ones. That's a pretty strong dependency. If there's a mismatch, that would break things, and etcd wouldn't pick up new certs until a restart, which would kind of need to be timed along with any clients that use the same sort of cert scheme.

I wonder if puppet subscribe => and/or notify => mechanisms would work in this case.
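Both mechanisms express the same dependency edge, just declared from different ends. A hedged sketch of the subscribe direction (resource titles and the cert path are assumed):

```puppet
# Same dependency declared from the service side (titles/paths assumed):
service { 'etcd':
  ensure    => running,
  subscribe => File['/var/lib/etcd/ssl/certs/cert.pem'],  # refresh on cert change
}
```

subscribe on the service is equivalent to notify on the file; which one to use is mostly a question of which manifest is easier to modify.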

We could link the etcd service to the certificate file, so that when puppet modifies it, the service is restarted.

But puppet doesn't run the agent like normal when it modifies a cert. It waits for signature, etc.?
However, now that you mention it, I think I was thinking about it wrong. Why not just subscribe to: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/etcd/manifests/ssl.pp#32 ?

Certs for puppet may be core to puppet and affect evaluation, but a File resource that happens to use the puppet cert as the source is something that can be watched and acted on when it changes (the puppet cert itself may be as well...but it somehow makes my head hurt).

Change 512338 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] [RFC] etcd::ssl: restart etcd service when the SSL cert changes

https://gerrit.wikimedia.org/r/512338

aborrero added a comment.EditedMay 24 2019, 9:37 AM

But puppet doesn't run the agent like normal when it modifies a cert. It waits for signature, etc.?
However, now that you mention it, I think I was thinking about it wrong. Why not just subscribe to: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/etcd/manifests/ssl.pp#32 ?
Certs for puppet may be core to puppet and affect evaluation, but a File resource that happens to use the puppet cert as the source is something that can be watched and acted on when it changes (the puppet cert itself may be as well...but it somehow makes my head hurt).

By the time the agent has fetched the new certificate, the puppet agent itself is already working properly, so perhaps in the same agent run we could copy the cert and restart the service by means of the subscribe mechanism.

See patch for example of my idea: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/512338

Edit: perhaps another option would be a symlink instead of a hard copy/duplication of the cert file? But we would require the restart anyway for the etcd daemon to pick up the new cert file.
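The symlink option could look like this in Puppet (paths assumed, illustration only). Note that with a link, puppet no longer sees the cert content change on the managed file, so the restart would still need to be triggered some other way:

```puppet
# Symlink instead of copying (paths assumed). etcd only reads the cert at
# startup, so a restart is still required after the puppet cert rotates,
# and a plain file-content subscribe no longer fires on a link.
file { '/var/lib/etcd/ssl/certs/cert.pem':
  ensure => link,
  target => "/var/lib/puppet/ssl/certs/${facts['networking']['fqdn']}.pem",
}
```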

ArielGlenn triaged this task as Medium priority.Jun 10 2019, 6:56 AM

Change 518020 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: restart etcd service when certs change

https://gerrit.wikimedia.org/r/518020

Change 512338 abandoned by Arturo Borrero Gonzalez:
etcd::ssl: restart etcd service when the SSL cert changes

Reason:
We no longer require this in T169287

https://gerrit.wikimedia.org/r/512338

Change 518238 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: etcd: refresh server with certs changes

https://gerrit.wikimedia.org/r/518238

Change 518020 abandoned by Arturo Borrero Gonzalez:
toolforge: k8s: etcd: restart etcd service when certs change

Reason:
Using https://gerrit.wikimedia.org/r/c/operations/puppet/+/518238 instead

https://gerrit.wikimedia.org/r/518020

Change 518238 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: etcd: refresh server with certs changes

https://gerrit.wikimedia.org/r/518238

For the next toolforge kubernetes cluster we have 2 approaches for etcd:

Either way, the next version of Toolforge kubernetes should contain a fix for this.

Bstorm closed this task as Resolved.Aug 14 2019, 5:39 PM
Bstorm claimed this task.

At this point, we have ended up puppetizing the copying of puppet certs to act as etcd client certs as well as server certs in T215531: Deploy upgraded Kubernetes to toolsbeta with an "unstacked" control plane (separate etcd servers), because we found the process of dealing with node failure with a stacked control plane to be kind of awful.

Therefore, cert changes matter again! However, the fix of restarting the service if the file changes is already baked into our puppet, and I believe that @aborrero's patch above fixes this for the old systems. It *should* even fix itself eventually if the puppet CA certs change. Therefore, closing this!