
Migrate etcd::tlsproxy Nginx certs and etcd itself to PKI
Open, Needs Triage | Public

Description

The etcd-v3.[eqiad|codfw].wmnet certs used by Nginx on the conf* hosts are currently signed by the old Puppet 5 CA, via the sslcert::certificate() define and cergen. They need to be moved to the PKI before the conf servers can be migrated to Puppet 7.

profile::etcd::v3 needs to switch to PKI certs as well; there is a Hiera flag, use_pki_certs, for that purpose (already in use by the other etcd clusters we run).

Event Timeline

When we make the change, it will require a restart of etcd on the nodes.

We will need to make the change, then restart all pybals connected to the specific server and, when we're done, also restart all confd instances.

After doing the first server, I'd keep an eye on errors from mediawiki as well.

To be clear: this isn't a small change with limited impact, so it should be done outside of change freeze periods.
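
As a rough illustration of the sequencing described above, here is a minimal Python sketch that only builds an ordered checklist; it does not run anything. The host names, the `pybals_by_host` mapping, and the function name are hypothetical. Placing the confd restart once at the very end, and the MediaWiki check right after the first server, is one reading of the comments above, not a confirmed procedure.

```python
from typing import Dict, List


def rollout_plan(conf_hosts: List[str],
                 pybals_by_host: Dict[str, List[str]]) -> List[str]:
    """Return an ordered, human-readable checklist for a one-server-at-a-time rollout.

    conf_hosts: conf servers in the order they will be changed (hypothetical names).
    pybals_by_host: which pybal hosts are connected to which conf server
                    (hypothetical mapping, supplied by the operator).
    """
    steps: List[str] = []
    for index, host in enumerate(conf_hosts):
        steps.append(f"{host}: apply PKI cert change")
        steps.append(f"{host}: restart etcd")
        # Restart only the pybals connected to the server that just changed.
        for lb in pybals_by_host.get(host, []):
            steps.append(f"{lb}: restart pybal")
        # Assumption: pause after the first server to watch MediaWiki errors.
        if index == 0:
            steps.append("check MediaWiki errors before continuing")
    # Assumption: confd is restarted once, after every server is done.
    steps.append("all hosts: restart confd")
    return steps


if __name__ == "__main__":
    example = rollout_plan(
        ["conf-a", "conf-b"],
        {"conf-a": ["lb-1", "lb-2"], "conf-b": ["lb-3"]},
    )
    for number, step in enumerate(example, start=1):
        print(f"{number}. {step}")
```

Keeping the plan as plain data makes it easy to review before anything is touched, which seems appropriate for a change of this impact.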

MoritzMuehlenhoff renamed this task from Migrate etcd::tlsproxy Nginx certs to PKI to Migrate etcd::tlsproxy Nginx certs and etcd itself to PKI. Nov 30 2023, 2:09 PM
Scott_French subscribed.

My initial plan was to move etcd to PKI as part of the v3 API migration (T350565), which is also likely to do away with the TLS proxy.

However, the ETA for that is likely measured in "months from now" so I can explore this earlier if needed.

Took a closer look at this today: this should be trivial in the case where we turn up the new v3-API-only etcd cluster using PKI from day 1.

Naively, my main concern would be support for runtime cert reload (for rotation w/o node restarts), but that's been supported since v3.2 [0] (we're on v3.3.25, as are the etcd clusters that support various k8s deployments, which are already on PKI, so no surprise there).

The other concern that comes to mind is eventual rotation of the intermediate CA certs, as etcd is known to have issues with non-disruptive reloading of trusted CA bundles to support rotation [1] (relevant to client auth for peer-peer communication in our case).

Looking more closely at what we actually do, this should not be a problem either: in profile::etcd::v3, the trusted CA cert is simply the internal root CA cert, while the client (peer) certs are chained (i.e., including the intermediate).
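
To make the "chained" point concrete: because peers trust only the root, an intermediate rotation should not require reloading the trusted CA bundle, provided each peer cert file carries the intermediate alongside the leaf. Below is a minimal Python sketch, under that assumption, that counts the certificates in a PEM file. The function names are hypothetical, and the check is purely structural: it does not verify signatures, ordering, or expiry.

```python
import re
from typing import List

# Matches each PEM-encoded certificate block in a bundle.
_PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(pem_text: str) -> List[str]:
    """Return every certificate block found in pem_text, in file order."""
    return _PEM_CERT.findall(pem_text)


def looks_chained(pem_text: str) -> bool:
    """True if the file holds at least two certs (leaf plus intermediate).

    Structural check only: nothing here proves the second cert actually
    signed the first.
    """
    return len(split_pem_bundle(pem_text)) >= 2


if __name__ == "__main__":
    import sys

    # Usage: python check_chain.py /path/to/peer-cert.pem
    with open(sys.argv[1], encoding="ascii") as handle:
        print("chained" if looks_chained(handle.read()) else "leaf only")
```

A full verification against the root would need a real X.509 library or the openssl CLI; this only catches the case where the intermediate is missing from the file altogether.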

Open questions:

  • Is there any value in creating a new intermediate, separate from "etcd" used by clusters supporting k8s?
  • Is there any value in creating distinct certs for client vs. peer connections? (analogous to what we have now)

[0] https://etcd.io/docs/v3.3/op-guide/security/#notes-for-tls-authentication

[1] https://github.com/etcd-io/etcd/issues/11555

Is there any value in creating a new intermediate, separate from "etcd" used by clusters supporting k8s?

The main benefit here would be decoupling between rather different use cases.

For example, this would make it easy to have different (default) signing policies (e.g., expiry). Then again, the latter doesn't really require a new intermediate - we could just add a new signing profile.

The other possible benefit is providing a boundary for client auth between etcd peers. Whether that's really meaningful or not given the realities of our environment I'll need to explore a bit more.

Is there any value in creating distinct certs for client vs. peer connections? (analogous to what we have now)

Note: "analogous to what we have now" is referring specifically to the main etcd cluster (i.e., we have clients connect via the nginx TLS proxy, which uses distinct certs). In k8s, it's the same certs.

The main benefit of separating them would again be different signing policies. IMO, that's not something we need from day 1.

In any case, I think at this point the plan of record (PoR) is to migrate to PKI as part of the v3 API migration, possibly using a different intermediate (this decision has no bearing on the timeline, though).