
Migrate the etcd main cluster to cfssl-based PKI
Closed, ResolvedPublic

Description

There are two contexts in which the etcd main cluster still uses TLS certs signed by the puppet 5 CA, which in turn blocks role::configcluster hosts moving to puppet 7:

  • profile::etcd::tlsproxy - The nginx-based authenticating proxy supports only the sslcert::certificate define (loading the CN=etcd-v3.(eqiad|codfw).wmnet certificates).
  • profile::etcd::v3 - etcd itself (for both peer and direct client communication, the latter being limited to nginx and etcd-mirror) supports both sslcert::certificate and profile::pki::get_cert, but in the configcluster use case still relies on the former (loading the CN=_etcd-server-ssl._tcp.v3.(codfw|eqiad).wmnet certificates).

To unblock moving to puppet 7, we need to:

  • Add support for cfssl-based PKI in profile::etcd::tlsproxy.
  • Migrate etcd's nginx proxy to PKI via the above.
  • Migrate etcd itself to PKI using the existing support in profile::etcd::v3 (gated on the use_pki_certs hiera key, and already used by other etcd clusters we run, e.g., for k8s).
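For concreteness, the last step amounts to flipping a hiera toggle. A minimal sketch (the fully-qualified key path and file location are assumptions based on the profile named above):

```yaml
# hieradata for the configcluster role (illustrative location):
profile::etcd::v3::use_pki_certs: true
```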

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/puppet | production | +0 -27
operations/puppet | production | +0 -26
operations/puppet | production | +0 -26
operations/puppet | production | +0 -25
operations/puppet | production | +1 -1
operations/puppet | production | +1 -3
operations/puppet | production | +1 -1
operations/puppet | production | +1 -0
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +1 -2
operations/puppet | production | +2 -1
operations/puppet | production | +2 -1
operations/puppet | production | +1 -0
operations/puppet | production | +10 -32
operations/puppet | production | +3 -4
operations/puppet | production | +1 -5
operations/puppet | production | +1 -0
operations/puppet | production | +2 -2
operations/puppet | production | +0 -1
operations/puppet | production | +1 -0
operations/puppet | production | +38 -12
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2025-11-10T16:10:11Z] <swfrench-wmf> begin rolling restart of codfw-associated confds after conf2006 etcd restart - T352245

The pilot on conf2006 today went smoothly. No issues encountered following the switch to the new cfssl/pki-based certs.

Total elapsed time from SIGTERM to the new process becoming ready to serve client requests was ~ 13s: ~ 8s of graceful shutdown, followed by ~ 5s rejoining the cluster and recovering state. Note for future reference that there will be transient errors emitted during the shutdown process (as cancellations propagate, etc.).

I think we're in a good spot to proceed with the rest of the cluster later this week, tentatively Wednesday.


Additional points of note for Wednesday:

We'll need to exercise some care around PyBal (conf2004) and etcdmirror (conf2005).

In particular, we need to decide whether it makes sense to migrate them away from their respective nodes while the changes are ongoing. The main question is whether doing so would be strictly more disruptive than cleanly recovering from a 10s+ window in which etcd will be partially or fully unavailable.

For etcdmirror, I'd propose that we manually stop and start the service on conf2005, bracketing the etcd restart. If the timings today are representative, this should not result in a longer time window without replication (and thus risk of the last-replicated index falling off the 1000-event window) than temporarily switching to a peer host would. We will need to silence EtcdReplicationDown during the switch.

For PyBals located in codfw, I'm on the fence. If we consider (clean) PyBal restarts to be trivially safe, then it may actually reduce cognitive burden during the migration to migrate them away from conf2004 and back again (i.e., rather than having to run whack-a-mole restarts in response to edge cases around watch index recovery).

For PyBals located in codfw, I'm on the fence. If we consider (clean) PyBal restarts to be trivially safe, then it may actually reduce cognitive burden during the migration to migrate them away from conf2004 and back again (i.e., rather than having to run whack-a-mole restarts in response to edge cases around watch index recovery).

Sounds good. Restarting PyBal one by one is perfectly fine to do and helps avoid any surprises for later restarts as well.

Change #1203556 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hiera: temporarily point codfw LVS at conf2006

https://gerrit.wikimedia.org/r/1203556

Change #1203557 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hiera: switch codfw etcd-main cluster to cfssl/pki

https://gerrit.wikimedia.org/r/1203557

Thanks to @ssingh and @MoritzMuehlenhoff for the reviews.

Alas, the work described in T352245#11360536 could not happen today due to last-minute conflicts.

I also realized there are two aspects of the plan I had in mind that I need to double check, specifically around:

  1. how I was planning to handle etcd-mirror during the switch on conf2005 (i.e., coordination of manual etcd-mirror stop / start while also needing to run puppet-agent); and
  2. whether a preemptive leader move (i.e., to decouple from restarts) will be possible at all given the tricks we play with advertised client URLs.

I'll follow up on these today.

Following up on the items mentioned in T352245#11367783:

  1. how I was planning to handle etcd-mirror during the switch on conf2005 (i.e., coordination of manual etcd-mirror stop / start while also needing to run puppet-agent); and

To recap, we don't want an ill-timed update to be replicated by etcd-mirror just as the node-local etcd member is shutting down, since it risks a "torn" update (i.e., the observed update is applied to the local cluster, but the subsequent replication-index update fails; see Etcd2Writer.write) which is hard to recover from without a destructive reload.

Instead, when migrating conf2005, I was planning to manually stop etcd-mirror, run puppet agent to apply the certificate change to etcd, and then start etcd-mirror again.

That of course does not work if puppet's intent is for etcd-mirror to be running on the node - i.e., puppet will start it again, concurrent with restarting etcd.

Unless there's some puppet-agent magic to instruct it not to touch services (i.e., don't notify, don't (re)start if not running), then the "simplest" option here is to temporarily disable (via puppet) etcd-mirror on conf2005 before the update, then reenable it after the dust settles. As long as there's no more than one mirror running at a time, it should be fine to run it manually on another node in the time between.

  2. whether a preemptive leader move (i.e., to decouple from restarts) will be possible at all given the tricks we play with advertised client URLs.

Since "clean" shutdown is rarely all that clean, I want to preemptively move leadership away from any soon-to-be-updated node.

I was concerned that this would not work, for the same reason we cannot use etcdctl to apply mutating operations in the read-only cluster (codfw), since etcd advertises client URLs that point to the proxy, which will in turn reject those requests.

However, I forgot that etcdctl in ETCDCTL_API=3 mode - which is what we have to use anyway, since move-leader does not exist in the v2 API - doesn't consult advertised client endpoints at all, and just uses the static set of endpoints you provide directly. So yeah, no problems there.
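For future reference, a sketch of what that looks like in practice. The endpoint list is illustrative and the member ID is taken from the logs later in this task; the commands are printed rather than executed (dry run) so the shape is easy to review:

```shell
#!/bin/sh
# Sketch: preemptive leadership transfer with etcdctl in v3 mode.
# Endpoints and target member ID below are illustrative placeholders.
set -eu
ENDPOINTS="https://conf2004.codfw.wmnet:2379,https://conf2005.codfw.wmnet:2379,https://conf2006.codfw.wmnet:2379"
TARGET="bc455dd2ab948ee"   # hex member ID of the desired new leader

run() { echo "$@"; }   # dry run: swap 'echo' for real execution

# Find the current leader. In v3 mode, etcdctl consults only the
# endpoints given here, not the advertised client URLs:
run env ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" endpoint status -w table
# Transfer leadership to the target member:
run env ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" move-leader "$TARGET"
```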

In any case, I'll work this into my plan for proceeding with the update next week.

Fun fact: at least as of right now, conf2006 is now leader. Since leadership terms tend to be fairly long in our environment, we might not need to deal with this at all.

Mentioned in SAL (#wikimedia-operations) [2025-11-17T18:57:47Z] <swfrench-wmf> disable puppet on A:lvs-codfw for pybal config change - T352245

Change #1203556 merged by Scott French:

[operations/puppet@production] hiera: temporarily point codfw LVS at conf2006

https://gerrit.wikimedia.org/r/1203556

Mentioned in SAL (#wikimedia-operations) [2025-11-17T19:05:40Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-17T19:11:43Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-17T19:15:30Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-17T19:21:37Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-17T19:27:07Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-17T19:27:29Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-17T19:30:54Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-17T19:31:31Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-codfw (T352245)

Change #1206452 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hiera: temporarily move etcd replication to conf2006

https://gerrit.wikimedia.org/r/1206452

Change #1206453 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hiera: move etcd replication back to conf2005

https://gerrit.wikimedia.org/r/1206453

I've gone ahead and moved codfw PyBals to conf2006 today, so that preparation step is out of the way.

I've also posted additional patches related to disabling / relocating etcd-mirror replication, as described in T352245#11376150.

Unless anything comes up in the interim, I'll aim to get this done tomorrow (Tuesday) with a potential fallback day of Wednesday. In either case, this would happen during the 15:00 - 17:00 UTC window.

Mentioned in SAL (#wikimedia-operations) [2025-11-18T15:35:14Z] <swfrench-wmf> disable puppet on A:conf-codfw - T352245

Mentioned in SAL (#wikimedia-operations) [2025-11-18T15:39:19Z] <swfrench-wmf> silenced EtcdReplicationDown db7447af-851f-4faa-a4fd-b535ee9fbcdb - T352245

Change #1206452 merged by Scott French:

[operations/puppet@production] hiera: temporarily disable etcd replication on conf2005

https://gerrit.wikimedia.org/r/1206452

Mentioned in SAL (#wikimedia-operations) [2025-11-18T15:48:45Z] <swfrench-wmf> transferred etcd-mirror replication to conf2006 - T352245

Change #1203557 merged by Scott French:

[operations/puppet@production] hiera: switch codfw etcd-main cluster to cfssl/pki

https://gerrit.wikimedia.org/r/1203557

Mentioned in SAL (#wikimedia-operations) [2025-11-18T16:10:12Z] <swfrench-wmf> migrating etcd to PKI certs on conf2004 - T352245

Mentioned in SAL (#wikimedia-operations) [2025-11-18T16:18:30Z] <swfrench@deploy2002> Locking from deployment [ALL REPOSITORIES]: Hold deployments during etcd certificate change - T352245

Mentioned in SAL (#wikimedia-operations) [2025-11-18T16:22:29Z] <swfrench-wmf> migrating etcd to PKI certs on conf2005 - T352245

Mentioned in SAL (#wikimedia-operations) [2025-11-18T16:23:02Z] <swfrench@deploy2002> Unlocked for deployment [ALL REPOSITORIES]: Hold deployments during etcd certificate change - T352245 (duration: 04m 32s)

Mentioned in SAL (#wikimedia-operations) [2025-11-18T16:25:00Z] <swfrench-wmf> begin rolling restarts of codfw-associated confds - T352245

Mentioned in SAL (#wikimedia-operations) [2025-11-18T16:28:15Z] <swfrench-wmf> restarted navtiming on webperf2003 - T352245

Change #1206453 merged by Scott French:

[operations/puppet@production] hiera: move etcd replication back to conf2005

https://gerrit.wikimedia.org/r/1206453

Mentioned in SAL (#wikimedia-operations) [2025-11-18T16:38:08Z] <swfrench-wmf> transferred etcd-mirror replication back to conf2005 - T352245

Mentioned in SAL (#wikimedia-operations) [2025-11-18T16:55:29Z] <swfrench-wmf> deleted EtcdReplicationDown silence db7447af-851f-4faa-a4fd-b535ee9fbcdb - T352245

Change #1206922 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hiera: point codfw LVS back to conf2004

https://gerrit.wikimedia.org/r/1206922

Alright, that went quite smoothly, if somewhat involved.

Some observations:

Leadership transfer: By the time we started work on this today, conf2005 had become the leader. While I'd considered using move-leader to preemptively move leadership to another node, upon closer inspection of the logs from the prior update of conf2004, I had more confidence in just letting etcd's shutdown handlers manage it:

Nov 18 16:11:02 conf2004 etcd[2123154]: received terminated signal, shutting down...
[ ... ]
Nov 18 16:11:09 conf2004 etcd[2123154]: skipped leadership transfer for stopping non-leader member

and indeed that worked as expected:

Nov 18 16:20:26 conf2005 etcd[2619405]: received terminated signal, shutting down...
[ ... ]
Nov 18 16:20:33 conf2005 etcd[2619405]: 1368857fef7c66b6 starts leadership transfer from 1368857fef7c66b6 to bc455dd2ab948ee
Nov 18 16:20:33 conf2005 etcd[2619405]: 1368857fef7c66b6 [term 12939] starts to transfer leadership to bc455dd2ab948ee
Nov 18 16:20:33 conf2005 etcd[2619405]: 1368857fef7c66b6 sends MsgTimeoutNow to bc455dd2ab948ee immediately as bc455dd2ab948ee already has up-to-date log
Nov 18 16:20:33 conf2005 etcd[2619405]: 1368857fef7c66b6 [term: 12939] received a MsgVote message with higher term from bc455dd2ab948ee [term: 12940]
Nov 18 16:20:33 conf2005 etcd[2619405]: 1368857fef7c66b6 became follower at term 12940

Ultimately, I'm happy we managed to kick the tires on letting etcd do its thing.

etcd-mirror: What I ended up doing here was disabling etcd-mirror replication on conf2005 by merging https://gerrit.wikimedia.org/r/1206452 and running puppet-agent, and then, as soon as it had exited (recall, it can take up to 60s if there are no etcd writes to mirror), manually starting it on conf2006:

sudo -i /usr/bin/etcd-mirror --strip --src-prefix / --dst-prefix / --src-ignore-keys-regex '/spicerack/locks/etcd(/.*)?' https://conf1009.eqiad.wmnet:4001 https://conf2006.codfw.wmnet:2379

Then, once the migration had completed, I merged https://gerrit.wikimedia.org/r/1206453, stopped the manual etcd-mirror on conf2006 (waiting for it to exit), and then ran puppet-agent on conf2005. All told, that resulted in ~ 20-30s without replication, which is entirely tolerable.

I probably could have safely accelerated that by manually starting etcdmirror--eqiad-wmnet.service on conf2005 before running puppet-agent (which should correctly leave the service untouched when it becomes ensure: running, since there are no managed file changes that should result in a notify - h/t to @MoritzMuehlenhoff for the discussion on this), but it was simpler to just run the agent given the time involved.
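A related thought for next time: a noop agent run reports what puppet would change, which is one way to gain confidence before letting it manage services mid-migration. A dry-run sketch (commands printed rather than executed; unit name as above):

```shell
#!/bin/sh
# Sketch: preview what puppet would do before letting it touch services.
set -eu
run() { echo "$@"; }   # dry run: swap 'echo' for real execution

# A noop run reports pending changes (including service notifies/restarts)
# without applying them:
run puppet agent --test --noop
# Confirm the unit state before and after the real run:
run systemctl status etcdmirror--eqiad-wmnet.service
```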

Liberica control-plane: Keeping an eye on the rate of etcd errors reported by the liberica_etcd_errors_total metric was quite handy in quickly assessing whether liberica-cp daemons were able to recover from the transient disruptions. h/t @Vgutierrez for pointing this out.
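For future reference, the query I was watching was roughly of this shape (metric name as above; the label grouping and rate window are illustrative choices):

```
sum by (instance) (rate(liberica_etcd_errors_total[5m]))
```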


Anyway, I'm just writing this all down in detail in order to provide raw input for potentially automating portions of this in the future.

Mentioned in SAL (#wikimedia-operations) [2025-11-18T19:16:56Z] <swfrench-wmf> disable puppet on A:lvs-codfw for pybal config change - T352245

Change #1206922 merged by Scott French:

[operations/puppet@production] hiera: point codfw LVS back to conf2004

https://gerrit.wikimedia.org/r/1206922

Mentioned in SAL (#wikimedia-operations) [2025-11-18T19:23:22Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-18T19:24:08Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-18T19:29:15Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-18T19:30:03Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-18T19:37:31Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-18T19:37:52Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-18T19:42:27Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-codfw (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-11-18T19:42:47Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-codfw (T352245)

I'll be out the remainder of this week, but when I return I'd like to get this moving forward in eqiad.

The sequencing would look similar to codfw, though some aspects will be simpler (e.g., we don't need to juggle etcd-mirror around between destination-side hosts).

In eqiad, conf1008 is the host with "nothing special going on" whereas conf1007 is the eqiad profile::pybal::config_host and conf1009 is the etcd-mirror source host.

Given that, the high-level procedure would look like the following:

Day 1

  • Migrate conf1008 (switch to new certs, verify, restart clients [0]).
  • Temporarily point eqiad PyBals at conf1008.

Day 2

  • Silence EtcdReplicationDown, stop puppet and the etcd-mirror service on conf2005, and manually shift the replication source to conf1008 (similar to what we did in T405950).
  • Migrate conf1009 (switch to new certs, verify).
  • Switch back to etcd-mirror service targeting conf1009 on conf2005 and reenable puppet. Delete EtcdReplicationDown silence.
  • Migrate conf1007 (switch to new certs, verify).
  • Restart clients.
  • Move eqiad PyBals back to conf1007.

Similar to what we did in codfw, during each one of these migrations, we would take the scap lock to prevent MediaWiki deployments.

[0] This includes eqiad-associated confds, eqiad navtiming, and hiddenparma.
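The "verify" steps above aren't spelled out; a minimal sketch of one check, confirming which CA issued the certificate etcd presents to clients (host and port are illustrative, and the command is printed rather than executed):

```shell
#!/bin/sh
# Sketch: confirm the serving cert is now PKI-issued (host/port illustrative).
set -eu
HOST="conf1008.eqiad.wmnet"
PORT=2379
run() { echo "$@"; }   # dry run: swap 'echo' for real execution

# Inspect the issuer and expiry of the presented certificate:
run "echo | openssl s_client -connect ${HOST}:${PORT} 2>/dev/null | openssl x509 -noout -issuer -enddate"
```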

FWIW, the plan for eqiad sounds good to me

Change #1213600 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: enable cfssl/pki for etcd on conf1008

https://gerrit.wikimedia.org/r/1213600

Change #1213601 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: temporarily point eqiad LVS at conf1008

https://gerrit.wikimedia.org/r/1213601

Change #1213602 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: enable cfssl/pki for etcd on all configcluster hosts

https://gerrit.wikimedia.org/r/1213602

Change #1213603 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: point eqiad LVS back to conf1007

https://gerrit.wikimedia.org/r/1213603

Change #1213600 merged by Scott French:

[operations/puppet@production] hieradata: enable cfssl/pki for etcd on conf1008

https://gerrit.wikimedia.org/r/1213600

Mentioned in SAL (#wikimedia-operations) [2025-12-02T16:08:41Z] <swfrench-wmf> migrating etcd to PKI certs on conf1008 - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T16:08:50Z] <swfrench@deploy2002> Locking from deployment [MediaWiki]: Hold deployments during etcd certificate change - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T16:12:35Z] <swfrench@deploy2002> Unlocked for deployment [MediaWiki]: Hold deployments during etcd certificate change - T352245 (duration: 03m 45s)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T16:14:23Z] <swfrench-wmf> begin rolling restarts of eqiad-associated confds - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T16:18:15Z] <swfrench-wmf> restarted navtiming on webperf1003 - T352245

Change #1213601 merged by Scott French:

[operations/puppet@production] hieradata: temporarily point eqiad LVS at conf1008

https://gerrit.wikimedia.org/r/1213601

Mentioned in SAL (#wikimedia-operations) [2025-12-02T16:53:02Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T16:53:39Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T16:58:42Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T16:59:10Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T17:01:56Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T17:02:28Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T17:05:37Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T17:06:09Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:01:28Z] <swfrench-wmf> silenced EtcdReplicationDown (42a82757-2075-44fd-b057-ec9ed2afeb90) - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:04:28Z] <swfrench-wmf> manually transferred codfw etcd replication source to conf1008 - T352245

Change #1213602 merged by Scott French:

[operations/puppet@production] hieradata: enable cfssl/pki for etcd on all configcluster hosts

https://gerrit.wikimedia.org/r/1213602

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:10:22Z] <swfrench@deploy2002> Locking from deployment [MediaWiki]: Hold deployments during etcd certificate change - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:12:18Z] <swfrench-wmf> migrating etcd to PKI certs on conf1009 - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:16:34Z] <swfrench-wmf> manually transferred etcd replication source back to conf1009 - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:19:46Z] <swfrench-wmf> deleted EtcdReplicationDown silence (42a82757-2075-44fd-b057-ec9ed2afeb90) - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:22:20Z] <swfrench-wmf> migrating etcd to PKI certs on conf1007 - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:23:49Z] <swfrench-wmf> begin rolling restarts of eqiad-associated confds - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:26:51Z] <swfrench-wmf> restarted navtiming on webperf1003 - T352245

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:27:57Z] <swfrench@deploy2002> Unlocked for deployment [MediaWiki]: Hold deployments during etcd certificate change - T352245 (duration: 17m 35s)

Change #1213603 merged by Scott French:

[operations/puppet@production] hieradata: point eqiad LVS back to conf1007

https://gerrit.wikimedia.org/r/1213603

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:36:15Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:36:54Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:40:11Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:41:02Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:47:33Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:47:56Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:53:29Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-eqiad (T352245)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T18:53:58Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-eqiad (T352245)

Alright, that should be everything. All configcluster hosts have had etcd migrated to use cfssl-based PKI, which should unblock migration to Puppet 7.

I've also updated https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Individual_cluster_configuration to reflect this.

Edit: I should also have mentioned that enough time has passed since codfw was migrated that I was able to confirm certificate refresh works as expected - i.e., etcd automatically picks up the new certificates for both the nginx-facing and peer-to-peer use cases.

Many thanks to all who have assisted in moving this forward.

Although there may be some further cleanups to remove the now-unused certificates, no further action is planned as part of this task.

Follow-on work to migrate configcluster hosts to Puppet 7 is in progress (thanks to @MoritzMuehlenhoff) and is being tracked as part of T349619.

Change #1182693 merged by Muehlenhoff:

[operations/puppet@production] conf/codfw: Remove now obsolete cert

https://gerrit.wikimedia.org/r/1182693

Change #1227307 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] conf/etcd: Remove now obsolete cert

https://gerrit.wikimedia.org/r/1227307

Change #1227309 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] conf/etcd: Remove now obsolete cert

https://gerrit.wikimedia.org/r/1227309

Change #1182694 merged by Muehlenhoff:

[operations/puppet@production] conf/eqiad: Remove obsolete cert

https://gerrit.wikimedia.org/r/1182694

Change #1227307 merged by Muehlenhoff:

[operations/puppet@production] conf/etcd: Remove now obsolete cert

https://gerrit.wikimedia.org/r/1227307

Change #1227309 merged by Muehlenhoff:

[operations/puppet@production] conf/etcd: Remove now obsolete cert

https://gerrit.wikimedia.org/r/1227309