
etcd: adapt etcd-backup.py for etcd 3.4
Open, Stalled, Medium, Public

Description

In etcd 3.4 the backup subcommand has gone away, so we'll need to adapt our etcd-backup script to use a different approach:

Feb 03 20:01:00 aux-k8s-etcd2003 etcd-backup[659074]: Error: unknown command "backup" for "etcdctl"
Feb 03 20:01:00 aux-k8s-etcd2003 etcd-backup[659074]: Run 'etcdctl --help' for usage.

Event Timeline

Setting the environment variable ETCDCTL_API=2 for the backup script may be an option as well.

Change #1117588 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] etcd-backup: ensure api v2 is used in newer etcd versions

https://gerrit.wikimedia.org/r/1117588

Change #1117588 abandoned by Herron:

[operations/puppet@production] etcd-backup: ensure api v2 is used in newer etcd versions

Reason:

this won't work as expected, going for a different approach

https://gerrit.wikimedia.org/r/1117588

Thought about this over the weekend a bit.

To summarize my current understanding: we have etcd-backup.py on etcd hosts to create backups via etcdctl backup, which captures the contents of the v2 datastore. With etcd 3.4, ETCDCTL_API=3 becomes the default, and our script no longer works without setting ETCDCTL_API=2.

Thinking back on https://gerrit.wikimedia.org/r/1117588 -- this patch indeed wouldn't work for capturing v3 backups, but in hindsight it doesn't seem like a bad thing to set ETCDCTL_API=2 in the sense that the existing backup script should consistently create v2 backups regardless of default api version.
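For illustration, pinning the API version for the existing v2 backup could look roughly like the Python sketch below. This is not the contents of etcd-backup.py: the function names are hypothetical, and the data and backup directories are placeholders supplied by the caller. The only things assumed about etcdctl are the v2 `backup` subcommand with its `--data-dir` and `--backup-dir` flags and the `ETCDCTL_API` environment variable.

```python
import os
import subprocess


def v2_backup_command(data_dir, backup_dir):
    """Build the v2-only `etcdctl backup` invocation (hypothetical helper)."""
    return ["etcdctl", "backup", "--data-dir", data_dir, "--backup-dir", backup_dir]


def v2_environment(base_env=None):
    """Return a copy of the environment with the etcdctl API pinned to v2."""
    env = dict(os.environ if base_env is None else base_env)
    # Force v2 regardless of the default shipped with the installed etcdctl.
    env["ETCDCTL_API"] = "2"
    return env


def run_v2_backup(data_dir, backup_dir):
    """Run the v2 backup; raises CalledProcessError on a non-zero exit."""
    subprocess.run(
        v2_backup_command(data_dir, backup_dir),
        env=v2_environment(),
        check=True,
    )
```

Setting the variable in the child environment only, rather than globally on the host, keeps other etcdctl users on the default API version.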

Looking through puppet, I see that snapshot means a few things in etcd. We do manage snapshot-count as it relates to in-memory size and compaction, but I'm not yet finding e.g. a snapshot save for v3 datastore snapshots/backups.

In terms of next steps, I'm thinking along the lines of:

  • Create a snapshot/v3 variant of etcd-backup with ETCDCTL_API=3
  • Go back and set ETCDCTL_API=2 for the current etcd-backup script, for v2 backups on newer hosts
  • Run both backup and snapshot save (v2/v3) scripts in parallel along the same schedule on etcd hosts.
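
A snapshot/v3 variant along the lines of the first step could be sketched as follows. This is a minimal sketch under assumptions, not the script from any of the patches referenced here: the function names, the snapshot file naming scheme, and the endpoint and backup directory arguments are all hypothetical. It relies only on the v3 `etcdctl snapshot save <file>` subcommand and the global `--endpoints` flag. A TLS-enabled cluster would likely also need `--cacert`, `--cert`, and `--key`, which are left out here.

```python
import os
import subprocess
from datetime import datetime, timezone


def snapshot_path(backup_root, now=None):
    """Return a timestamped snapshot file path (naming scheme is an assumption)."""
    now = now or datetime.now(timezone.utc)
    return os.path.join(backup_root, now.strftime("etcd-snapshot-%Y%m%dT%H%M%SZ.db"))


def v3_snapshot_command(endpoint, target):
    """Build the v3 `etcdctl snapshot save` invocation."""
    return ["etcdctl", "--endpoints", endpoint, "snapshot", "save", target]


def run_v3_snapshot(endpoint, backup_root):
    """Save a v3 snapshot and return its path; raises on a non-zero exit."""
    # Pin API v3 explicitly so the script does not depend on the host default.
    env = dict(os.environ, ETCDCTL_API="3")
    target = snapshot_path(backup_root)
    subprocess.run(v3_snapshot_command(endpoint, target), env=env, check=True)
    return target
```

Keeping this separate from the v2 script means each one pins its own API version, so the two can run on the same schedule without affecting each other.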

herron triaged this task as Medium priority. Feb 18 2025, 3:56 PM

Change #1120602 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] etcd: add etcd-backup-v3 script

https://gerrit.wikimedia.org/r/1120602

Change #1120899 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::etcd::v3::aux_k8s_etcd: remove backups

https://gerrit.wikimedia.org/r/1120899

Change #1120899 merged by Elukey:

[operations/puppet@production] role::etcd::v3::aux_k8s_etcd: remove backups

https://gerrit.wikimedia.org/r/1120899

Mentioned in SAL (#wikimedia-operations) [2025-02-19T09:02:20Z] <elukey> elukey@cumin1002:~$ sudo cumin --m async 'aux-k8s-etcd*' 'systemctl stop etcd-backup.timer etcd-backup.service' 'rm /lib/systemd/system/etcd-backup.service /lib/systemd/system/etcd-backup.timer' 'systemctl daemon-reload' - T385727

MatthewVernon subscribed.

[I've put the Observability tag onto this task as otherwise it's showing up in the Clinic Duty triage list]

herron changed the task status from Open to Stalled. Apr 7 2025, 2:56 PM
herron moved this task from Inbox to Radar on the SRE Observability board.

Change #1120602 abandoned by Herron:

[operations/puppet@production] etcd: add etcd-backup-v3 script

Reason:

https://gerrit.wikimedia.org/r/1120602