
Switch conftool to use the version 3 etcd datastore
Open, Needs Triage · Public

Description

The etcd project is progressively moving toward removing the version 2 datastore. We will eventually need to switch conftool to etcd v3, which should also allow us to use the native replication.

Sadly, the transition will require a lot of work, which I will try to summarize here. There are also two main ways to approach the integration: native gRPC, or the gRPC/JSON gateway.

Native gRPC interface

  • Allow access to the etcd v3 gRPC interface via nginx.
  • Fork or take over maintenance of the gRPC Python 3 client currently at https://github.com/kragniz/python-etcd3, which looks notably unmaintained at the moment
  • Verify auth works

gRPC gateway

  • Check that access via nginx works as intended
  • Create a Python client for the v3 HTTP API. It doesn't need to be complete; we just need the CRUD parts
  • Verify auth works
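
To give a sense of how small a CRUD-only gateway client could be, here is a minimal sketch. It assumes the `/v3` path prefix (older etcd releases used `/v3beta` or `/v3alpha`) and the gateway's convention of base64-encoding keys and values in JSON bodies. `GatewayKV`, `Transport`, and `http_transport` are hypothetical names for illustration, not existing conftool code, and auth and TLS are left out entirely.

```python
import base64
import json
import urllib.request
from typing import Callable, Optional

# Hypothetical transport type: takes (path, request body) and returns the decoded JSON reply.
Transport = Callable[[str, dict], dict]


def _b64(text: str) -> str:
    # The JSON gateway expects keys and values as base64 strings.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def _unb64(data: str) -> str:
    return base64.b64decode(data).decode("utf-8")


def http_transport(base_url: str) -> Transport:
    """POST JSON to the gateway with urllib (no auth or TLS options in this sketch)."""
    def send(path: str, body: dict) -> dict:
        req = urllib.request.Request(
            base_url + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return send


class GatewayKV:
    """CRUD subset of the etcd v3 KV API, spoken through the JSON gateway."""

    def __init__(self, transport: Transport, prefix: str = "/v3"):
        self._send = transport
        self._prefix = prefix  # assumption: adjust for older gateway versions

    def put(self, key: str, value: str) -> None:
        self._send(f"{self._prefix}/kv/put", {"key": _b64(key), "value": _b64(value)})

    def get(self, key: str) -> Optional[str]:
        resp = self._send(f"{self._prefix}/kv/range", {"key": _b64(key)})
        kvs = resp.get("kvs", [])
        if not kvs:
            return None
        # Empty values may be omitted from the JSON reply, hence the default.
        return _unb64(kvs[0].get("value", ""))

    def delete(self, key: str) -> int:
        resp = self._send(f"{self._prefix}/kv/deleterange", {"key": _b64(key)})
        # 64-bit integers are serialized as strings by the gateway.
        return int(resp.get("deleted", 0))
```

Keeping the transport injectable means the client can be exercised against a fake in unit tests, and the nginx and auth questions above stay isolated in one function.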

We then need to integrate etcd v3 into conftool, so that it can write to both datastores:

  • Add an etcd3 backend to conftool
  • Add a "proxy" backend to conftool that can write to multiple backends
  • Start writing to both backends
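
As a rough illustration of the "proxy" idea, the sketch below fans writes and deletes out to several backends and serves reads from the first one. The `write`/`read`/`delete` method names and both classes are assumptions made for this example; conftool's real driver interface is richer and would need proper error handling for partial failures.

```python
class MemoryBackend:
    """Stand-in backend for the example; a real one would talk to etcd."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data[key]

    def delete(self, key):
        del self.data[key]


class ProxyBackend:
    """Hypothetical proxy: mutations go to every backend, reads to the first."""

    def __init__(self, *backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = backends

    def write(self, key, value):
        # Backends are written in the order given; an exception stops the loop,
        # so later backends are left untouched on failure.
        for backend in self.backends:
            backend.write(key, value)

    def read(self, key):
        return self.backends[0].read(key)

    def delete(self, key):
        for backend in self.backends:
            backend.delete(key)


v2, v3 = MemoryBackend(), MemoryBackend()
proxy = ProxyBackend(v2, v3)
proxy.write("pools/foo", "enabled")
```

After the final `write`, both `v2.data` and `v3.data` hold the key, while `proxy.read` only ever consults `v2`.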

And finally, we'll need to convert clients:

  • Confd supports etcdv3 natively; we'll still have to find out if it's 100% compatible
  • MediaWiki will probably need to be adapted to call the v3 gRPC gateway
  • We might want to migrate pybal, and in that case I'd still use the v3 gRPC gateway

Status update (June 2025)

Preparatory work has started ahead of introducing an etcd v3 backend driver to conftool. Specifically, this involves:

  • Documenting the API semantics of the existing v2 driver
  • Implementing conformance tests that assert those semantics are honored (the v3 driver will be subject to the same tests)
  • Resolving under-specified or inconsistent behaviors in the v2 driver (in progress)
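
To illustrate the shape of such conformance tests, here is a simplified sketch: one function that takes a driver factory and asserts the same semantics regardless of implementation. The `NotFoundError` class, the `InMemoryDriver`, and the method names are stand-ins invented for this example, not the actual conftool code; the delete-on-missing-key check mirrors one of the behaviors being pinned down in the v2 driver.

```python
class NotFoundError(Exception):
    """Stand-in for the driver's not-found error."""


class InMemoryDriver:
    """Toy driver used only to demonstrate the checks."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        if key not in self._data:
            raise NotFoundError(key)
        return self._data[key]

    def delete(self, key):
        if key not in self._data:
            raise NotFoundError(key)
        del self._data[key]


def _raises_not_found(func, *args):
    try:
        func(*args)
    except NotFoundError:
        return True
    return False


def check_conformance(make_driver):
    """Run the shared checks on a fresh driver and return the names of those that passed."""
    driver = make_driver()
    passed = []

    driver.write("a", "1")
    assert driver.read("a") == "1", "read must return the last written value"
    passed.append("read_after_write")

    driver.delete("a")
    assert _raises_not_found(driver.read, "a"), "read of a deleted key must raise"
    passed.append("read_after_delete")

    assert _raises_not_found(driver.delete, "missing"), "delete of a missing key must raise"
    passed.append("delete_missing")

    return passed


results = check_conformance(InMemoryDriver)
```

A v3 driver would be passed to the same `check_conformance` function, so any divergence from the v2 semantics shows up as a failing assertion rather than as a surprise during migration.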

With that complete, we'll be ready to introduce a v3 driver implementing the same API, and later a (temporary) dual-write driver to support the migration (i.e., one that sequences writes across v2 and v3 depending on the migration phase). We have not yet made a final decision on which v3 Python client to use. One option under consideration is something like etcd3-client-lite, which targets the HTTP gRPC gateway.
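
For the phase-dependent sequencing, a sketch of one possible shape follows. The four phase names and their write orders are purely illustrative assumptions; the actual migration phases are still to be decided in the design. The idea shown is only that the backend written first is the one treated as authoritative in that phase.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical phases, not taken from the migration design.
    V2_ONLY = "v2_only"
    DUAL_V2_FIRST = "dual_v2_first"
    DUAL_V3_FIRST = "dual_v3_first"
    V3_ONLY = "v3_only"


# Write order per phase; the first entry is treated as authoritative.
_WRITE_ORDER = {
    Phase.V2_ONLY: ("v2",),
    Phase.DUAL_V2_FIRST: ("v2", "v3"),
    Phase.DUAL_V3_FIRST: ("v3", "v2"),
    Phase.V3_ONLY: ("v3",),
}


def write_order(phase):
    """Return the backend names to write, in order, for the given phase."""
    return _WRITE_ORDER[phase]


def sequenced_write(phase, backends, key, value):
    """Write to each backend in phase order and return the names written.

    `backends` maps a name ("v2" or "v3") to an object with a write(key, value)
    method. An exception from the first backend leaves the second untouched.
    """
    written = []
    for name in write_order(phase):
        backends[name].write(key, value)
        written.append(name)
    return written
```

Expressing the order as data rather than branching logic keeps a phase change down to a configuration switch, which seems desirable for a driver that is meant to be temporary.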

I'll soon open a separate subtask to specifically capture discussion on tradeoffs between various client options.

More generally, there already exists a detailed migration design drafted in mid-2024. I'll aim to start incrementally refreshing that and breaking out other key decisions to subtasks (covering, e.g., auth model, MediaWiki support, migration phases, etc.).

Event Timeline

Restricted Application added a subscriber: Aklapper.

Untagged sre-tools and spicerack as I've created the dedicated sub-tasks for them.

swfrench opened https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/88

drivers/etcd: Raise NotFoundError for delete on a non-existent key

oblivian merged https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/88

drivers/etcd: Raise NotFoundError for delete on a non-existent key

swfrench opened https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/91

drivers/etcd: Resolve remaining TODOs for read and listing methods

PEPE1234.13 removed Scott_French as the assignee of this task.
PEPE1234.13 triaged this task as Unbreak Now! priority.
PEPE1234.13 added a subscriber: Scott_French.
Aklapper assigned this task to Scott_French.
Aklapper lowered the priority of this task from Unbreak Now! to Needs Triage.

oblivian merged https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/91

drivers/etcd: Resolve remaining TODOs for read and listing methods