Page MenuHomePhabricator

etcdmirror does not recover from a cleared waitIndex
Closed, ResolvedPublic

Description

Issue

This evening, the etcdmirror service on conf2005 failed after it hit the case described in [0] (and the Note just above that heading).

At 01:39:44 UTC, the service failed with CRITICAL: The current replication index is not available anymore in the etcd source cluster. The last mirrored event prior to that was at 00:00:26 UTC (index: 3020127).

We inspected the replication index key like so:

$ curl https://conf2005.codfw.wmnet:4001/v2/keys/__replication
{"action":"get","node":{"key":"/__replication","dir":true,"nodes":[{"key":"/__replication/conftool","value":"3020127","modifiedIndex":5150719,"createdIndex":4271}],"modifiedIndex":2140,"createdIndex":2140}}

and confirmed (1) it matched the index from the last event in the logs and (2) it was just a bit more than 1000 events behind the latest X-Etcd-Index returned by conf1009 (then 3021261).

We recovered by manually advancing /__replication/conftool to 3020127 + 999 (i.e., just prior to when we fell off the 1000 event retention window) since this should not allow for lost events (in contrast, advancing all the way to the then-current X-Etcd-Index could have), and restarting the service.

We believe this happened due to the combination of an unusually quiet period in the /conftool keyspace and an unusually busy period outside - e.g., spicerack locks.

Next steps

In the near term, we should harden our use of etcdmirror to prevent a recurrence.

Options considered:

  1. Extend etcdmirror to use X-Etcd-Index to automatically recover from this situation.
  2. Reconfigure etcdmirror to replicate (nearly) the entire keyspace.

After investigating both options, #2 is far simpler, and is in fact preferable over the current state (i.e., it is good to have spicerack lock state replicated). Thus, it is the current PoR.

In the long term (however long it takes to get fully off the v2 API), etcdmirror will be replaced with some other solution TBD that supports v3 (most likely etcdctl make-mirror).

[0] https://etcd.io/docs/v2.3/api/#watch-from-cleared-event-index

Event Timeline

That etcdmirror is mirroring only the /conftool keys it's totally news to me, I assumed it was replicating the whole content of etcd. But indeed it does not:

$ sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 ls  /
/conftool
/spicerack
/test

vs

$ sudo etcdctl --endpoints https://conf2005.codfw.wmnet:4001 ls  /
/conftool

Is there any specific reason we shouldn't replicate it all? If replicating it all this issue should never happen right?
Spicerack follows the SRV records for the endpoint, so if that gets failed over to codfw spicerack should follow and I would have expected to find the locks replicated there too.

That etcdmirror is mirroring only the /conftool keys it's totally news to me, I assumed it was replicating the whole content of etcd. But indeed it does not:

$ sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 ls  /
/conftool
/spicerack
/test

vs

$ sudo etcdctl --endpoints https://conf2005.codfw.wmnet:4001 ls  /
/conftool

Is there any specific reason we shouldn't replicate it all? If replicating it all this issue should never happen right?
Spicerack follows the SRV records for the endpoint, so if that gets failed over to codfw spicerack should follow and I would have expected to find the locks replicated there too.

No reason. Pretty simply at the time there was nothing beyond conftool to replicate, and we wanted to leave space for cluster-local usage of etcd.

If we need to replicate /spicerack as well, we can change replication or spin up a second instance.

Thanks, Riccardo. Yes, indeed - this particular issue should generally not happen if the entire keyspace is mirrored (IIRC, there are non-keyspace events that can advance the index, but (a) I might be misremembering and (b) those are presumably administrative / rare).

Also that makes sense about having the spicerack locks replicated to the follower site making a failover safer for cookbooks.

I suspect the main concern would be increased load on the mirror, but I don't have a good intuition for whether that's a practical concern. We can certainly load test that, though.

Speaking of which, I still need to confirm from the nginx access logs that our explanation (spicerack lock events) holds up. I plan to do that this morning. (I need those kinds of numbers anyway to calibrate benchmarks for the v3 migration)

But a second instance wouldn't prevent the current issue, right?

Rather than trying to pull access logs off the conf hosts (as they're rather large, and I'd like to avoid stressing them), I just ended up looking at the etcd grafana dashboard instead: https://grafana.wikimedia.org/goto/NsEfMXTSz?orgId=1

Of interest is the aggregated requests by method panel (attached).

Pulling the data behind that panel down and spot checking the number of (PUT, POST, DELETE) events between 00:00:26 UTC and 01:39:44 UTC we end up at ~ 1000 as expected (a bit short, which makes sense since we're using time buckets that don't cleanly align with our window).

What's interesting about the graph is that the upticks look close to what I'd expect for spicerack locks - e.g., roughly balanced POST/DELETE most of the time (this is the queueing mutex in python-etcd, so we expect paired ops except when the key TTLs out).

I'd also expect DELETE to slightly exceed POST, since we also clear keys when their KeyLocks structure is empty (no spicerack locks held), which we see here too.

The fact that POST/DELETE significantly outpaces PUT much of the time is interesting too, and suggests contention - i.e., not all etcd lock acquisitions result in spicerack lock acquisition ... though this can also be explained to an extent by the fact that not all lock acquisitions result in a PUT, as noted above.

In any case, this is enough for me to generally say "looks about right" :)

T358636_updates.png (304×622 px, 38 KB)

After thinking about this a bit more today, I think I'm onboard with the idea of expanding replication to include /spicerack.

If we're comfortable with that being the default for the entire keyspace (i.e., even for new workloads) then that's a fairly straightforward config change [0]. If for some reason we want to limit replication to _just_ /spicerack and /conftool, that's a still-fairly straightforward code change on top of that.

I do need to think a little bit about what a failover [1] would look like for spicerack lock state. Naively, as long as the global-read-only period is longer than the etcd lock TTL [2] (which it certainly will be if the latter stays 15s) then we should end up in the happy state that no possibly-confusing lock keys exist in the failed-to cluster when it becomes the primary [3]. The KeyLocks payloads would of course still be there (the important part).

I also gave some thought to what a general solution for this issue (recovery from the wait index being cleared from etcd's event history) in etcd-mirror might look like, and it gets tricky pretty quickly for non-trivial cases.

[0] Though it would be a full reload, so I'm wondering if there are precautions we should take (e.g., pointing codfw clients at eqiad temporarily).

[1] https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Master_cluster_switchover

[2] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/1840a8fce79c45a35c3b000b31a5a861a577d181/spicerack/locking.py#323

[3] I say possibly confusing, because the underlying functionality used to generate lock waiter keys (https://etcd.io/docs/v2.3/api/#atomically-creating-in-order-keys) is based on the *cluster-local* etcd index, which might be wildly different between sites.

In the meantime, I remembered why we were only replicating the /conftool keyspace:

in the past, we had another prefix called /eventlogging that was used only in eqiad and was supposed to potentially be busy.

Given the flimsy nature of etcdmirror, and the fact that eventlogging data was eqiad-only, we restricted replication.

But there is also another consideration: we know the write rate from conftool is generally reasonably low, enough not to get etcdmirror to lag significantly. There is a risk that we increase the lag of replication for something that is fundamental to running the live website to allow having a backup of etcd data for reimages and cookbooks. As @Scott_French already noted, it's not obvious what a failover scenario would look like in that case.

Is uniformity worth the risk? For sure, we'll need to do some load testing as @Scott_French mentioned.

If we're comfortable with that being the default for the entire keyspace (i.e., even for new workloads) then that's a fairly straightforward config change [0].

We control what goes into etcd quite closely.

As for changing the replication: one way to do it that doesn't require a full reload is to do the following

  • set the main etcd cluster in read-only mode
  • dump the /spicerack keyspace from eqiad, load it into codfw[1]
  • Modify the replication metadata

but I think a depool and reload of all data is the better option.

One thing we should consider and test is if I was smart enough when writing etcdmirror and explicitly considered the case where we replicate the whole root keyspace or not. I would imagine I did not, which would mean that for instance the key for replication would be invalid. So we might still need a small patch to etcd-mirror.

[1] We can just make an etcdmirror configuration just for /spicerack replication and start the replication with --reload just for that, or use any dump tools.

Of course I don't want to add additional lag to any live-traffic data (pybal, mwconfig, dbctl) and if we deem adding spicerack locks to the replication might cause that let's find another solution. For example we could have a failover etcd cookbook that when run will read the active locks from the primary cluster and manually replicate them on the secondary one explicitely. Or any other viable option.

That said, if we add it to the replication, there is no need to overcomplicate it. If done at the right time (like for example right now, that there are no cookbooks in-flight*) there is no need to do anything. We can also "restart" from scratch as we can even rm -rf /spicerack entirely right before enabling the replication for it if that simplify things. There is no persistent data.

The current data is only directories:

$ sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 ls -r /spicerack
/spicerack/locks
/spicerack/locks/custom
/spicerack/locks/etcd
/spicerack/locks/modules
/spicerack/locks/cookbooks

We control what goes into etcd quite closely.

Ah, that is a good point: changes in workload are strictly limited to what we already (or will in the future) allow by ACL.

One thing we should consider and test is ...

Also a great point. So, from some local testing earlier today, mirroring the full keyspace does "work" without any modification to etcdmirror.

However, that's the result of etcd's canonicalization of paths with superfluous slashes, which works by way of returning 301 to the correct path, which python-etcd understands.

To avoid the redirects, we'll need to adjust the Etcd2Writer a bit. I have a patch for this that I'm happy to send out.

Next steps

Alright, it sounds like the next step here is to design a load test for etcdmirror, geared toward assessing the replication latency impact on a hypothetical "rare but critical" op in the presence of a high background rate of "less-critical" ops.

I'll give that some thought this afternoon.

If we decide we're comfortable with the outcome, then opening up the full keyspace could happen either with a reload or some combination of what we've discussed above.

For example, if we're able to ensure there's no cookbooks running like Riccardo observed earlier today, then:

  1. making eqiad RO
  2. deleting /spicerack in eqiad
  3. stopping replication in codfw
  4. copying /__replication/conftool to /__replication in codfw [0]
  5. restarting replication in codfw for the full keyspace
  6. making eqiad RW again

would probably be fine.

Not sure about this /test key in eqiad ... is that actually used for anything? If not, we should remove it if we're not going to do a full reload, as certain types of operations on it in the future will break replication.

[0] Or something else if we decide to change the replication index key structure in etcdmirror (e.g., make it so mirroring / corresponds to /__replication/idx or something).

Last Friday, I put together a simple stress test for etcd-mirror, with the goal of measuring replication delay under a range of background write rates from an "antagonist" workload. Test setup and results: https://phabricator.wikimedia.org/P58402.

In short, mean replication delay stays around ~ 4ms and tail latency remains predictably low well beyond a background write rate of 100 put/s - i.e., ~ 50x the typical peak write rate seen day-to-day in production (~ 1.8 put/s).

Tail latency degrades noticeably past ~ 500 put/s and replication fails around 700 put/s (where replication delay crosses the event history limit of 1000 events).

The most notable caveat of this test is that it's optimistic: it's measuring ideal / peak replication throughput, as (1) there is no inter-DC network delay and (2) no delay from consensus rounds among etcd nodes (though disk I/O is still a factor).

Even so, I think the margins are wide enough here that we should be able to easily accommodate replicating spicerack lock state (with the previously noted caveat that some portions of that state cannot be meaningfully interpreted across clusters, but should TTL out quickly).

If folks are comfortable moving forward with full-keyspace mirroring given the above, I'll get started planning that out (and send off the patch I mentioned previously).

@Scott_French great job, indeed I think we can just go forward and do it, even if there's a few things that look suboptimal.

This got me thinking: if we're not really interested in what is in /spicerack, we could add to etcdmirror the ability to watch a keyspace but ignore some sub-keyspaces in replication.

This would solve the 1k event window problem but also eliminates the problem of replicating keys that don't matter or won't really work when replicated.

This got me thinking: if we're not really interested in what is in /spicerack, we could add to etcdmirror the ability to watch a keyspace but ignore some sub-keyspaces in replication.

The only part we could skip replicating is /spicerack/locks/etcd, not the whole /spicerack. We're interested in the content of /spicerack in general ;)
If for any reason we failover etcd from eqiad to codfw if we don't mirror its content we risk to run concurrent cookbooks that should not run concurrently. And not all of them are short-lived, some run for a week...

What Scott was referring in prior comments is the short-lived mutex lock that spicerack uses to write its own locks ensuring no-one is writing locks at the same time, not the actual spicerack locks that we care of. And those mutex do quite some write traffic so it does make sense to skip them in replication if that's easy and we think it wouldn't create replication issues in the future.

Thank you both for the follow-up, and apologies for the imprecision re: which locks :)

Great, I'll send out that first patch and get started figuring out the switchover plan.

Also it looks like it should be pretty easy to add a deny-list pattern or similar for filtering out writes for certain keys (while still advancing the replication index), so we can suppress just /spicerack/locks/etcd. I'll take a look at that today.

Change 1007738 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/etcd-mirror@master] Improve support for mirroring the full keyspace

https://gerrit.wikimedia.org/r/1007738

Coming back to this after pausing for the summit, I wanted to follow up on my comment that these benchmarks are measuring ideal throughput (not accounting for, e.g., network delays).

Reminder: etcd v2 watch uses long-polling, in contrast to v3 which is streaming. In addition, the poll -> write cycle is synchronous on the etcdmirror event loop (i.e., we're not using non-blocking clients, so we can't overlap the communication cost).

Thus (if urllib3 is reusing TCP connections, which it should) practical peak is ~ 1 / (site-RTT + local-cluster write latency). Back-of-the-envelope:

  • steady-state codfw-eqiad RTT is ~ 30ms (not sure what this looks like in failure cases)
  • p95 etcdmirror write latency is ~ 10ms [0] (and p50 is somewhere between 5 and 10ms)

... which puts us at ~ 25 put/s peak.

There are two points I'd like to highlight:

  1. while this is far less than ideal peak, it's still > 10x the peak write rate seen day-to-day (~ 1.8 put/s); and
  2. today, sustained ~ 25 put/s to non-conftool keys in the primary cluster would break replication in ~ 40s anyway.

By #2, what I mean is that expanding the replicated keyspace is still a strict improvement over today - i.e., we're not adding a new risk of breakage in the face of "unprecedented" writes to spicerack locks, but rather giving ourselves a fighting chance of keeping up.

[0] https://phabricator.wikimedia.org/P58806

Following up on the /test key (T358636#9589274) present in eqiad:

  • I can find no references to it (e.g., as probe or something) in operations/puppet.
  • From a (hopefully) representative couple of hours of nginx access logs, I see no requests explicitly touching this key.
  • Going by the modifiedIndex of the key and the current X-Etcd-Index in eqiad, it does not appear to be recently updated, but also appears to not be terribly old [0]:
$ curl -v https://conf1007.eqiad.wmnet:4001/v2/keys/test
...
< X-Etcd-Index: 3093504
...
{"action":"get","node":{"key":"/test","value":"test2","modifiedIndex":2526020,"createdIndex":2526020}}
  • Clearly nothing that might ever attempt to query for this key in codfw (i.e., via the read-only endpoints) should care about it, since it's not replicated.

Given the above, it is likely that this is safe to remove.

Aside: I'm still a bit puzzled as to how this got there, as the action would either need to use the blanket credentials in /root/.etcdrc (i.e., what we load in conftool or spicerack) _or_ step around nginx by being run on the conf host (i.e., against :2379 directly). In either case, this is probably not something we can figure out after the fact.

[0] As a rough estimate based on prometheus metrics, we see ~ 1-5k mutations (DELETE|POST|PUT) per business day. Taking typical numbers over the course of a week, this index difference puts us back at 0.5 - 1 year (setting aside internal operations that also advance the index).

Wow, that was quite an investigation for a /test key, thanks for the thorough analysis. As for the test2 value that could have been me when deploying the spicerack locks. I have in my bash history from the now defunct cumin1001 this for example:

sudo etcdctl -C https://conf1008.eqiad.wmnet:4001 set /test/volans '{"test": "value"}'

that although not the same might have failed if test was a key and not a directory (as it looks like) and I might have retried it with different values.
This just to say that I think it's safe to remove /test.

Change #1008944 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/etcd-mirror@master] Add support for an optional ignored-keys pattern

https://gerrit.wikimedia.org/r/1008944

Change #1008944 merged by jenkins-bot:

[operations/software/etcd-mirror@master] Add support for an optional ignored-keys pattern

https://gerrit.wikimedia.org/r/1008944

Change #1007738 merged by jenkins-bot:

[operations/software/etcd-mirror@master] Improve support for mirroring the full keyspace

https://gerrit.wikimedia.org/r/1007738

Thanks, Riccardo! (both for the follow-up here and code reviews)

Next steps:

  • Release a new etcd-mirror package (a process I can largely follow from the excellent httpbb docs [0]). This should result in no change in behavior, but is clearly something I'll monitor closely.
  • Finalize the switchover plan, which I've tested locally, and socialize the "no cookbooks please" window.
  • Prepare the changes to [1] et al. to make the change (src / dst prefix -> '/', ignored keys -> '/spicerack/locks/etcd(/.*)?', etc.).
  • Make it so.

[0] https://wikitech.wikimedia.org/wiki/Httpbb#Deploying_httpbb

[1] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/etcdmirror/manifests/instance.pp

Change #1017346 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/etcd-mirror@master] Release etcd-mirror 0.0.11

https://gerrit.wikimedia.org/r/1017346

Change #1017358 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/etcd-mirror@master] Bump etcdmirror package version: 0.0.11

https://gerrit.wikimedia.org/r/1017358

Change #1017358 merged by jenkins-bot:

[operations/software/etcd-mirror@master] Bump etcdmirror package version: 0.0.11

https://gerrit.wikimedia.org/r/1017358

Change #1017346 merged by jenkins-bot:

[operations/software/etcd-mirror@master] Release etcd-mirror 0.0.11

https://gerrit.wikimedia.org/r/1017346

For the package release, my plan is to build and upload the 0.0.11-1 package shortly (but ideally not less than 30m) before a "MediaWiki infrastructure" deployment window.

Once the window starts, I'll debdeploy the new package to configcluster hosts and restart etcd-mirror on conf2005 (there is no postinst script, so only at this point will the change take effect).

Doing this during an infrastructure window has the benefit of avoiding at least one common source of "important" conftool updates (appserver depool / repool) during the restart.

I'll still socialize this with folks who might also be performing conftool-affecting operations during that same window (e.g., dbctl).

In the meantime, I've verified locally that the upgrade itself is a noop.

As discussed previously, there are two options to consider:

  • Full reload - i.e., use --reload to fully clear and rewrite the keyspace in codfw.
  • In-place migration - i.e., manually copy the replication index key into its new location.

Full-reload has the benefit of being conceptually simple, but is disruptive to clients. To avoid that, we’d probably want to direct all codfw-associated consumers to eqiad temporarily. That’s non-ideal for two reasons:

  • We need to globally restart confd after updating the SRV records and clearing the DNS caches (T314489).
  • It's somewhat challenging to assess the risk of failing over all clients to eqiad in terms of load (particularly since our largest consumer, confd, uses linearizable reads).

Edit: Actually, it looks like we managed to do the latter successfully as part of T332010 within the last year, so this is _likely_ low risk. That said, I forgot to include PyBal in the above (i.e., restarting all PyBal instances after updating their config to temporarily point to etcd in eqiad).

In-place migration has the benefit of being non-disruptive, but involves a bit more care and requires cleanup of previously non-replicated paths in eqiad before migration (/spicerack) and ensuring they are not updated during the migration.

As long as we can do that cleanup, though, I’d say the in-place path has fewer unknowns (e.g., it is straightforward to perform e2e tests for the switch to increase our confidence that it will indeed go to plan). It also doesn't require _more_ steps / sequencing that the reload approach would.

With that in mind, I believe it would look something like the following:

  1. After verifying no cookbooks are running and locking them out of etcd, delete /spicerack in eqiad.
  2. Set profile::etcd::replication::active: false on conf2005 in hiera, merge, and run puppet-agent on conf2005.

[At this point, replication is stopped, but can be recovered in an emergency by manually starting the systemd unit and disabling puppet-agent.]

  1. Copy /__replication/conftool to /__replication/__ROOT__ in codfw.
  2. Manually start etcd-mirror with its new command-line (new src / dst prefixes, spicerack python-etcd lock keys ignored) and verify correct operation (e.g., depool a host).
  3. Stop manual invocation.

[At this point, we can either revert back to our previous state or proceed forward. If anything was replicated during #4, we’ll need to catch up /__replication/conftool if we want to revert.]

  1. Override ensure => absent on all resources in etcdmirror::instance (except the package, of course), merge, run agent on all configcluster hosts.

[At this point, replication is unconfigured, but can be manually started in a emergency, as in #4.]

  1. Set src and dst paths in hieradata/role/{site}/configcluster.yaml to /, wire in src-ignore-keys-regex (updating both the systemd unit and reload script templates), and flip all resources back to ensure => present. Merge and run puppet-agent on conf2005.
  2. Verify systemd unit script contents on conf2005 one last time (i.e., should be identical to manual invocation in #4).
  3. Set profile::etcd::replication::active: true on conf2005 in hiera, merge, and run puppet-agent on conf2005.
  4. Verify etcd-mirror is up and running as expected.

Cleanup: delete old /__replication/conftool index key in codfw.

Why the ensure dance? Because the src key prefix path is embedded in the resource name (e.g., systemd units). In order to make these resources actually go away, we need them to go through absent.

Are you sure this isn't more work than a full reload? A full reload would look rather similar, just bracketed by DNS changes etc. and minus the /__replication index key modifications.

Did you consider extending the puppet configs to support parallel mirrors? Parallel mirrors (one enabled at a time) would be nice, as we could have both configurations fully ready at the same time, collapsing 6-9 into one step (deleting the old resources left as a cleanup). Once I realized how involved this would be, though, I decided it was not worth pursuing.

Thanks a lot for the detailed plan outline. The plan looks sane to me, I agree that the in-place migration is probably the less risky path.
Just one nit, we need to give plenty of advance notice to avoid long-running scripts that might touch conftool such as long-running cookbooks and long-running DBAs scripts that call dbctl at random times.

Thanks, Riccardo!

That's a great point - I've raised this with the DBAs and will coordinate at least a week out from when I plan to do the actual replication change (T358636#9699378).

For the package update (T358636#9699259), the downtime should be no longer than a "normal" restart [0], so I don't plan any special actions to avoid conftool updates other than sticking to an infrastructure deployment window.

[0] on 0.0.9 or below, this could be up to 90s, but if an event shows up that needs replicated, it'll actually cause the mirror to terminate promptly (the wait loop exits, event is mirrored, and since the reactor is stopped, it exits cleanly rather than continuing the callback chain)

Mentioned in SAL (#wikimedia-operations) [2024-04-10T16:12:14Z] <swfrench-wmf> uploaded etcd-mirror 0.0.11-1 to apt.wikimedia.org (T358636)

Mentioned in SAL (#wikimedia-operations) [2024-04-10T17:37:22Z] <swfrench-wmf> restarting etcd-mirror on conf2005.codfw.wmnet for T358636

Mentioned in SAL (#wikimedia-operations) [2024-04-10T17:46:40Z] <swfrench-wmf> finished updating A:conf hosts to etcd-mirror 0.0.11-1 (T358636)

After asking around a bit, it seems that temporarily directing codfw-associated clients (confd, navtiming, pybal) to eqiad should be relatively straightforward, if involved.

Given that this was done as recently as late last year when codfw was reimaged [0] (note: bits of the runbook may be stale, e.g. host lists), the risk of overloading etcd in eqiad is likely low.

One benefit of going this route is that it reduces time pressure on making the switch (vs. the hitless procedure). It also side-steps coordination surrounding cookbooks, etc.

I'll keep this in mind if that latter turns out to be a challenge.

[0] https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Reimage_cluster

Change #1023554 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] WIP: hieradata: disable etcd replication on conf2005

https://gerrit.wikimedia.org/r/1023554

Change #1023555 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] WIP: etcdmirror::instance: absent all resources

https://gerrit.wikimedia.org/r/1023555

Change #1023556 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] WIP: etcdmirror: reconfigure with full-keyspace replication

https://gerrit.wikimedia.org/r/1023556

Change #1023557 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] WIP: hieradata: reenable etcd replication on conf2005

https://gerrit.wikimedia.org/r/1023557

Change #1023966 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: make etcd in eqiad read-only

https://gerrit.wikimedia.org/r/1023966

Change #1023967 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: return etcd in eqiad to read-write

https://gerrit.wikimedia.org/r/1023967

Scott_French changed the task status from Open to In Progress.EditedApr 24 2024, 10:44 PM

All patches to support the migration described in T358636#9699378 are ready. Many thanks to @Volans for the reviews.

The one minor change:

The plan did not put etcd in read-only mode in eqiad, the primary rationale being that it would introduce additional delays in supporting emergency changes (e.g., urgent requestctl block).

In retrospect, the benefits of read-only (rule out cross-DC divergence entirely, quick feedback if humans try to update etcd in error) outweigh that concern, particularly if the revert to read-write patch is prepared ahead of time and ready to merge.

The only notable change is that, when verifying replication is working (step #4), we'll just use etcdctl locally on an eqiad cluster member to update a test key.

Collection of prepared commands that will be run as part of this: P61226

Mentioned in SAL (#wikimedia-operations) [2024-04-30T17:01:37Z] <swfrench@deploy1002> Locking from deployment [ALL REPOSITORIES]: etcd replication maintenance - T358636

Change #1023966 merged by Scott French:

[operations/puppet@production] hieradata: make etcd in eqiad read-only

https://gerrit.wikimedia.org/r/1023966

Mentioned in SAL (#wikimedia-operations) [2024-04-30T17:03:03Z] <swfrench-wmf> putting etcd in read-only mode for T358636

Change #1023554 merged by Scott French:

[operations/puppet@production] hieradata: disable etcd replication on conf2005

https://gerrit.wikimedia.org/r/1023554

Mentioned in SAL (#wikimedia-operations) [2024-04-30T17:09:52Z] <swfrench-wmf> disabling etcd replication into conf2005 for T358636

Mentioned in SAL (#wikimedia-analytics) [2024-04-30T17:21:29Z] <xcollazo> aborting deploy of refinery due to scap global lock held by T358636. Will attempt again in about an hour.

Change #1023555 merged by Scott French:

[operations/puppet@production] etcdmirror::instance: absent all resources

https://gerrit.wikimedia.org/r/1023555

Change #1023556 merged by Scott French:

[operations/puppet@production] etcdmirror: reconfigure with full-keyspace replication

https://gerrit.wikimedia.org/r/1023556

Change #1023557 merged by Scott French:

[operations/puppet@production] hieradata: reenable etcd replication on conf2005

https://gerrit.wikimedia.org/r/1023557

Change #1023967 merged by Scott French:

[operations/puppet@production] hieradata: return etcd in eqiad to read-write

https://gerrit.wikimedia.org/r/1023967

Mentioned in SAL (#wikimedia-operations) [2024-04-30T17:54:50Z] <swfrench-wmf> putting etcd back in read-write mode for T358636

Mentioned in SAL (#wikimedia-operations) [2024-04-30T17:56:48Z] <swfrench@deploy1002> Unlocked for deployment [ALL REPOSITORIES]: etcd replication maintenance - T358636 (duration: 55m 11s)

Well, that seems to have gone off without significant issue. Many thanks to @Volans and @RLazarus for all of your help.

A couple of observations and / or lessons learned:

Deployments: While not mentioned in the original plan, we decided to take the scap lock just prior to entering RO (preferable to block all deploys outright, including k8s-only ones for which this isn't relevent, rather than letting them fail later on).

Silences: Somehow, I forgot that the EtcdReplicationDown alert is paging, and thus did not think to silence it. Boo.

Manual write operations: The intention was to use etcdctl for manual operations against the kv store. This includes cleaning up /spicerack in eqiad, populating the new replication index in codfw, and issuing test puts/deletes to verify replication (see P61226).

I expected that hitting :2379 directly on an etcd node via -C would work. However, due to the way etcdctl defers to the advertised client URLs (which provide the nginx-proxied :4001 endpoints), writes are blocked.

Instead, we had to use the REST API directly, similar to what we happened to do by default in the original incident.

Choice of initial index: There is another subtle issue in how we chose to populate the new replication index: simply copying /__replication/conftool to /__replication/__ROOT__ is not guaranteed to work.

Specifically, if any operations that assert a prior state (e.g., a delete, which requires that the deleted object exists) have occurred in eqiad under non-/conftool paths _after_ the last successfully replicated /conftool operation, then they will not cleanly apply when "replayed" to codfw.

Indeed, we _created_ such operations when we cleaned up /spicerack in eqiad, and ran into this the first time we started replication under the new configuration.

Instead, the correct sequence of operations would be:

  • Make eqiad read-only.
  • Stop replication (once fully caught up to any in-/conftool changes, of course).
  • Clean up /spicerack and note the returned X-Etcd-Index header.
  • Set /__replication/__ROOT__ to the header value.
  • Start replication again.

This would safely skip over the ops pertaining to keys that don't exist in codfw, and this is what we did on our second attempt to restart replication.

In summary, nothing terribly exciting to report - these were all pretty straightforward to overcome.

I've updated https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster to reflect the new state of the world, while also fixing up some out-of-date parts and noting some of the lessons learned (e.g., the etcdctl behavior noted in T358636#9759478).

With that, I think we can close this out, as no other actions are tracked here.

T350565 will have an update soon on the proposed migration to the v3 API, which as noted in the description, will replace etcdmirror with etcdctl mirror-maker.