
etcdmirror does not recover from a cleared waitIndex
Open, Medium, Public

Description

This evening, the etcdmirror service on conf2005 failed after it hit the case described in [0] (and the Note just above that heading).

At 01:39:44 UTC, the service failed with "CRITICAL: The current replication index is not available anymore in the etcd source cluster". The last mirrored event prior to that was at 00:00:26 UTC (index: 3020127).

We inspected the replication index key like so:

$ curl https://conf2005.codfw.wmnet:4001/v2/keys/__replication
{"action":"get","node":{"key":"/__replication","dir":true,"nodes":[{"key":"/__replication/conftool","value":"3020127","modifiedIndex":5150719,"createdIndex":4271}],"modifiedIndex":2140,"createdIndex":2140}}

and confirmed (1) it matched the index from the last event in the logs and (2) it was just a bit more than 1000 events behind the latest X-Etcd-Index returned by conf1009 (then 3021261).

We recovered by manually advancing /__replication/conftool to 3020127 + 999 = 3021126 (i.e., just before the point at which we fell off the 1000-event retention window), since this should not allow any events to be lost (whereas advancing all the way to the then-current X-Etcd-Index could have), and then restarting the service.
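For concreteness, here is a hedged sketch of that recovery (not necessarily the exact commands run at the time): read the current X-Etcd-Index from the source cluster, then write the advanced value to the replication index key on the destination before restarting the service.

$ curl -s -D - -o /dev/null https://conf1009.eqiad.wmnet:4001/v2/keys/conftool | grep -i x-etcd-index
$ curl -s -X PUT https://conf2005.codfw.wmnet:4001/v2/keys/__replication/conftool -d value=3021126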

We believe this happened due to the combination of an unusually quiet period in the /conftool keyspace and an unusually busy period outside it (e.g., spicerack locks).

In any case, in the near term we should look into whether etcdmirror can be improved to use X-Etcd-Index to recover from this situation automatically.

In the long term (however long it takes to get fully off the v2 API), etcdmirror will be replaced with some other solution TBD that supports v3.

[0] https://etcd.io/docs/v2.3/api/#watch-from-cleared-event-index

Event Timeline

That etcdmirror is mirroring only the /conftool keys is total news to me; I assumed it was replicating the whole content of etcd. But indeed it does not:

$ sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 ls  /
/conftool
/spicerack
/test

vs

$ sudo etcdctl --endpoints https://conf2005.codfw.wmnet:4001 ls  /
/conftool

Is there any specific reason we shouldn't replicate it all? If we replicated everything, this issue should never happen, right?
Spicerack follows the SRV records for the endpoint, so if that gets failed over to codfw, spicerack should follow, and I would have expected to find the locks replicated there too.

Is there any specific reason we shouldn't replicate it all? If we replicated everything, this issue should never happen, right?

No reason. Quite simply, at the time there was nothing beyond conftool to replicate, and we wanted to leave space for cluster-local usage of etcd.

If we need to replicate /spicerack as well, we can change replication or spin up a second instance.

Thanks, Riccardo. Yes, indeed - this particular issue should generally not happen if the entire keyspace is mirrored (IIRC, there are non-keyspace events that can advance the index, but (a) I might be misremembering and (b) those are presumably administrative / rare).

It also makes sense that having the spicerack locks replicated to the follower site would make a failover safer for cookbooks.

I suspect the main concern would be increased load on the mirror, but I don't have a good intuition for whether that's a practical concern. We can certainly load test that, though.

Speaking of which, I still need to confirm from the nginx access logs that our explanation (spicerack lock events) holds up. I plan to do that this morning. (I need those kinds of numbers anyway to calibrate benchmarks for the v3 migration)

But a second instance wouldn't prevent the current issue, right?

Rather than trying to pull access logs off the conf hosts (as they're rather large, and I'd like to avoid stressing them), I just ended up looking at the etcd grafana dashboard instead: https://grafana.wikimedia.org/goto/NsEfMXTSz?orgId=1

Of interest is the aggregated requests by method panel (attached).

Pulling down the data behind that panel and spot-checking the number of (PUT, POST, DELETE) events between 00:00:26 UTC and 01:39:44 UTC, we end up at ~ 1000 as expected (a bit short, which makes sense since the time buckets don't cleanly align with our window).

What's interesting about the graph is that the upticks look close to what I'd expect for spicerack locks - e.g., roughly balanced POST/DELETE most of the time (this is the queueing mutex in python-etcd, so we expect paired ops except when the key TTLs out).

I'd also expect DELETE to slightly exceed POST, since we also clear keys when their KeyLocks structure is empty (no spicerack locks held), which we see here too.

The fact that POST/DELETE significantly outpaces PUT much of the time is interesting too, and suggests contention - i.e., not all etcd lock acquisitions result in spicerack lock acquisition ... though this can also be explained to an extent by the fact that not all lock acquisitions result in a PUT, as noted above.

In any case, this is enough for me to generally say "looks about right" :)

T358636_updates.png (304×622 px, 38 KB)

After thinking about this a bit more today, I think I'm onboard with the idea of expanding replication to include /spicerack.

If we're comfortable with that being the default for the entire keyspace (i.e., even for new workloads) then that's a fairly straightforward config change [0]. If for some reason we want to limit replication to _just_ /spicerack and /conftool, that's a still-fairly straightforward code change on top of that.

I do need to think a little bit about what a failover [1] would look like for spicerack lock state. Naively, as long as the global-read-only period is longer than the etcd lock TTL [2] (which it certainly will be if the latter stays at 15s), we should end up in the happy state where no possibly-confusing lock keys exist in the failed-to cluster when it becomes the primary [3]. The KeyLocks payloads (the important part) would of course still be there.

I also gave some thought to what a general solution for this issue (recovery from the wait index being cleared from etcd's event history) in etcd-mirror might look like, and it gets tricky pretty quickly for non-trivial cases.

[0] Though it would be a full reload, so I'm wondering if there are precautions we should take (e.g., pointing codfw clients at eqiad temporarily).

[1] https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Master_cluster_switchover

[2] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/1840a8fce79c45a35c3b000b31a5a861a577d181/spicerack/locking.py#323

[3] I say possibly confusing, because the underlying functionality used to generate lock waiter keys (https://etcd.io/docs/v2.3/api/#atomically-creating-in-order-keys) is based on the *cluster-local* etcd index, which might be wildly different between sites.

In the meantime, I remembered why we were only replicating the /conftool keyspace:

in the past, we had another prefix called /eventlogging that was used only in eqiad and was expected to be potentially busy.

Given the flimsy nature of etcdmirror, and the fact that eventlogging data was eqiad-only, we restricted replication.

But there is also another consideration: we know the write rate from conftool is generally quite low, low enough not to make etcdmirror lag significantly. There is a risk that, in order to have a backup of etcd data for reimages and cookbooks, we increase the replication lag of something that is fundamental to running the live website. As @Scott_French already noted, it's not obvious what a failover scenario would look like in that case.

Is uniformity worth the risk? For sure, we'll need to do some load testing as @Scott_French mentioned.

If we're comfortable with that being the default for the entire keyspace (i.e., even for new workloads) then that's a fairly straightforward config change [0].

We control what goes into etcd quite closely.

As for changing the replication: one way to do it that doesn't require a full reload is to do the following

  • Set the main etcd cluster in read-only mode
  • Dump the /spicerack keyspace from eqiad and load it into codfw [1]
  • Modify the replication metadata

but I think a depool and reload of all data is the better option.

One thing we should consider and test is whether, when writing etcdmirror, I was smart enough to explicitly consider the case where we replicate the whole root keyspace. I would imagine I did not, which would mean, for instance, that the replication index key would be invalid. So we might still need a small patch to etcd-mirror.

[1] We could create an etcdmirror configuration just for /spicerack replication and start it with --reload, or use any dump tool.

Of course I don't want to add additional lag to any live-traffic data (pybal, mwconfig, dbctl), and if we think that adding spicerack locks to the replication might cause that, let's find another solution. For example, we could have an etcd failover cookbook that, when run, reads the active locks from the primary cluster and explicitly replicates them to the secondary one. Or any other viable option.

That said, if we add it to the replication, there is no need to overcomplicate things. If done at the right time (like right now, when there are no cookbooks in flight), there is no need to do anything special. We can also "restart" from scratch: we could even rm -rf /spicerack entirely right before enabling replication for it, if that simplifies things. There is no persistent data.

The current data is only directories:

$ sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 ls -r /spicerack
/spicerack/locks
/spicerack/locks/custom
/spicerack/locks/etcd
/spicerack/locks/modules
/spicerack/locks/cookbooks

We control what goes into etcd quite closely.

Ah, that is a good point: changes in workload are strictly limited to what we already (or will in the future) allow by ACL.

One thing we should consider and test is ...

Also a great point. So, from some local testing earlier today, mirroring the full keyspace does "work" without any modification to etcdmirror.

However, that's a result of etcd's canonicalization of paths with superfluous slashes, which works by returning a 301 to the correct path, a redirect that python-etcd follows.
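To see the redirect behavior described above, a quick check along these lines (a sketch; any superfluous slash in the key path should do) shows the status code and redirect target without following it:

$ curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' 'https://conf1007.eqiad.wmnet:4001/v2/keys//conftool'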

To avoid the redirects, we'll need to adjust the Etcd2Writer a bit. I have a patch for this that I'm happy to send out.

Next steps

Alright, it sounds like the next step here is to design a load test for etcdmirror, geared toward assessing the replication latency impact on a hypothetical "rare but critical" op in the presence of a high background rate of "less-critical" ops.

I'll give that some thought this afternoon.

If we decide we're comfortable with the outcome, then opening up the full keyspace could happen either with a reload or some combination of what we've discussed above.

For example, if we're able to ensure there are no cookbooks running, as Riccardo observed earlier today, then:

  1. making eqiad RO
  2. deleting /spicerack in eqiad
  3. stopping replication in codfw
  4. copying /__replication/conftool to /__replication in codfw [0]
  5. restarting replication in codfw for the full keyspace
  6. making eqiad RW again

would probably be fine.

Not sure about this /test key in eqiad ... is that actually used for anything? If not, we should remove it if we're not going to do a full reload, as certain types of operations on it in the future will break replication.

[0] Or something else if we decide to change the replication index key structure in etcdmirror (e.g., make it so mirroring / corresponds to /__replication/idx or something).

Last Friday, I put together a simple stress test for etcd-mirror, with the goal of measuring replication delay under a range of background write rates from an "antagonist" workload. Test setup and results: https://phabricator.wikimedia.org/P58402.

In short, mean replication delay stays around ~ 4ms and tail latency remains predictably low well beyond a background write rate of 100 put/s - i.e., ~ 50x the typical peak write rate seen day-to-day in production (~ 1.8 put/s).

Tail latency degrades noticeably past ~ 500 put/s and replication fails around 700 put/s (where replication delay crosses the event history limit of 1000 events).

The most notable caveat of this test is that it's optimistic: it's measuring ideal / peak replication throughput, as (1) there is no inter-DC network delay and (2) no delay from consensus rounds among etcd nodes (though disk I/O is still a factor).

Even so, I think the margins are wide enough here that we should be able to easily accommodate replicating spicerack lock state (with the previously noted caveat that some portions of that state cannot be meaningfully interpreted across clusters, but should TTL out quickly).

If folks are comfortable moving forward with full-keyspace mirroring given the above, I'll get started planning that out (and send off the patch I mentioned previously).

@Scott_French great job, indeed I think we can just go forward and do it, even if there are a few things that look suboptimal.

This got me thinking: if we're not really interested in what is in /spicerack, we could add to etcdmirror the ability to watch a keyspace but ignore some sub-keyspaces in replication.

This would solve the 1k-event-window problem and also eliminate the problem of replicating keys that don't matter or won't really work when replicated.

This got me thinking: if we're not really interested in what is in /spicerack, we could add to etcdmirror the ability to watch a keyspace but ignore some sub-keyspaces in replication.

The only part we could skip replicating is /spicerack/locks/etcd, not the whole of /spicerack. We're interested in the content of /spicerack in general ;)
If for any reason we fail over etcd from eqiad to codfw and we don't mirror its content, we risk running concurrent cookbooks that should not run concurrently. And not all of them are short-lived; some run for a week...

What Scott was referring to in prior comments is the short-lived mutex lock that spicerack uses when writing its own locks (ensuring no one is writing locks at the same time), not the actual spicerack locks that we care about. Those mutexes generate quite a lot of write traffic, so it does make sense to skip them in replication if that's easy and we think it wouldn't create replication issues in the future.

Thank you both for the follow-up, and apologies for the imprecision re: which locks :)

Great, I'll send out that first patch and get started figuring out the switchover plan.

Also it looks like it should be pretty easy to add a deny-list pattern or similar for filtering out writes for certain keys (while still advancing the replication index), so we can suppress just /spicerack/locks/etcd. I'll take a look at that today.
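As a rough illustration of the kind of match involved (treat this as a sketch; the exact pattern and its anchoring are whatever ends up in the etcd-mirror change):

$ echo '/spicerack/locks/etcd/some-lock-key' | grep -qE '^/spicerack/locks/etcd(/.*)?$' && echo 'would be ignored'
would be ignored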

Change 1007738 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/etcd-mirror@master] Improve support for mirroring the full keyspace

https://gerrit.wikimedia.org/r/1007738

Coming back to this after pausing for the summit, I wanted to follow up on my comment that these benchmarks are measuring ideal throughput (not accounting for, e.g., network delays).

Reminder: etcd v2 watch uses long-polling, in contrast to v3 which is streaming. In addition, the poll -> write cycle is synchronous on the etcdmirror event loop (i.e., we're not using non-blocking clients, so we can't overlap the communication cost).

Thus (if urllib3 is reusing TCP connections, which it should), the practical peak is ~ 1 / (site RTT + local-cluster write latency). Back-of-the-envelope:

  • steady-state codfw-eqiad RTT is ~ 30ms (not sure what this looks like in failure cases)
  • p95 etcdmirror write latency is ~ 10ms [0] (and p50 is somewhere between 5 and 10ms)

... which puts us at roughly 1 / (0.030 s + 0.010 s) = 25 put/s peak.

There are two points I'd like to highlight:

  1. while this is far less than ideal peak, it's still > 10x the peak write rate seen day-to-day (~ 1.8 put/s); and
  2. today, sustained ~ 25 put/s to non-conftool keys in the primary cluster would break replication in ~ 40s anyway (the 1000-event history window divided by 25 put/s).

By #2, what I mean is that expanding the replicated keyspace is still a strict improvement over today - i.e., we're not adding a new risk of breakage in the face of "unprecedented" writes to spicerack locks, but rather giving ourselves a fighting chance of keeping up.

[0] https://phabricator.wikimedia.org/P58806

Following up on the /test key (T358636#9589274) present in eqiad:

  • I can find no references to it (e.g., as a probe or similar) in operations/puppet.
  • From a (hopefully) representative couple of hours of nginx access logs, I see no requests explicitly touching this key.
  • Going by the modifiedIndex of the key and the current X-Etcd-Index in eqiad, it does not appear to be recently updated, but also appears to not be terribly old [0]:
$ curl -v https://conf1007.eqiad.wmnet:4001/v2/keys/test
...
< X-Etcd-Index: 3093504
...
{"action":"get","node":{"key":"/test","value":"test2","modifiedIndex":2526020,"createdIndex":2526020}}
  • Clearly nothing that might ever attempt to query for this key in codfw (i.e., via the read-only endpoints) should care about it, since it's not replicated.

Given the above, it is likely that this is safe to remove.

Aside: I'm still a bit puzzled as to how this got there, as the action would either need to use the blanket credentials in /root/.etcdrc (i.e., what we load in conftool or spicerack) _or_ step around nginx by being run on the conf host (i.e., against :2379 directly). In either case, this is probably not something we can figure out after the fact.

[0] As a rough estimate based on prometheus metrics, we see ~ 1-5k mutations (DELETE|POST|PUT) per business day. Taking typical numbers over the course of a week, this index difference (~ 570k) puts us back 0.5 - 1 year (setting aside internal operations that also advance the index).

Wow, that was quite an investigation for a /test key, thanks for the thorough analysis. As for the test2 value, that could have been me when deploying the spicerack locks. For example, I have this in my bash history from the now-defunct cumin1001:

sudo etcdctl -C https://conf1008.eqiad.wmnet:4001 set /test/volans '{"test": "value"}'

which, although not the same, might have failed if /test was a key and not a directory (as it appears to be), and I might have retried it with different values.
This is just to say that I think it's safe to remove /test.
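If/when we do remove it, and assuming it stays a plain key (not a directory), something along these lines should be all that's needed:

$ sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 rm /test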

Change #1008944 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/etcd-mirror@master] Add support for an optional ignored-keys pattern

https://gerrit.wikimedia.org/r/1008944

Change #1008944 merged by jenkins-bot:

[operations/software/etcd-mirror@master] Add support for an optional ignored-keys pattern

https://gerrit.wikimedia.org/r/1008944

Change #1007738 merged by jenkins-bot:

[operations/software/etcd-mirror@master] Improve support for mirroring the full keyspace

https://gerrit.wikimedia.org/r/1007738

Thanks, Riccardo! (both for the follow-up here and code reviews)

Next steps:

  • Release a new etcd-mirror package (a process I can largely follow from the excellent httpbb docs [0]). This should result in no change in behavior, but is clearly something I'll monitor closely.
  • Finalize the switchover plan, which I've tested locally, and socialize the "no cookbooks please" window.
  • Prepare the changes to [1] et al. to make the change (src / dst prefix -> '/', ignored keys -> '/spicerack/locks/etcd(/.*)?', etc.).
  • Make it so.

[0] https://wikitech.wikimedia.org/wiki/Httpbb#Deploying_httpbb

[1] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/etcdmirror/manifests/instance.pp

Change #1017346 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/etcd-mirror@master] Release etcd-mirror 0.0.11

https://gerrit.wikimedia.org/r/1017346

Change #1017358 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/etcd-mirror@master] Bump etcdmirror package version: 0.0.11

https://gerrit.wikimedia.org/r/1017358

Change #1017358 merged by jenkins-bot:

[operations/software/etcd-mirror@master] Bump etcdmirror package version: 0.0.11

https://gerrit.wikimedia.org/r/1017358

Change #1017346 merged by jenkins-bot:

[operations/software/etcd-mirror@master] Release etcd-mirror 0.0.11

https://gerrit.wikimedia.org/r/1017346

For the package release, my plan is to build and upload the 0.0.11-1 package shortly (but ideally not less than 30m) before a "MediaWiki infrastructure" deployment window.

Once the window starts, I'll debdeploy the new package to configcluster hosts and restart etcd-mirror on conf2005 (there is no postinst script, so only at this point will the change take effect).

Doing this during an infrastructure window has the benefit of avoiding at least one common source of "important" conftool updates (appserver depool / repool) during the restart.

I'll still socialize this with folks who might also be performing conftool-affecting operations during that same window (e.g., dbctl).

In the meantime, I've verified locally that the upgrade itself is a noop.

As discussed previously, there are two options to consider:

  • Full reload - i.e., use --reload to fully clear and rewrite the keyspace in codfw.
  • In-place migration - i.e., manually copy the replication index key into its new location.

A full reload has the benefit of being conceptually simple, but is disruptive to clients. To avoid that, we’d probably want to direct all codfw-associated consumers to eqiad temporarily. That’s non-ideal for two reasons:

  • We need to globally restart confd after updating the SRV records and clearing the DNS caches (T314489).
  • It's somewhat challenging to assess the risk of failing over all clients to eqiad in terms of load (particularly since our largest consumer, confd, uses linearizable reads).

Edit: Actually, it looks like we managed to do the latter successfully as part of T332010 within the last year, so this is _likely_ low risk. That said, I forgot to include PyBal in the above (i.e., restarting all PyBal instances after updating their config to temporarily point to etcd in eqiad).

In-place migration has the benefit of being non-disruptive, but involves a bit more care: it requires cleaning up previously non-replicated paths in eqiad before migration (/spicerack) and ensuring they are not updated during the migration.

As long as we can do that cleanup, though, I’d say the in-place path has fewer unknowns (e.g., it is straightforward to perform e2e tests for the switch to increase our confidence that it will indeed go to plan). It also doesn't require _more_ steps / sequencing than the reload approach would.

With that in mind, I believe it would look something like the following:

  1. After verifying no cookbooks are running and locking them out of etcd, delete /spicerack in eqiad.
  2. Set profile::etcd::replication::active: false on conf2005 in hiera, merge, and run puppet-agent on conf2005.

[At this point, replication is stopped, but can be recovered in an emergency by manually starting the systemd unit and disabling puppet-agent.]

  3. Copy /__replication/conftool to /__replication/__ROOT__ in codfw [0] (see the sketch after this plan).
  4. Manually start etcd-mirror with its new command line (new src / dst prefixes, spicerack python-etcd lock keys ignored) and verify correct operation (e.g., depool a host).
  5. Stop the manual invocation.

[At this point, we can either revert back to our previous state or proceed forward. If anything was replicated during #4, we’ll need to catch up /__replication/conftool if we want to revert.]

  6. Override ensure => absent on all resources in etcdmirror::instance (except the package, of course), merge, and run the agent on all configcluster hosts.

[At this point, replication is unconfigured, but can be manually started in an emergency, as in #4.]

  7. Set src and dst paths in hieradata/role/{site}/configcluster.yaml to /, wire in src-ignore-keys-regex (updating both the systemd unit and reload script templates), and flip all resources back to ensure => present. Merge and run puppet-agent on conf2005.
  8. Verify the systemd unit script contents on conf2005 one last time (i.e., they should be identical to the manual invocation in #4).
  9. Set profile::etcd::replication::active: true on conf2005 in hiera, merge, and run puppet-agent on conf2005.
  10. Verify etcd-mirror is up and running as expected.

Cleanup: delete old /__replication/conftool index key in codfw.
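For step 3, a hedged sketch of what the copy could look like (jq used here only for brevity; the destination key name assumes we keep /__replication/__ROOT__ as above):

$ idx=$(curl -s https://conf2005.codfw.wmnet:4001/v2/keys/__replication/conftool | jq -r .node.value)
$ curl -s -X PUT https://conf2005.codfw.wmnet:4001/v2/keys/__replication/__ROOT__ -d value="$idx"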

Why the ensure dance? Because the src key prefix path is embedded in the resource name (e.g., systemd units). In order to make these resources actually go away, we need them to go through absent.

Are you sure this isn't more work than a full reload? A full reload would look rather similar, just bracketed by DNS changes etc. and minus the /__replication index key modifications.

Did you consider extending the puppet configs to support parallel mirrors? Parallel mirrors (one enabled at a time) would be nice, as we could have both configurations fully ready at the same time, collapsing 6-9 into one step (deleting the old resources left as a cleanup). Once I realized how involved this would be, though, I decided it was not worth pursuing.

Thanks a lot for the detailed plan outline. The plan looks sane to me; I agree that the in-place migration is probably the less risky path.
Just one nit: we need to give plenty of advance notice to avoid long-running scripts that might touch conftool, such as long-running cookbooks and long-running DBA scripts that call dbctl at random times.

Thanks, Riccardo!

That's a great point - I've raised this with the DBAs and will coordinate at least a week out from when I plan to do the actual replication change (T358636#9699378).

For the package update (T358636#9699259), the downtime should be no longer than a "normal" restart [0], so I don't plan any special actions to avoid conftool updates other than sticking to an infrastructure deployment window.

[0] On 0.0.9 or below, this could be up to 90s, but if an event shows up that needs to be replicated, it will actually cause the mirror to terminate promptly (the wait loop exits, the event is mirrored, and since the reactor is stopped, it exits cleanly rather than continuing the callback chain).

Mentioned in SAL (#wikimedia-operations) [2024-04-10T16:12:14Z] <swfrench-wmf> uploaded etcd-mirror 0.0.11-1 to apt.wikimedia.org (T358636)

Mentioned in SAL (#wikimedia-operations) [2024-04-10T17:37:22Z] <swfrench-wmf> restarting etcd-mirror on conf2005.codfw.wmnet for T358636

Mentioned in SAL (#wikimedia-operations) [2024-04-10T17:46:40Z] <swfrench-wmf> finished updating A:conf hosts to etcd-mirror 0.0.11-1 (T358636)