Issue
This evening, the etcdmirror service on conf2005 failed after it hit the case described in [0] (and the Note just above that heading).
At 01:39:44 UTC, the service failed with "CRITICAL: The current replication index is not available anymore in the etcd source cluster." The last mirrored event prior to that was at 00:00:26 UTC (index: 3020127).
We inspected the replication index key like so:
$ curl https://conf2005.codfw.wmnet:4001/v2/keys/__replication
{"action":"get","node":{"key":"/__replication","dir":true,"nodes":[{"key":"/__replication/conftool","value":"3020127","modifiedIndex":5150719,"createdIndex":4271}],"modifiedIndex":2140,"createdIndex":2140}}
and confirmed (1) it matched the index from the last event in the logs and (2) it was just a bit more than 1000 events behind the latest X-Etcd-Index returned by conf1009 (then 3021261).
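As a rough illustration of check (2), the following Python sketch parses the response body shown above and computes how far the saved index trails the source cluster. The function names are hypothetical, and treating a lag of 1000 or more as "outside the window" is a simplifying assumption rather than a statement about etcd's exact boundary.

```python
import json

# Response body copied from the curl output in this report.
REPLICATION_RESPONSE = (
    '{"action":"get","node":{"key":"/__replication","dir":true,"nodes":'
    '[{"key":"/__replication/conftool","value":"3020127",'
    '"modifiedIndex":5150719,"createdIndex":4271}],'
    '"modifiedIndex":2140,"createdIndex":2140}}'
)

RETENTION_WINDOW = 1000  # number of events etcd v2 retains, per this report


def replication_index(body, key="/__replication/conftool"):
    """Extract the saved replication index for `key` from a v2 directory listing."""
    node = json.loads(body)["node"]
    for child in node.get("nodes", []):
        if child["key"] == key:
            return int(child["value"])
    raise KeyError(key)


def replication_lag(body, source_etcd_index):
    """Number of events between the saved index and the source's X-Etcd-Index."""
    return source_etcd_index - replication_index(body)


# 3021261 is the X-Etcd-Index reported by conf1009 at the time.
lag = replication_lag(REPLICATION_RESPONSE, 3021261)
print(lag)                      # 1134
print(lag >= RETENTION_WINDOW)  # True (assumed threshold, see lead-in)
```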
We recovered by manually advancing /__replication/conftool to 3020127 + 999 = 3021126 (i.e., just prior to the point at which we fell off the 1000-event retention window) and restarting the service. Advancing only this far should not allow for lost events; in contrast, advancing all the way to the then-current X-Etcd-Index could have.
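The arithmetic behind the recovery value can be sketched as follows. This is only an illustration of the numbers in this report: the helper names are hypothetical, and the "strictly less than 1000 behind" condition is an assumption chosen to match the + 999 used above.

```python
RETENTION_WINDOW = 1000  # number of events etcd v2 retains, per this report


def recovery_index(last_mirrored, window=RETENTION_WINDOW):
    """Furthest index we can advance to while staying within `window - 1`
    events of the last mirrored event."""
    return last_mirrored + window - 1


def within_window(index, source_etcd_index, window=RETENTION_WINDOW):
    """True if `index` trails the source index by fewer than `window` events
    (assumed boundary, see lead-in)."""
    return source_etcd_index - index < window


target = recovery_index(3020127)
print(target)                            # 3021126
print(within_window(target, 3021261))    # True: 135 events behind
print(within_window(3020127, 3021261))   # False: 1134 events behind
```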
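We believe this happened because an unusually quiet period in the /conftool keyspace coincided with an unusually busy period outside it (e.g., spicerack locks). The replication index only advances when an event is mirrored, whereas the 1000-event retention window covers all keys in the source cluster, so enough writes elsewhere can push the saved index out of the window.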
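The failure mode can be reproduced in a toy model. The sketch below is not etcdmirror code: it assumes a mirror that only advances its saved index on keys under one prefix, and treats the saved index as lost once it trails the cluster index by 1000 or more events. The key names are hypothetical.

```python
def first_stale_index(event_keys, prefix="/conftool", window=1000):
    """Return the cluster index at which the saved replication index first
    falls `window` or more events behind, or None if it never does.

    `event_keys` is the ordered list of keys written to the source cluster;
    each write increments the cluster-wide index by one.
    """
    last_mirrored = 0
    for etcd_index, key in enumerate(event_keys, start=1):
        if key.startswith(prefix):
            # Event is inside the mirrored keyspace: the saved index advances.
            last_mirrored = etcd_index
        elif etcd_index - last_mirrored >= window:
            # Only unmirrored writes for too long: saved index is now stale.
            return etcd_index
    return None


# One /conftool write followed by a long burst of writes elsewhere.
quiet_then_busy = ["/conftool/v1/pools/a"] + ["/spicerack/locks/x"] * 1200
# Bursts elsewhere, but with a /conftool write every 501 events.
interleaved = (["/spicerack/locks/x"] * 500 + ["/conftool/v1/pools/a"]) * 3

print(first_stale_index(quiet_then_busy))  # 1001
print(first_stale_index(interleaved))      # None
```

In this model, replicating (nearly) the entire keyspace corresponds to a prefix that matches almost every write, so the saved index keeps pace with the cluster index.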
Next steps
In the near term, we should harden our use of etcdmirror to prevent a recurrence.
Options considered:
- Extend etcdmirror to use X-Etcd-Index to automatically recover from this situation.
- Reconfigure etcdmirror to replicate (nearly) the entire keyspace.
After investigating both options, the second (replicating nearly the entire keyspace) is far simpler, and is in fact preferable to the current state (i.e., it is good to have spicerack lock state replicated). Thus, it is the current plan of record (PoR).
In the long term (however long it takes to get fully off the v2 API), etcdmirror will be replaced with a yet-to-be-determined solution that supports v3 (most likely etcdctl make-mirror).
[0] https://etcd.io/docs/v2.3/api/#watch-from-cleared-event-index
