Page MenuHomePhabricator

confd's watch functionality appears to be partially broken when interacting with etcd 3.x
Closed, ResolvedPublic

Description

confd is still reading data from etcd using the v2 datastore, so it's calling

https://{host}:4001/v2/keys/{key path}?watch=true

it appears that if we change multiple keys in rapid sequence (this happens when confctl changes multiple records in one command), confd will only see the first one changing and miss the next updates.

This means, amongst other things, that we could miss critical changes to the discovery dns, or even to the varnish backends.

As a stopgap solution I propose we switch all instances from using watch to have a polling interval of 3 seconds, while we figure out what's wrong.

Event Timeline

Joe triaged this task as High priority.Aug 20 2020, 9:46 AM

Adding traffic as their systems are the ones affected.

Change 621484 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] confd: disable -watch for machines connected to etcd 3.x

https://gerrit.wikimedia.org/r/621484

Change 621484 merged by Giuseppe Lavagetto:
[operations/puppet@production] confd: disable -watch for machines connected to etcd 3.x

https://gerrit.wikimedia.org/r/621484

For the record, the problem is more general, and also affects servers connecting to etcd 2.x - the watch functionality seem to be completely broken in confd when dealing with fast, subsequent updates.

Change 622314 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] confd: use -interval 3 as a lower bound in all datacenters

https://gerrit.wikimedia.org/r/622314

Change 622314 merged by Giuseppe Lavagetto:
[operations/puppet@production] confd: use -interval 3 as a lower bound in all datacenters

https://gerrit.wikimedia.org/r/622314

After more digging, it seems the problem always existed, and it has to do with how confd watches for changes.

Specifically you can see here:

https://github.com/kelseyhightower/confd/blob/34a6ce8897ab3bde10f49c30c815fe496d592860/backends/etcd/client.go#L118

that if a waitIndex is present, it's ignored in the first request at line 127 that creates the watcher. This means that if WatchPrefix returns and the result is processed in a time that exceeds the time it takes to etcd to write the next value, the next iteration of WatchPrefix will start anew from the latest update, losing whatever happened in the meanwhile.

I'm convinced this could be patched relatively easily - probably not in time for the switchover, but we should definitely try to fix this.

Change 623228 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/software/spicerack@master] dnsdisc: change retry logic

https://gerrit.wikimedia.org/r/623228

Change 623228 merged by Volans:
[operations/software/spicerack@master] dnsdisc: change retry logic

https://gerrit.wikimedia.org/r/623228

After today's failure of the check_ttl step in the switchdc of the services, I had a chat with Brandon on the gdnsd side and it seems that there shouldn't be any different behaviour in the handling of the TTL vs the UP/DOWN state.
After that I took a look at the logs in the different involved hosts.

On cumin1001 the failure was due to:

2020-08-31 14:12:08,407 rzl 8957 [DEBUG dnsdisc.py:209 in resolve] [authdns2001.wikimedia.org] search -> 10.2.2.30 TTL 300

where we were expecting a TTL of 10.

On authdns1001, the confd changes were applied all within 1s at 14:11:40, while on authdns2001 the changes started at 14:11:40 but the last one was at 14:12:07, a whole 27s after the first one.

Looking at conf2002 logs, I found that etcdmirror-conftool-eqiad-wmnet logs started to replicate keys at 14:11:39 and seems to end at 14:12:03 (I say seems because it restarts at 14:13:09 replicating the same keys again).

This might explain the delay and seems to point to etcd-mirror as a possible culprit but I don't had time to investigate it more right now, but wanted to share this update with you.

etcd-mirror is ok, it's just that writes need to be serialized and etcd2 is orders of magnitude slower at doing that than etcd3.

As a workaround for the switchover, we'll just point all confds to eqiad.

Change 623535 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] confd: only read from the master during the switchover

https://gerrit.wikimedia.org/r/623535

Change 623535 merged by Giuseppe Lavagetto:
[operations/puppet@production] confd: only read from the master during the switchover

https://gerrit.wikimedia.org/r/623535

BBlack added a subscriber: BBlack.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

@Joe, thank you for all the work on this ticket! Would you say that this is resolved since the CRs have all been merged?

BCornwall changed the task status from Open to Stalled.Mar 29 2023, 6:22 PM
BCornwall assigned this task to Joe.