
Add etcdmirror connection retry on etcd-tls-proxy unavailability
Open, Needs Triage, Public

Description

If possible, etcdmirror should not crash right away when etcd-tls-proxy is unavailable, for example while nginx is being updated or during a network blip.

Can we consider adding a retry with exponential backoff for a few seconds?
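
To make the proposal concrete, here is a minimal sketch of a time-bounded retry with exponential backoff and jitter. It is not taken from etcdmirror: the helper name retry_with_backoff and all of its parameters are hypothetical, and the built-in ConnectionError stands in for whatever exception the etcd client raises when the proxy is down (judging by the log message this is probably python-etcd's EtcdConnectionFailed, but that has not been checked against the etcdmirror code).

```python
import random
import time


def retry_with_backoff(func, retry_on=(ConnectionError,), max_elapsed=10.0,
                       base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call func() until it succeeds or the retry budget is exhausted.

    All names and defaults here are illustrative assumptions, not
    etcdmirror's actual settings.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            # Exponential growth (0.1s, 0.2s, 0.4s, ...) capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount up to the ceiling so that
            # several clients do not reconnect in lockstep.
            delay = random.uniform(0, ceiling)
            # Give up (re-raise the original error) if the next wait would
            # push us past the overall budget of max_elapsed seconds.
            if clock() - start + delay > max_elapsed:
                raise
            sleep(delay)
            attempt += 1
```

The watch call that currently crashes could then be wrapped, for example as retry_with_backoff(lambda: client.read(...), retry_on=(SomeEtcdError,)), so that a short proxy restart is absorbed while a longer outage still ends in the same critical error as today.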

Event Timeline

Logs from the time of the incident point to an uncaught exception raised by etcd.Client.

Sep  8 15:17:23 conf2005 etcdmirror-conftool-eqiad-wmnet[27008]: CRITICAL: Generic error: Connection to etcd failed due to MaxRetryError("HTTPSConnectionPool(host='conf1009.eqiad.wmnet', port=4001): Max retries exceeded with url: /v2/keys/conftool?waitIndex=1012098&recursive=true&wait=true (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa1400d56d0>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
Sep  8 15:17:23 conf2005 etcdmirror-conftool-eqiad-wmnet[27008]: [etcd-mirror] CRITICAL: Generic error: Connection to etcd failed due to MaxRetryError("HTTPSConnectionPool(host='conf1009.eqiad.wmnet', port=4001): Max retries exceeded with url: /v2/keys/conftool?waitIndex=1012098&recursive=true&wait=true (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa1400d56d0>: Failed to establish a new connection: [Errno 111] Connection refused',))",)

What I don't understand is why the python etcd lib client would fail on connection to only one of the etcd servers and not retry on other servers of the cluster.

Joe subscribed.

> What I don't understand is why the python etcd lib client would fail on connection to only one of the etcd servers and not retry on other servers of the cluster.

That's because we don't pass etcdmirror the whole cluster but just a specific server to connect to.
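
For illustration only, this is roughly what handing the client more than one endpoint could look like. python-etcd's Client accepts a tuple of (host, port) pairs as host and, with allow_reconnect=True, tries the next machine when a connection fails. The helpers parse_endpoints and client_kwargs and the second hostname are hypothetical, and whether failing over to another server is acceptable for a mirror is not settled in this thread.

```python
def parse_endpoints(spec, default_port=4001):
    """Turn 'host1:4001,host2' into ((host1, 4001), (host2, 4001))."""
    endpoints = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        endpoints.append((host, int(port) if port else default_port))
    return tuple(endpoints)


def client_kwargs(spec):
    """Build keyword arguments for etcd.Client from an endpoint list."""
    return {
        "host": parse_endpoints(spec),
        "protocol": "https",
        # Lets python-etcd move on to the next endpoint on failure.
        "allow_reconnect": True,
    }


# Usage (requires python-etcd; the second hostname is made up):
#   import etcd
#   client = etcd.Client(**client_kwargs(
#       "conf1009.eqiad.wmnet:4001,conf-other.example.org:4001"))
```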

Volans subscribed.