pybal should automatically reconnect to etcd
Open, High, Public

Description

Today we rebooted conf2001.codfw.wmnet into a new kernel. After the system came back online, we noticed that all codfw LVSs, which use conf2001 as their etcd backend, had no established TCP connections to it. As a result, hosts were silently not being pooled/depooled when admins requested it.

We need to:

  • make sure pybal attempts reconnecting to etcd in these situations
  • implement an icinga check to alert us whenever a running pybal has 0 established TCP connections to its etcd(s)
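The second item could be sketched roughly as follows. This is a minimal illustration, not the check that was deployed: it assumes the plugin would be fed the output of `ss -tn`, that etcd is reached on its default client port 2379, and the function names are hypothetical. It also counts every established connection from the host to that port rather than only those owned by the pybal process, and it does not verify that pybal is running.

```python
ETCD_PORT = 2379  # assumption: etcd's default client port; adjust to the real setup


def count_established(ss_output, port):
    """Count ESTAB rows in `ss -tn` output whose peer port equals `port`."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        # rsplit keeps working for bracketed IPv6 peers such as [::1]:2379
        peer_port = fields[4].rsplit(":", 1)[-1]
        if peer_port == str(port):
            count += 1
    return count


def check(ss_output, port=ETCD_PORT):
    """Return (exit_code, message) following the Nagios/Icinga plugin convention."""
    established = count_established(ss_output, port)
    if established == 0:
        return 2, "CRITICAL: 0 established connections to etcd port %d" % port
    return 0, "OK: %d established connection(s) to etcd port %d" % (established, port)
```

A wrapper script would run `ss -tn` (for example via `subprocess.run`), pass its standard output to `check()`, print the message and exit with the returned code.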

Details

Related Gerrit Patches:
[operations/debs/pybal@master] Reset waitIndex on etcd error 401
[operations/debs/pybal@master] Improve reactor mocking
[operations/debs/pybal@1.14] 1.14.4: Introduce etcd reconnectTimeout
[operations/debs/pybal@master] 1.14.4: Introduce etcd reconnectTimeout
[operations/debs/pybal@1.14] etcd: Introduce reconnectTimeout
[operations/debs/pybal@master] etcd: Introduce reconnectTimeout

Event Timeline

ema created this task. Jul 5 2017, 4:13 PM
Restricted Application added a project: Operations. Jul 5 2017, 4:13 PM
Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. Jul 5 2017, 4:13 PM
ema updated the task description.
ema moved this task from Triage to LoadBalancer on the Traffic board. Jul 5 2017, 4:16 PM
ema added a comment. Jul 5 2017, 4:19 PM

As @Joe suggested some days ago, we might want to rewrite pybal's etcd code using https://treq.readthedocs.io/.

elukey added a comment. Jul 6 2017, 7:23 AM

Another thing that would be nice is the possibility to specify more than one conf host in profile::pybal::config_host: conf2001.codfw.wmnet, and allow pybal to connect to more hosts in case of connection failures. For example, say conf2001 is down for long maintenance (disk broken, etc.), it would be nice not to have to make any puppet changes to flip pybal's config.

> Another thing that would be nice is the possibility to specify more than one conf host in profile::pybal::config_host: conf2001.codfw.wmnet, and allow pybal to connect to more hosts in case of connection failures. For example, say conf2001 is down for long maintenance (disk broken, etc.), it would be nice not to have to make any puppet changes to flip pybal's config.

It doesn't even need to be a long maintenance: yesterday's reboot of conf2001 also caused pybal errors.

Volans added a subscriber: Volans. Jul 6 2017, 7:26 AM

> Another thing that would be nice is the possibility to specify more than one conf host in profile::pybal::config_host: conf2001.codfw.wmnet, and allow pybal to connect to more hosts in case of connection failures.

That should be the "add support for SRV records" feature IIRC 😉

Joe added a comment. Jul 6 2017, 7:31 AM

One option to support reconnections, SRV records and everything else is to use the (blocking) python-etcd library via Twisted's threads.deferToThread, as etcd-mirror does.

The issue is that if a deferred is running in a thread, there is no way to cancel it until it has finished. We would therefore need to resort to tricks like the one in etcd-mirror (where the watch is limited to 60 seconds, then repeated), and would still have to wait ~30s whenever we want to restart pybal.

Until I find a solution to this problem (which doesn't seem likely, tbh), we might need to stick with our code and just make it more robust.
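To illustrate the limitation described above, here is a minimal sketch that uses the standard library's `concurrent.futures` in place of Twisted's deferToThread, and a plain `time.sleep` as a hypothetical stand-in for a blocking python-etcd watch. All names (`blocking_watch`, `watch_loop`) are made up for the example. It shows that a call already running in a worker thread cannot be cancelled, so the only way to stop is to bound each watch with a timeout and check a stop flag between calls.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_watch(timeout):
    """Hypothetical stand-in for a blocking etcd watch: it never sees a
    change, blocks for `timeout` seconds and then returns None."""
    time.sleep(timeout)
    return None


def watch_loop(watch, stop, timeout, on_change):
    """Call the blocking `watch` repeatedly, each call bounded by `timeout`.

    The stop flag is only checked between calls, so shutting down can take
    up to `timeout` seconds.
    """
    rounds = 0
    while not stop.is_set():
        change = watch(timeout)
        rounds += 1
        if change is not None:
            on_change(change)
    return rounds


stop = threading.Event()
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(watch_loop, blocking_watch, stop, 0.2, print)
    time.sleep(0.05)             # let the worker enter the blocking call
    cancelled = future.cancel()  # a call that is already running cannot be cancelled
    stop.set()                   # ask the loop to exit after the current watch
    rounds = future.result()     # returns only once the in-flight watch has timed out
```

With a 60-second watch timeout, as in etcd-mirror, the same structure means a restart has to wait for whatever is left of the current watch, which is where the ~30s average comes from.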

ema added a comment. Jul 6 2017, 11:39 AM

Today @elukey took care of rebooting conf2003, and the pybals using it (ulsfo) did reconnect automatically.

I've observed the situation a bit more closely on lvs4004, whose TCP connections to conf2003 obviously got closed when the machine was rebooted. They were then re-attempted and stayed half-open (SYN_SENT) for roughly 2 minutes before getting established when the host came back online.

It might thus be a matter of timing: perhaps yesterday's reboot took longer?
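For illustration only, a reconnect loop that does not depend on how long the reboot takes could look like the sketch below. It is not pybal's code and not the reconnectTimeout patch referenced in this task: the function name and parameters are hypothetical, and it uses a plain blocking socket rather than Twisted. Each attempt is bounded by its own timeout, so an attempt that gets no answer is abandoned and retried instead of lingering in SYN_SENT.

```python
import socket
import time


def connect_with_retry(host, port, connect_timeout=5.0, retry_delay=1.0,
                       max_attempts=None, connect=socket.create_connection,
                       sleep=time.sleep):
    """Try to connect until it succeeds or `max_attempts` is exhausted.

    Each attempt is bounded by `connect_timeout`. With max_attempts=None the
    loop retries forever. `connect` and `sleep` are injectable for testing.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            return connect((host, port), timeout=connect_timeout)
        except OSError:
            # Covers timeouts, refused connections and unreachable hosts.
            sleep(retry_delay)
    raise ConnectionError("gave up after %d attempts" % attempt)
```

A caller would use something like `connect_with_retry("conf2001.codfw.wmnet", 2379)`, where the port is again an assumption (etcd's default client port).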

Mentioned in SAL (#wikimedia-operations) [2017-07-06T14:06:24Z] <ema> restart pybal on lvs4004 T169765

Mentioned in SAL (#wikimedia-operations) [2017-07-06T14:08:13Z] <ema> restart pybal on lvs4002 T169765

Mentioned in SAL (#wikimedia-operations) [2017-07-06T14:20:29Z] <ema> restart pybal on lvs4003 T169765

Mentioned in SAL (#wikimedia-operations) [2017-07-06T14:29:29Z] <ema> restart pybal on lvs4001 T169765

Mentioned in SAL (#wikimedia-operations) [2017-07-17T14:57:04Z] <ema> restart pybal on lvs100[45] T169765

Mentioned in SAL (#wikimedia-operations) [2017-07-17T14:58:57Z] <ema> restart pybal on lvs100[12] T169765

ema added a comment. Jul 17 2017, 3:35 PM

At first sight it looked like T169893 had fixed this issue, but that's not the case. After conf1001 was rebooted today, I noticed that both lvs1003 and lvs1006 had a bunch of established TCP connections to conf1001, while all other eqiad LVSs had none. I then tried changing the state of wdqs1002 from 'inactive' to 'yes' to see whether lvs1003/1006 would pick up the change: lvs1006 did, while lvs1003 did not.

Among other error messages on lvs1003:

Jul 17 14:46:42 lvs1003 pybal[11221]: [config-etcd] ERROR: failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.web.error.Error'>: 502 Bad Gateway
Jul 17 14:46:42 lvs1003 pybal[11221]: ]

There are multiple lines of successful connections such as:

Jul 17 14:46:42 lvs1003 pybal[11221]: [config-etcd] INFO: connected to etcd://conf1001.eqiad.wmnet/conftool/v1/pools/eqiad/scb/citoid/

But none for wdqs. On lvs1006 instead:

Jul 17 15:31:49 lvs1006 pybal[26073]: [config-etcd] INFO: connected to etcd://conf1001.eqiad.wmnet/conftool/v1/pools/eqiad/wdqs/wdqs/

Mentioned in SAL (#wikimedia-operations) [2017-07-17T15:35:53Z] <ema> restart pybal on lvs100[36] T169765

ema moved this task from Backlog to In Progress on the Pybal board. Feb 16 2018, 2:47 PM

Change 411264 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/411264

Change 411264 merged by Ema:
[operations/debs/pybal@master] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/411264

Change 413141 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.14] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/413141

Change 413141 merged by Ema:
[operations/debs/pybal@1.14] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/413141

Change 413145 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413145

Change 413145 merged by Ema:
[operations/debs/pybal@master] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413145

Change 413146 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.14] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413146

Change 413146 merged by Ema:
[operations/debs/pybal@1.14] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413146

Change 413154 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/pybal@master] Improve reactor mocking

https://gerrit.wikimedia.org/r/413154

Change 413154 merged by Ema:
[operations/debs/pybal@master] Improve reactor mocking

https://gerrit.wikimedia.org/r/413154

Change 428303 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/pybal@master] Reset waitIndex on etcd error 401

https://gerrit.wikimedia.org/r/428303

Change 428303 merged by Vgutierrez:
[operations/debs/pybal@master] Reset waitIndex on etcd error 401

https://gerrit.wikimedia.org/r/428303

jbond added a subscriber: jbond. Jun 24 2019, 3:46 PM