
pybal should automatically reconnect to etcd
Closed, Declined · Public

Description

Today we rebooted conf2001.codfw.wmnet into a new kernel. After the system came back online, we noticed that all codfw LVSs, which use conf2001 as their etcd backend, had no established TCP connections to it. Hosts were thus silently not being pooled/depooled upon admin request.

We need to:

  • make sure pybal attempts reconnecting to etcd in these situations
  • implement an icinga check to alert us whenever a running pybal has 0 established TCP connections to its etcd(s)
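As a rough illustration of the second point, such a check could count established connections to etcd's client port by parsing `ss -tn` output. This is only a sketch under assumed defaults (etcd client port 2379; plain Nagios-style exit codes), not the actual check:

```python
import subprocess

ETCD_PORT = 2379  # default etcd client port; adjust if non-standard


def count_established(ss_output, port):
    """Count ESTAB connections whose peer address ends in :<port>."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # `ss -tn` columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) >= 5 and fields[0] == "ESTAB" and fields[4].endswith(":%d" % port):
            count += 1
    return count


def check_etcd_connections():
    """Return a Nagios-style (exit_code, message) tuple."""
    out = subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout
    n = count_established(out, ETCD_PORT)
    if n == 0:
        return 2, "CRITICAL: no established TCP connections to etcd"
    return 0, "OK: %d established etcd connection(s)" % n
```

An Icinga plugin wrapper would call `check_etcd_connections()` and `sys.exit()` with the returned code.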

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. · Jul 5 2017, 4:13 PM
ema updated the task description.

As @Joe suggested some days ago, we might want to rewrite pybal's etcd code using https://treq.readthedocs.io/.
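For reference, a treq-based etcd v2 watch might look roughly like the sketch below. The URL layout follows etcd's v2 keys API; the key path is illustrative and error handling is elided, so this is a shape, not a drop-in implementation:

```python
# Sketch of an etcd v2 watch loop using treq (Twisted's requests-style
# HTTP client). Only the URL construction is exercised standalone; the
# request itself needs a running reactor and a reachable etcd.


def build_watch_url(host, key, wait_index=None):
    """Build an etcd v2 watch URL for `key`, optionally resuming at wait_index."""
    url = "http://%s:2379/v2/keys%s?wait=true&recursive=true" % (host, key)
    if wait_index is not None:
        url += "&waitIndex=%d" % wait_index
    return url


def watch(host, key, on_change, wait_index=None):
    # Imported lazily so the URL helper stays usable without treq installed.
    import treq

    d = treq.get(build_watch_url(host, key, wait_index))
    d.addCallback(treq.json_content)

    def _rearm(event):
        on_change(event)
        # Resume just past the index we saw, so no event is missed.
        next_index = event["node"]["modifiedIndex"] + 1
        return watch(host, key, on_change, next_index)

    d.addCallback(_rearm)
    return d
```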

Another thing that would be nice is the possibility to specify more than one conf host in profile::pybal::config_host: conf2001.codfw.wmnet, and allow pybal to connect to other hosts in case of connection failures. For example, if conf2001 is down for long maintenance (broken disk, etc.), it would be nice not to have to make any puppet changes to flip pybal's config.


Doesn't even need to be a long maintenance, yesterday's reboot of conf2001 also caused pybal errors.

> Another thing that would be nice is the possibility to specify more than one conf host in profile::pybal::config_host, and allow pybal to connect to more hosts in case of connection failures.

That should be the "add support for SRV records" feature IIRC 😉

One option to support reconnections, SRV records and everything else is to use the (blocking) python-etcd library via defer.deferToThread, as etcd-mirror does.

The issue is that if a deferred is running in a thread there is no way to cancel it until it has finished, so we would need to resort to tricks like the one in etcd-mirror (where each watch is limited to 60 seconds, then repeated), and we would still have to wait ~30s whenever we want to restart pybal.

Until I find a solution to this problem, which doesn't seem likely to be found, tbh, we might need to stick with our code, and just make it more solid.
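The "bounded watch" trick described above can be expressed as a plain loop: the blocking read is always given a short timeout, so the worker gets a chance to notice a shutdown request between chunks. A minimal sketch with illustrative names (not pybal's or etcd-mirror's actual code):

```python
def bounded_watch(blocking_read, should_stop, chunk_seconds=60):
    """Repeatedly issue a blocking watch limited to chunk_seconds.

    blocking_read(timeout) returns an event, or None if the timeout
    expired with nothing to report. Because each call is bounded, a
    shutdown request is noticed within at most chunk_seconds instead
    of blocking indefinitely on an uncancellable read.
    """
    events = []
    while not should_stop():
        event = blocking_read(chunk_seconds)
        if event is not None:
            events.append(event)
    return events
```

In the deferToThread approach, this loop would run inside the worker thread, with `should_stop` flipped to True at shutdown; the wait-for-current-chunk delay is exactly the restart cost mentioned above.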

Today @elukey took care of rebooting conf2003, and the pybals using it (ulsfo) did reconnect automatically.

I've observed the situation a bit more closely on lvs4004, whose TCP connections to conf2003 obviously got closed when the machine was rebooted. They were then re-attempted and stayed half-open (SYN_SENT) for roughly 2 minutes before getting established when the host came back online.

It might thus be a matter of timing: perhaps yesterday's reboot took longer?
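The ~2 minute SYN_SENT window observed above is consistent with Linux's default connect() behavior: with net.ipv4.tcp_syn_retries = 6, the SYN is retransmitted with an exponentially doubling timeout (1 s, 2 s, 4 s, ...), so the attempt gives up after roughly 127 seconds. A quick back-of-the-envelope check, modelling only the retry schedule (no kernel jitter):

```python
def syn_timeout(retries=6, initial=1):
    """Approximate seconds a Linux connect() stays in SYN_SENT before
    failing: initial RTO of 1s, doubled after each retransmission."""
    return sum(initial * 2 ** i for i in range(retries + 1))
```

So if the etcd host stays down longer than about two minutes, the in-flight connect() fails outright and it is then up to pybal to initiate a fresh attempt.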

So at first sight it looked like T169893 fixed this issue, but that's not the case. In particular, after conf1001 had been rebooted today I've noticed that both lvs1003 and lvs1006 had a bunch of established TCP connections to conf1001, while all other eqiad LVSs had none. However, I've tried changing the state of wdqs1002 from 'inactive' to 'yes' to see if lvs1003/1006 would pick up the change. lvs1006 did, while lvs1003 did not.

Among other error messages on lvs1003:

Jul 17 14:46:42 lvs1003 pybal[11221]: [config-etcd] ERROR: failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.web.error.Error'>: 502 Bad Gateway
Jul 17 14:46:42 lvs1003 pybal[11221]: ]

There are multiple log lines for successful connections, such as:

Jul 17 14:46:42 lvs1003 pybal[11221]: [config-etcd] INFO: connected to etcd://conf1001.eqiad.wmnet/conftool/v1/pools/eqiad/scb/citoid/

But none for wdqs. On lvs1006 instead:

Jul 17 15:31:49 lvs1006 pybal[26073]: [config-etcd] INFO: connected to etcd://conf1001.eqiad.wmnet/conftool/v1/pools/eqiad/wdqs/wdqs/

Change 411264 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/411264
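The patch itself is linked above; as a rough illustration of the general idea behind a reconnect timeout, a capped exponential backoff between connection attempts typically looks like the following (the constants are hypothetical, not the values in the patch):

```python
def next_reconnect_delay(current, initial=1.0, factor=2.0, maximum=60.0):
    """Return the delay before the next etcd reconnection attempt.

    None means "first failure": start from the initial delay, then
    double on each subsequent failure up to a cap, so a long outage
    does not turn into a tight reconnect loop.
    """
    if current is None:
        return initial
    return min(current * factor, maximum)
```

On a successful connection the caller would reset the stored delay back to None.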

Change 411264 merged by Ema:
[operations/debs/pybal@master] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/411264

Change 413141 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.14] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/413141

Change 413141 merged by Ema:
[operations/debs/pybal@1.14] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/413141

Change 413145 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413145

Change 413145 merged by Ema:
[operations/debs/pybal@master] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413145

Change 413146 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.14] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413146

Change 413146 merged by Ema:
[operations/debs/pybal@1.14] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413146

Change 413154 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/pybal@master] Improve reactor mocking

https://gerrit.wikimedia.org/r/413154

Change 413154 merged by Ema:
[operations/debs/pybal@master] Improve reactor mocking

https://gerrit.wikimedia.org/r/413154

Change 428303 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/pybal@master] Reset waitIndex on etcd error 401

https://gerrit.wikimedia.org/r/428303

Change 428303 merged by Vgutierrez:
[operations/debs/pybal@master] Reset waitIndex on etcd error 401

https://gerrit.wikimedia.org/r/428303
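For context on this change: etcd's error code 401 (EcodeEventIndexCleared) means the requested waitIndex has fallen out of etcd's event history, so resuming the watch at that index can never succeed. The recovery is to drop the stale index and do a full re-read. Schematically, with illustrative helper names:

```python
ECODE_EVENT_INDEX_CLEARED = 401  # etcd v2: "event in requested index is outdated and cleared"


def handle_watch_error(error_code, wait_index):
    """Return the waitIndex to use for the next watch request.

    On error 401 the stored index is stale: return None so the caller
    performs a full GET (re-sync) and resumes from a fresh
    modifiedIndex. For other errors, keep the index and simply retry.
    """
    if error_code == ECODE_EVENT_INDEX_CLEARED:
        return None
    return wait_index
```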

Change 557024 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Vgutierrez):
[operations/debs/pybal@1.15-stretch] Reset waitIndex on etcd error 401

https://gerrit.wikimedia.org/r/557024

Change 557024 merged by Vgutierrez:
[operations/debs/pybal@1.15-stretch] Reset waitIndex on etcd error 401

https://gerrit.wikimedia.org/r/557024

Change 566248 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/pybal@1.15-stretch] Release 1.15.7

https://gerrit.wikimedia.org/r/566248

Change 566248 merged by Vgutierrez:
[operations/debs/pybal@1.15-stretch] Release 1.15.7

https://gerrit.wikimedia.org/r/566248

Mentioned in SAL (#wikimedia-operations) [2020-01-21T11:23:12Z] <vgutierrez> uploaded pybal 1.15.7 to apt.w.o (stretch) - T169765

Mentioned in SAL (#wikimedia-operations) [2020-01-21T11:23:59Z] <vgutierrez> Updating pybal to 1.15.7 on ulsfo load balancers - T169765

Mentioned in SAL (#wikimedia-operations) [2020-01-21T11:56:12Z] <vgutierrez> upgrading pybal on eqsin and codfw - T169765

Mentioned in SAL (#wikimedia-operations) [2020-01-21T12:19:56Z] <vgutierrez> upgrading pybal on esams and eqiad - T169765

Re-discovered this issue after working on T267065

In the task two etcd/zookeeper nodes were scheduled to be moved: conf1005 and conf1006. We were able to move the former to another rack without any issue, but not the latter since we had some concerns about this hiera config: profile::pybal::config_host: conf1006.eqiad.wmnet

As far as I remember, all the pybals using conf1006 as config_host should be fine if the node goes down, but I am wondering what happens if, for example:

  • a longer maintenance is needed for conf1006 due to unexpected issues.
  • pybals get restarted (regular maintenance, miscommunication, etc..)

Are we safe in this case? I know that we won't be able to pool/depool, but I want to make sure that nothing else behaves in a weird way.
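If multiple config hosts were ever supported, the client-side failover could be as simple as cycling through a candidate list on connection failure. A sketch only; pybal currently accepts a single config_host, and these hostnames are illustrative:

```python
from itertools import cycle


class ConfigHostPool:
    """Rotate through candidate etcd config hosts on connection failure."""

    def __init__(self, hosts):
        self._iter = cycle(hosts)
        self.current = next(self._iter)

    def mark_failed(self):
        """Current host unreachable: switch to the next candidate."""
        self.current = next(self._iter)
        return self.current
```

Combined with a reconnect backoff, this would let pybal ride out a single conf host being down without a puppet change.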

BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

BCornwall subscribed.

Untagging Traffic, as we've agreed not to add any new functionality to pybal. This will still be tracked in the pybal project's collection of issues, however.

jbond added a project: Traffic.

@BCornwall I think this ticket should still be tagged Traffic. If Traffic doesn't intend to work on pybal any more, then the ticket should be declined rather than left open and tagged with no team. (I have done this, so please update if you disagree.)