pybal should automatically reconnect to etcd
Closed, Declined · Public

Assigned To

None

Authored By

	• ema
	Jul 5 2017, 4:13 PM

Description

Today we've rebooted conf2001.codfw.wmnet into a new kernel. After the system came back online, we've noticed that all codfw LVSs, which use conf2001 as their etcd backend, had no established TCP connections with it. Hosts were thus silently not being pooled/depooled upon admin request.

We need to:

make sure pybal attempts reconnecting to etcd in these situations
implement an icinga check to alert us whenever a running pybal has 0 established TCP connections to its etcd(s)
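A minimal sketch of the second item, assuming the check parses the text printed by `ss -tn` and that etcd listens on its default client port 2379 (both are assumptions, and the function names are hypothetical; this is not the check that was eventually deployed). It returns the usual Nagios/Icinga exit codes: 0 for OK, 2 for CRITICAL.

```python
# Hypothetical sketch of an Icinga-style check, not the deployed one.
ETCD_PORT = 2379  # assumption: etcd's default client port


def count_established(ss_output, port=ETCD_PORT):
    """Count ESTAB sockets whose peer port matches, given `ss -tn` text."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # `ss -tn` rows look like: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) >= 5 and fields[0] == "ESTAB":
            peer_port = fields[4].rsplit(":", 1)[-1]
            if peer_port == str(port):
                count += 1
    return count


def check(ss_output, port=ETCD_PORT):
    """Return an (exit code, message) pair following Nagios conventions."""
    established = count_established(ss_output, port)
    if established == 0:
        return 2, "CRITICAL: no established connections to etcd"
    return 0, "OK: %d established connections to etcd" % established
```

Note that this counts connections from any process on the host; a real check would also need to confirm that pybal is running and that the sockets belong to it.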

Details

Subject	Repo	Branch	Lines +/-
Release 1.15.7	operations/debs/pybal	1.15-stretch	+6 -0
Reset waitIndex on etcd error 401	operations/debs/pybal	1.15-stretch	+29 -3
Reset waitIndex on etcd error 401	operations/debs/pybal	master	+29 -3
Improve reactor mocking	operations/debs/pybal	master	+8 -8
1.14.4: Introduce etcd reconnectTimeout	operations/debs/pybal	1.14	+7 -0
1.14.4: Introduce etcd reconnectTimeout	operations/debs/pybal	master	+7 -0
etcd: Introduce reconnectTimeout	operations/debs/pybal	1.14	+61 -6
etcd: Introduce reconnectTimeout	operations/debs/pybal	master	+61 -6

Related Objects

Status	Assigned	Task
Declined	None	T169765 pybal should automatically reconnect to etcd
Resolved	• ema	T169893 pybal should reset the etcdindex it's looking at after losing a connection
Resolved	• ema	T170847 Icinga check for pybal HTTP connections to etcd
Declined	None	T240665 pybal fails to reconnect cleanly to etcd when etcd is restarted

Event Timeline

• ema created this task. Jul 5 2017, 4:13 PM

Restricted Application added a project: SRE. Jul 5 2017, 4:13 PM

Restricted Application added a subscriber: Aklapper.

• ema triaged this task as High priority. Jul 5 2017, 4:13 PM

• ema updated the task description.

• ema moved this task from Backlog to LoadBalancer on the Traffic board. Jul 5 2017, 4:16 PM

As @Joe suggested some days ago, we might want to rewrite pybal's etcd code using https://treq.readthedocs.io/.

Another thing that would be nice is the possibility to specify more than one conf host in profile::pybal::config_host: conf2001.codfw.wmnet, and allow pybal to connect to more hosts in case of connection failures. For example, say conf2001 is down for long maintenance (disk broken, etc..), it would be nice not to make any puppet changes to flip pybal's config.

In T169765#3410598, @elukey wrote:

Another thing that would be nice is the possibility to specify more than one conf host in profile::pybal::config_host: conf2001.codfw.wmnet, and allow pybal to connect to more hosts in case of connection failures. For example, say conf2001 is down for long maintenance (disk broken, etc..), it would be nice not to make any puppet changes to flip pybal's config.

Doesn't even need to be a long maintenance, yesterday's reboot of conf2001 also caused pybal errors.

In T169765#3410598, @elukey wrote:

Another thing that would be nice is the possibility to specify more than one conf host in profile::pybal::config_host: conf2001.codfw.wmnet, and allow pybal to connect to more hosts in case of connection failures.

That should be the "add support for SRV records" feature IIRC 😉

One option to support reconnections, SRV records and everything else is to use the (blocking) python-etcd library via defer.deferToThread, as etcd-mirror does.

The issue is that if a deferred is ongoing in a thread, there is no way to cancel it until it has finished, so we would need to resort to tricks like the one in etcd-mirror (where watch is limited to 60 seconds, then repeated), and still have to wait ~ 30s whenever we want to restart pybal.

Until I find a solution to this problem, which doesn't seem likely to be found, tbh, we might need to stick with our code, and just make it more solid.
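The bounded-watch trick mentioned above can be sketched roughly as follows. This is an illustration in plain Python rather than Twisted, and `blocking_watch` is a hypothetical stand-in for a blocking client call, not the python-etcd or etcd-mirror API. Because the stop flag is only checked between calls, shutdown can take up to one full timeout, which is the restart delay described above.

```python
import threading


def watch_loop(blocking_watch, on_change, stop_event, timeout=60):
    """Repeat a blocking watch, bounding each call by `timeout` seconds.

    `blocking_watch(timeout)` is hypothetical: it is assumed to return a
    change, or None when the timeout expires with nothing to report.
    """
    while not stop_event.is_set():
        # A call already in flight cannot be cancelled; we can only
        # decline to start the next one once stop_event is set.
        change = blocking_watch(timeout)
        if change is not None:
            on_change(change)
```

In a Twisted program the loop itself would run in a worker thread (for example via deferToThread), with `stop_event` set from the reactor thread at shutdown.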

Today @elukey took care of rebooting conf2003, and the pybals using it (ulsfo) did reconnect automatically.

I've observed the situation a bit more closely on lvs4004, whose TCP connections to conf2003 obviously got closed when the machine was rebooted. They were then re-attempted and stayed half-open (SYN_SENT) for roughly 2 minutes before getting established when the host came back online.

It might thus be a matter of timing: perhaps yesterday's reboot took longer?

Mentioned in SAL (#wikimedia-operations) [2017-07-06T14:06:24Z] <ema> restart pybal on lvs4004 T169765

Mentioned in SAL (#wikimedia-operations) [2017-07-06T14:08:13Z] <ema> restart pybal on lvs4002 T169765

Mentioned in SAL (#wikimedia-operations) [2017-07-06T14:20:29Z] <ema> restart pybal on lvs4003 T169765

Mentioned in SAL (#wikimedia-operations) [2017-07-06T14:29:29Z] <ema> restart pybal on lvs4001 T169765

Joe created subtask T169893: pybal should reset the etcdindex it's looking at after losing a connection. Jul 6 2017, 2:35 PM

Mentioned in SAL (#wikimedia-operations) [2017-07-17T14:57:04Z] <ema> restart pybal on lvs100[45] T169765

Mentioned in SAL (#wikimedia-operations) [2017-07-17T14:58:57Z] <ema> restart pybal on lvs100[12] T169765

So at first sight it looked like T169893 fixed this issue, but that's not the case. In particular, after conf1001 had been rebooted today I've noticed that both lvs1003 and lvs1006 had a bunch of established TCP connections to conf1001, while all other eqiad LVSs had none. However, I've tried changing the state of wdqs1002 from 'inactive' to 'yes' to see if lvs1003/1006 would pick up the change. lvs1006 did, while lvs1003 did not.

Among other error messages on lvs1003:

Jul 17 14:46:42 lvs1003 pybal[11221]: [config-etcd] ERROR: failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.web.error.Error'>: 502 Bad Gateway
Jul 17 14:46:42 lvs1003 pybal[11221]: ]

There are multiple lines of successful connections such as:

Jul 17 14:46:42 lvs1003 pybal[11221]: [config-etcd] INFO: connected to etcd://conf1001.eqiad.wmnet/conftool/v1/pools/eqiad/scb/citoid/

But none for wdqs. On lvs1006 instead:

Jul 17 15:31:49 lvs1006 pybal[26073]: [config-etcd] INFO: connected to etcd://conf1001.eqiad.wmnet/conftool/v1/pools/eqiad/wdqs/wdqs/

Mentioned in SAL (#wikimedia-operations) [2017-07-17T15:35:53Z] <ema> restart pybal on lvs100[36] T169765

• ema mentioned this in T170847: Icinga check for pybal HTTP connections to etcd. Jul 17 2017, 4:38 PM

• ema created subtask T170847: Icinga check for pybal HTTP connections to etcd.

• ema mentioned this in T169893: pybal should reset the etcdindex it's looking at after losing a connection. Jul 18 2017, 11:10 AM

• ema closed subtask T169893: pybal should reset the etcdindex it's looking at after losing a connection as Resolved.

• ema moved this task from Backlog to In Progress on the PyBal board. Feb 16 2018, 2:47 PM

Change 411264 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/411264

gerritbot added a project: Patch-For-Review. Feb 16 2018, 3:29 PM

Change 411264 merged by Ema:
[operations/debs/pybal@master] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/411264

Change 413141 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.14] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/413141

Change 413141 merged by Ema:
[operations/debs/pybal@1.14] etcd: Introduce reconnectTimeout

https://gerrit.wikimedia.org/r/413141

Change 413145 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413145

Change 413145 merged by Ema:
[operations/debs/pybal@master] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413145

Change 413146 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.14] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413146

Change 413146 merged by Ema:
[operations/debs/pybal@1.14] 1.14.4: Introduce etcd reconnectTimeout

https://gerrit.wikimedia.org/r/413146

Change 413154 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/pybal@master] Improve reactor mocking

https://gerrit.wikimedia.org/r/413154

Change 413154 merged by Ema:
[operations/debs/pybal@master] Improve reactor mocking

https://gerrit.wikimedia.org/r/413154

• ema closed subtask T170847: Icinga check for pybal HTTP connections to etcd as Resolved. Feb 22 2018, 8:59 AM

Change 428303 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/pybal@master] Reset waitIndex on etcd error 401

https://gerrit.wikimedia.org/r/428303

Change 428303 merged by Vgutierrez:
[operations/debs/pybal@master] Reset waitIndex on etcd error 401

https://gerrit.wikimedia.org/r/428303
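For context, here is a rough sketch of what resetting the wait index on this error could look like. It assumes the "401" in the patch title is the etcd v2 API errorCode 401 (the event in the requested index is outdated and cleared), and the function name and return convention are hypothetical; this is not the code from the merged change.

```python
import json

# Assumption: the "401" in the patch title is this etcd v2 errorCode.
EVENT_INDEX_CLEARED = 401


def next_wait_index(current_wait_index, response_body):
    """Return the waitIndex for the next watch, or None to start afresh."""
    try:
        payload = json.loads(response_body)
    except ValueError:
        # Not JSON (e.g. a proxy error page): leave the index untouched.
        return current_wait_index
    if isinstance(payload, dict) and payload.get("errorCode") == EVENT_INDEX_CLEARED:
        # The requested history is gone, so drop the stale index and let
        # the next request fetch the current state instead.
        return None
    return current_wait_index
```

Returning None here means "issue the next request without a waitIndex"; the caller would then resume watching from whatever index the fresh response reports.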

ArielGlenn removed a project: Patch-For-Review. Nov 13 2018, 11:15 AM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 9:21 PM

jbond subscribed. Jun 24 2019, 3:46 PM

• ema mentioned this in T240665: pybal fails to reconnect cleanly to etcd when etcd is restarted. Dec 13 2019, 10:10 AM

Change 557024 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Vgutierrez):
[operations/debs/pybal@1.15-stretch] Reset waitIndex on etcd error 401

https://gerrit.wikimedia.org/r/557024

gerritbot added a project: Patch-For-Review. Dec 13 2019, 2:26 PM

jcrespo added a subtask: T240665: pybal fails to reconnect cleanly to etcd when etcd is restarted. Dec 13 2019, 2:31 PM

Change 557024 merged by Vgutierrez:
[operations/debs/pybal@1.15-stretch] Reset waitIndex on etcd error 401

https://gerrit.wikimedia.org/r/557024

Change 566248 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/pybal@1.15-stretch] Release 1.15.7

https://gerrit.wikimedia.org/r/566248

Change 566248 merged by Vgutierrez:
[operations/debs/pybal@1.15-stretch] Release 1.15.7

https://gerrit.wikimedia.org/r/566248

Mentioned in SAL (#wikimedia-operations) [2020-01-21T11:23:12Z] <vgutierrez> uploaded pybal 1.15.7 to apt.w.o (stretch) - T169765

Mentioned in SAL (#wikimedia-operations) [2020-01-21T11:23:59Z] <vgutierrez> Updating pybal to 1.15.7 on ulsfo load balancers - T169765

Mentioned in SAL (#wikimedia-operations) [2020-01-21T11:56:12Z] <vgutierrez> upgrading pybal on eqsin and codfw - T169765

Maintenance_bot removed a project: Patch-For-Review. Jan 21 2020, 12:11 PM

Mentioned in SAL (#wikimedia-operations) [2020-01-21T12:19:56Z] <vgutierrez> upgrading pybal on esams and eqiad - T169765

Re-discovered this issue after working on T267065.

In that task two etcd/zookeeper nodes were scheduled to be moved: conf1005 and conf1006. We were able to move the former to another rack without any issue, but not the latter, since we had some concerns about this hiera config: profile::pybal::config_host: conf1006.eqiad.wmnet

As far as I remember, all the pybals using conf1006 as config_host should be fine if the node goes down, but I am wondering what happens if, for example:

a longer maintenance is needed for conf1006 due to unexpected issues.
pybals get restarted (regular maintenance, miscommunication, etc..)

Are we safe in this case? I know that we'll not be able to pool/depool, but what I want to make sure is that nothing behaves in a weird way.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

BCornwall closed subtask T240665: pybal fails to reconnect cleanly to etcd when etcd is restarted as Declined. May 2 2023, 7:59 PM

akosiaris subscribed. Sep 7 2023, 12:28 PM

Untagging traffic as we've agreed on not adding any new functionality in pybal. This will still be tracked in the collection of issues for the pybal project, however.

@BCornwall I think this ticket should still be tagged traffic. If traffic doesn't intend to work on pybal any more, then they should decline the ticket instead of leaving it open and tagged with no team (I have done this, so please update if you disagree).
