Unhandled pybal error causing services to be depooled in etcd but not in lvs
Open, HighPublic

Description

After depooling cp3007 with confctl, we've noticed that the system was still serving traffic. Depooling seemed to have worked fine as far as etcd was concerned:

{"cp3007.esams.wmnet": {"pooled": "no", "weight": 1}, "tags": "dc=esams,cluster=cache_misc,service=varnish-fe"}

We found that ipvsadm was not updated on lvs3002. We haven't noticed anything particularly interesting in pybal logs on lvs3002.
However, the following exception was logged on lvs3004:

May 03 17:00:04 lvs3004 pybal[15612]: Unhandled error in Deferred:
May 03 17:00:04 lvs3004 pybal[15612]: Unhandled Error
May 03 17:00:04 lvs3004 pybal[15612]: Traceback (most recent call last):
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 260, in _disconnectSelectable
May 03 17:00:04 lvs3004 pybal[15612]: selectable.connectionLost(f)
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 484, in connectionLost
May 03 17:00:04 lvs3004 pybal[15612]: self._commonConnection.connectionLost(self, reason)
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 298, in connectionLost
May 03 17:00:04 lvs3004 pybal[15612]: protocol.connectionLost(reason)
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 473, in connectionLost
May 03 17:00:04 lvs3004 pybal[15612]: self._flushReceiveBIO()
May 03 17:00:04 lvs3004 pybal[15612]: --- <exception caught here> ---
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 360, in _flushReceiveBIO
May 03 17:00:04 lvs3004 pybal[15612]: bytes = self._tlsConnection.recv(2 ** 15)
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 995, in recv
May 03 17:00:04 lvs3004 pybal[15612]: self._raise_ssl_error(self._ssl, result)
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 871, in _raise_ssl_error
May 03 17:00:04 lvs3004 pybal[15612]: _raise_current_error()
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/OpenSSL/_util.py", line 22, in exception_from_error_queue
May 03 17:00:04 lvs3004 pybal[15612]: raise exceptionType(errors)
May 03 17:00:04 lvs3004 pybal[15612]: OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]
ema created this task.May 10 2016, 3:10 PM
Restricted Application added a project: Operations. · View Herald TranscriptMay 10 2016, 3:10 PM
Restricted Application added subscribers: Zppix, Aklapper. · View Herald Transcript
chasemp triaged this task as High priority.May 11 2016, 7:29 PM

Mentioned in SAL [2016-09-01T16:24:06Z] <ema> restarting pybal on lvs4002 T134893

Mentioned in SAL [2016-09-05T17:43:25Z] <ema> restarting pybal on lvs2002 T134893

BBlack moved this task from Triage to LoadBalancer on the Traffic board.Sep 30 2016, 1:57 PM
Volans added a subscriber: Volans.Jan 18 2017, 8:10 PM

I encountered a similar issue today, this is the log on when it started:

Jan 12 13:19:09 lvs2003 pybal[23011]: [pybal] INFO: [api_80] Initialization complete
Jan 12 13:28:49 lvs2003 pybal[23011]: [parsoid_8000 ProxyFetch] WARN: wtp2019.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s
Jan 12 13:28:49 lvs2003 pybal[23011]: [parsoid_8000] ERROR: Monitoring instance ProxyFetch reports server wtp2019.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/ took longer than 5 seconds.
Jan 12 13:28:59 lvs2003 pybal[23011]: [parsoid_8000] INFO: Server wtp2019.codfw.wmnet (enabled/partially up/not pooled) is up
Jan 12 13:29:12 lvs2003 pybal[23011]: [parsoid_8000 ProxyFetch] WARN: wtp2006.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s
Jan 12 13:29:12 lvs2003 pybal[23011]: [parsoid_8000] ERROR: Monitoring instance ProxyFetch reports server wtp2006.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/ took longer than 5 seconds.
Jan 12 13:29:22 lvs2003 pybal[23011]: [parsoid_8000] INFO: Server wtp2006.codfw.wmnet (enabled/partially up/not pooled) is up
Jan 12 13:32:19 lvs2003 pybal[23011]: [pybal] ERROR: failed: [Failure instance: Traceback: <type 'exceptions.ValueError'>: No JSON object could be decoded
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/protocols/tls.py:477:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/protocols/policies.py:125:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/web/http.py:474:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/web/http.py:485:handleResponseEnd
Jan 12 13:32:19 lvs2003 pybal[23011]: --- <exception caught here> ---
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/pybal/etcd.py:72:handleResponse
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/__init__.py:338:loads
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/decoder.py:366:decode
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/decoder.py:384:raw_decode
Jan 12 13:32:19 lvs2003 pybal[23011]: ]

From there the log is just full of 5 lines per second of:

Jan 12 13:32:19 lvs2003 pybal[23011]: [pybal] ERROR: failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.web.error.Error'>: 400 Bad Request
Jan 12 13:32:19 lvs2003 pybal[23011]: ]

And due to missing alert on this we discovered it today by chance.

Mentioned in SAL (#wikimedia-operations) [2017-01-18T21:00:50Z] <volans> restarted pybal on lvs2004 (passive) T134893

Mentioned in SAL (#wikimedia-operations) [2017-01-18T21:18:53Z] <volans> restarted pybal on lvs2001 (active) T134893

Mentioned in SAL (#wikimedia-operations) [2017-01-24T15:14:23Z] <godog> bounce pybal on lvs1003 - T134893

ema added a comment.EditedJan 26 2017, 2:39 PM

[...]

Jan 12 13:32:19 lvs2003 pybal[23011]: [pybal] ERROR: failed: [Failure instance: Traceback: <type 'exceptions.ValueError'>: No JSON object could be decoded
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/protocols/tls.py:477:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/protocols/policies.py:125:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/web/http.py:474:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/web/http.py:485:handleResponseEnd
Jan 12 13:32:19 lvs2003 pybal[23011]: --- <exception caught here> ---
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/pybal/etcd.py:72:handleResponse
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/__init__.py:338:loads
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/decoder.py:366:decode
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/decoder.py:384:raw_decode
Jan 12 13:32:19 lvs2003 pybal[23011]: ]

The reason for the exception being printed could be a simple connectionLost. Our etcd client inherits from twisted.web.http.HTTPClient, which in case of lost connections calls handleResponseEnd, which when the buffer is not None calls handleResponse. However, we have redefined handleResponse to load json, and that cannot end well with empty strings.

Why that would have any correlation with pybal not depooling correctly is still unclear.

Change 334369 had a related patch set uploaded (by Ema):
etcd.py: log a warning on empty responses from etcd

https://gerrit.wikimedia.org/r/334369

Change 334369 merged by Ema:
etcd.py: log a warning on empty responses from etcd

https://gerrit.wikimedia.org/r/334369

Change 335844 had a related patch set uploaded (by Ema):
Log etcd connection status

https://gerrit.wikimedia.org/r/335844

Change 335844 merged by Ema:
Log etcd connection status

https://gerrit.wikimedia.org/r/335844

Something similar just happened on lvs1003, where thumbor1001 isn't being depooled, latest etcd-related entries in pybal.log

May 24 10:26:03 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 10:26:03 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 10:27:36 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 10:27:36 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 10:55:03 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 10:55:04 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 10:56:36 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 10:56:36 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 11:24:04 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 11:24:04 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 11:25:37 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 11:25:37 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 11:53:04 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 11:53:04 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 11:54:37 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 11:54:37 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/

Mentioned in SAL (#wikimedia-operations) [2017-05-24T12:08:08Z] <godog> bounce pybal on lvs1003 - T134893