Page MenuHomePhabricator

Unhandled pybal error causing services to be depooled in etcd but not in lvs
Closed, ResolvedPublic

Description

After depooling cp3007 with confctl, we've noticed that the system was still serving traffic. Depooling seemed to have worked fine as far as etcd was concerned:

{"cp3007.esams.wmnet": {"pooled": "no", "weight": 1}, "tags": "dc=esams,cluster=cache_misc,service=varnish-fe"}

We found that ipvsadm was not updated on lvs3002. We haven't noticed anything particularly interesting in pybal logs on lvs3002.
However, the following exception was logged on lvs3004:

May 03 17:00:04 lvs3004 pybal[15612]: Unhandled error in Deferred:
May 03 17:00:04 lvs3004 pybal[15612]: Unhandled Error
May 03 17:00:04 lvs3004 pybal[15612]: Traceback (most recent call last):
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 260, in _disconnectSelectable
May 03 17:00:04 lvs3004 pybal[15612]: selectable.connectionLost(f)
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 484, in connectionLost
May 03 17:00:04 lvs3004 pybal[15612]: self._commonConnection.connectionLost(self, reason)
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 298, in connectionLost
May 03 17:00:04 lvs3004 pybal[15612]: protocol.connectionLost(reason)
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 473, in connectionLost
May 03 17:00:04 lvs3004 pybal[15612]: self._flushReceiveBIO()
May 03 17:00:04 lvs3004 pybal[15612]: --- <exception caught here> ---
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 360, in _flushReceiveBIO
May 03 17:00:04 lvs3004 pybal[15612]: bytes = self._tlsConnection.recv(2 ** 15)
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 995, in recv
May 03 17:00:04 lvs3004 pybal[15612]: self._raise_ssl_error(self._ssl, result)
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 871, in _raise_ssl_error
May 03 17:00:04 lvs3004 pybal[15612]: _raise_current_error()
May 03 17:00:04 lvs3004 pybal[15612]: File "/usr/lib/python2.7/dist-packages/OpenSSL/_util.py", line 22, in exception_from_error_queue
May 03 17:00:04 lvs3004 pybal[15612]: raise exceptionType(errors)
May 03 17:00:04 lvs3004 pybal[15612]: OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · View Herald Transcript

Mentioned in SAL [2016-09-01T16:24:06Z] <ema> restarting pybal on lvs4002 T134893

Mentioned in SAL [2016-09-05T17:43:25Z] <ema> restarting pybal on lvs2002 T134893

I encountered a similar issue today, this is the log on when it started:

Jan 12 13:19:09 lvs2003 pybal[23011]: [pybal] INFO: [api_80] Initialization complete
Jan 12 13:28:49 lvs2003 pybal[23011]: [parsoid_8000 ProxyFetch] WARN: wtp2019.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s
Jan 12 13:28:49 lvs2003 pybal[23011]: [parsoid_8000] ERROR: Monitoring instance ProxyFetch reports server wtp2019.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/ took longer than 5 seconds.
Jan 12 13:28:59 lvs2003 pybal[23011]: [parsoid_8000] INFO: Server wtp2019.codfw.wmnet (enabled/partially up/not pooled) is up
Jan 12 13:29:12 lvs2003 pybal[23011]: [parsoid_8000 ProxyFetch] WARN: wtp2006.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s
Jan 12 13:29:12 lvs2003 pybal[23011]: [parsoid_8000] ERROR: Monitoring instance ProxyFetch reports server wtp2006.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/ took longer than 5 seconds.
Jan 12 13:29:22 lvs2003 pybal[23011]: [parsoid_8000] INFO: Server wtp2006.codfw.wmnet (enabled/partially up/not pooled) is up
Jan 12 13:32:19 lvs2003 pybal[23011]: [pybal] ERROR: failed: [Failure instance: Traceback: <type 'exceptions.ValueError'>: No JSON object could be decoded
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/protocols/tls.py:477:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/protocols/policies.py:125:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/web/http.py:474:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/web/http.py:485:handleResponseEnd
Jan 12 13:32:19 lvs2003 pybal[23011]: --- <exception caught here> ---
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/pybal/etcd.py:72:handleResponse
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/__init__.py:338:loads
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/decoder.py:366:decode
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/decoder.py:384:raw_decode
Jan 12 13:32:19 lvs2003 pybal[23011]: ]

From there the log is just full of 5 lines per second of:

Jan 12 13:32:19 lvs2003 pybal[23011]: [pybal] ERROR: failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.web.error.Error'>: 400 Bad Request
Jan 12 13:32:19 lvs2003 pybal[23011]: ]

And due to missing alert on this we discovered it today by chance.

Mentioned in SAL (#wikimedia-operations) [2017-01-18T21:00:50Z] <volans> restarted pybal on lvs2004 (passive) T134893

Mentioned in SAL (#wikimedia-operations) [2017-01-18T21:18:53Z] <volans> restarted pybal on lvs2001 (active) T134893

[...]

Jan 12 13:32:19 lvs2003 pybal[23011]: [pybal] ERROR: failed: [Failure instance: Traceback: <type 'exceptions.ValueError'>: No JSON object could be decoded
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/protocols/tls.py:477:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/protocols/policies.py:125:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/web/http.py:474:connectionLost
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/twisted/web/http.py:485:handleResponseEnd
Jan 12 13:32:19 lvs2003 pybal[23011]: --- <exception caught here> ---
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/dist-packages/pybal/etcd.py:72:handleResponse
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/__init__.py:338:loads
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/decoder.py:366:decode
Jan 12 13:32:19 lvs2003 pybal[23011]: /usr/lib/python2.7/json/decoder.py:384:raw_decode
Jan 12 13:32:19 lvs2003 pybal[23011]: ]

The reason for the exception being printed could be a simple connectionLost. Our etcd client inherits from twisted.web.http.HTTPClient, which in case of lost connections calls handleResponseEnd, which when the buffer is not None calls handleResponse. However, we have redefined handleResponse to load json, and that cannot end well with empty strings.

Why that would have any correlation with pybal not depooling correctly is still unclear.

Change 334369 had a related patch set uploaded (by Ema):
etcd.py: log a warning on empty responses from etcd

https://gerrit.wikimedia.org/r/334369

Change 334369 merged by Ema:
etcd.py: log a warning on empty responses from etcd

https://gerrit.wikimedia.org/r/334369

Change 335844 had a related patch set uploaded (by Ema):
Log etcd connection status

https://gerrit.wikimedia.org/r/335844

Change 335844 merged by Ema:
Log etcd connection status

https://gerrit.wikimedia.org/r/335844

Something similar just happened on lvs1003, where thumbor1001 isn't being depooled, latest etcd-related entries in pybal.log

May 24 10:26:03 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 10:26:03 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 10:27:36 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 10:27:36 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 10:55:03 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 10:55:04 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 10:56:36 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 10:56:36 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 11:24:04 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 11:24:04 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 11:25:37 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 11:25:37 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 11:53:04 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 11:53:04 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/
May 24 11:54:37 lvs1003 pybal[32511]: [config-etcd] INFO: client connection closed cleanly
May 24 11:54:37 lvs1003 pybal[32511]: [config-etcd] INFO: connected to etcd://conf2002.codfw.wmnet/conftool/v1/pools/eqiad/api_appserver/apache2/

Change 367662 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] pybal::monitoring: add check_pybal_ipvs_diff

https://gerrit.wikimedia.org/r/367662

Change 367662 merged by Ema:
[operations/puppet@production] pybal::monitoring: add check_pybal_ipvs_diff

https://gerrit.wikimedia.org/r/367662

Change 368416 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] pybal::monitoring: add OK message to check_pybal_ipvs_diff

https://gerrit.wikimedia.org/r/368416

The added PyBal IPVS diff check is flapping a bit with UNKNOWN for some hosts (lvs100[3,6,9], lvs200[3,6]) with message:

HTTPConnectionPool(host='localhost', port=9090): Read timed out. (read timeout=1.0)
$ grep -c "PyBal IPVS diff check" icinga.log
34

When specifying the timeout in Requests you can use a tuple to put different values for connect and read timeouts. My guess is that sometimes on those hosts PyBal is not able to reply within the 1s timeout and we might need a larger one.

Change 368416 merged by Ema:
[operations/puppet@production] pybal::monitoring: bump check_pybal_ipvs_diff timeout

https://gerrit.wikimedia.org/r/368416

The added PyBal IPVS diff check is flapping a bit with UNKNOWN for some hosts (lvs100[3,6,9], lvs200[3,6]) with message:

HTTPConnectionPool(host='localhost', port=9090): Read timed out. (read timeout=1.0)
$ grep -c "PyBal IPVS diff check" icinga.log
34

When specifying the timeout in Requests you can use a tuple to put different values for connect and read timeouts. My guess is that sometimes on those hosts PyBal is not able to reply within the 1s timeout and we might need a larger one.

Yeah we'll almost certainly need a longer timeout. Even under ideal PyBal-level conditions, I wouldn't expect us to always get an answer in under a second. That it's happening often enough to really notice, and that most of the servers listed are low-traffic (which has tons of services), makes me think it's indicative of deeper issues, too. If PyBal can't answer these queries in under a second, we may be reaching some scaling limits in its design (the number of concurrent events happening in a single thread in twisted backlogging on cpu+waste?).

That it's happening often enough to really notice, and that most of the servers listed are low-traffic (which has tons of services), makes me think it's indicative of deeper issues, too. If PyBal can't answer these queries in under a second, we may be reaching some scaling limits in its design (the number of concurrent events happening in a single thread in twisted backlogging on cpu+waste?).

I've bumped the python-requests timeout to 5s earlier on today and that does seem to help a bit but does not solve the problem completely. Only low-traffic LVSs are affected since the timeout increase, which supports the theory of scalability problems in PyBal.

The timeout is set to 10s on lvs2006 since 2017-07-31 13:09:02 and there's been only one occurrence of the problem since then, in the form of a nrpe timeout this time (nrpe::monitor_service's $timeout defaults to 10s):

[Mon Jul 31 13:59:31 2017] SERVICE ALERT: lvs2006;PyBal IPVS diff check;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.

Change 369006 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] pybal::monitoring: check_pybal_ipvs_diff interval/timeouts

https://gerrit.wikimedia.org/r/369006

Change 369006 merged by Ema:
[operations/puppet@production] pybal::monitoring: check_pybal_ipvs_diff interval/timeouts

https://gerrit.wikimedia.org/r/369006

@ema @Vgutierrez AIUI this bug is resolved since we've fixed the EtcdConfigObserver class. Resolving this, please feel free to re-open it in case.