Page MenuHomePhabricator

pybal etcd coroutine crashed
Closed, ResolvedPublic

Description

P2549 has the log from lvs1004. Same happened on lvs1001 at the same time. Other LVSes (polling different pools) were not affected. Further confctl updates failed to hit pybal->ipvs after this. Restarted pybal on lvs1004 and it picked up the new data on start. Currently lvs1001 is still running with the dead co-routine, in case we need to investigate.

Related Objects

Event Timeline

BBlack assigned this task to Joe.
BBlack raised the priority of this task from to Unbreak Now!.
BBlack updated the task description. (Show Details)
BBlack added projects: PyBal, Traffic.
BBlack subscribed.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

So, I guess this has been some sort of weird race condition. Or that etcd responded with stale/spurious data in some way (the way recursive watch works might have caused that).

Just triggered another one directly and immediately.

The sequence in this case is:

  1. Merged https://gerrit.wikimedia.org/r/#/c/267230/ (removes 1/9 servers from dc=eqiad,cluster=cache_mobile,service=nginx)
  2. puppet-merge's conftool-merge run says:
Now running conftool-merge to sync any changes to conftool data
Running conftool-sync on /etc/conftool/data
eqiad: Removing node cp1060.eqiad.wmnet from cluster cache_mobile/nginx
Done.
  1. pybal on lvs1001 immediately does:
Feb 03 16:48:23 lvs1001 pybal[7181]: Unhandled Error
Feb 03 16:48:23 lvs1001 pybal[7181]: Traceback (most recent call last):
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/pybal.py", line 638, in main
Feb 03 16:48:23 lvs1001 pybal[7181]: reactor.run()
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1192, in run
Feb 03 16:48:23 lvs1001 pybal[7181]: self.mainLoop()
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1204, in mainLoop
Feb 03 16:48:23 lvs1001 pybal[7181]: self.doIteration(t)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/epollreactor.py", line 396, in doPoll
Feb 03 16:48:23 lvs1001 pybal[7181]: log.callWithLogger(selectable, _drdw, selectable, fd, event)
Feb 03 16:48:23 lvs1001 pybal[7181]: --- <exception caught here> ---
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 88, in callWithLogger
Feb 03 16:48:23 lvs1001 pybal[7181]: return callWithContext({"system": lp}, func, *args, **kw)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 73, in callWithContext
Feb 03 16:48:23 lvs1001 pybal[7181]: return context.call({ILogContext: newCtx}, func, *args, **kw)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
Feb 03 16:48:23 lvs1001 pybal[7181]: return self.currentContext().callWithContext(ctx, func, *args, **kw)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
Feb 03 16:48:23 lvs1001 pybal[7181]: return func(*args,**kw)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 627, in _doReadOrWrite
Feb 03 16:48:23 lvs1001 pybal[7181]: self._disconnectSelectable(selectable, why, inRead)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 260, in _disconnectSelectable
Feb 03 16:48:23 lvs1001 pybal[7181]: selectable.connectionLost(f)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 484, in connectionLost
Feb 03 16:48:23 lvs1001 pybal[7181]: self._commonConnection.connectionLost(self, reason)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 298, in connectionLost
Feb 03 16:48:23 lvs1001 pybal[7181]: protocol.connectionLost(reason)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 477, in connectionLost
Feb 03 16:48:23 lvs1001 pybal[7181]: ProtocolWrapper.connectionLost(self, reason)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/policies.py", line 125, in connectionLost
Feb 03 16:48:23 lvs1001 pybal[7181]: self.wrappedProtocol.connectionLost(reason)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 474, in connectionLost
Feb 03 16:48:23 lvs1001 pybal[7181]: self.handleResponseEnd()
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 485, in handleResponseEnd
Feb 03 16:48:23 lvs1001 pybal[7181]: self.handleResponse(b)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 71, in handleResponse
Feb 03 16:48:23 lvs1001 pybal[7181]: self.factory.onUpdate(config, self.etcdIndex)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 144, in onUpdate
Feb 03 16:48:23 lvs1001 pybal[7181]: config.update(decode_etcd_data(update))
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 42, in decode_etcd_data
Feb 03 16:48:23 lvs1001 pybal[7181]: key, value = decode_node(node)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 28, in decode_node
Feb 03 16:48:23 lvs1001 pybal[7181]: value = json.loads(node['value'])
Feb 03 16:48:23 lvs1001 pybal[7181]: exceptions.KeyError: 'value'
Feb 03 16:48:23 lvs1001 pybal[7181]: Unhandled Error
Feb 03 16:48:23 lvs1001 pybal[7181]: Traceback (most recent call last):
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/pybal.py", line 638, in main
Feb 03 16:48:23 lvs1001 pybal[7181]: reactor.run()
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1192, in run
Feb 03 16:48:23 lvs1001 pybal[7181]: self.mainLoop()
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1204, in mainLoop
Feb 03 16:48:23 lvs1001 pybal[7181]: self.doIteration(t)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/epollreactor.py", line 396, in doPoll
Feb 03 16:48:23 lvs1001 pybal[7181]: log.callWithLogger(selectable, _drdw, selectable, fd, event)
Feb 03 16:48:23 lvs1001 pybal[7181]: --- <exception caught here> ---
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 88, in callWithLogger
Feb 03 16:48:23 lvs1001 pybal[7181]: return callWithContext({"system": lp}, func, *args, **kw)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 73, in callWithContext
Feb 03 16:48:23 lvs1001 pybal[7181]: return context.call({ILogContext: newCtx}, func, *args, **kw)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
Feb 03 16:48:23 lvs1001 pybal[7181]: return self.currentContext().callWithContext(ctx, func, *args, **kw)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
Feb 03 16:48:23 lvs1001 pybal[7181]: return func(*args,**kw)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 627, in _doReadOrWrite
Feb 03 16:48:23 lvs1001 pybal[7181]: self._disconnectSelectable(selectable, why, inRead)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 260, in _disconnectSelectable
Feb 03 16:48:23 lvs1001 pybal[7181]: selectable.connectionLost(f)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 484, in connectionLost
Feb 03 16:48:23 lvs1001 pybal[7181]: self._commonConnection.connectionLost(self, reason)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 298, in connectionLost
Feb 03 16:48:23 lvs1001 pybal[7181]: protocol.connectionLost(reason)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 477, in connectionLost
Feb 03 16:48:23 lvs1001 pybal[7181]: ProtocolWrapper.connectionLost(self, reason)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/policies.py", line 125, in connectionLost
Feb 03 16:48:23 lvs1001 pybal[7181]: self.wrappedProtocol.connectionLost(reason)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 474, in connectionLost
Feb 03 16:48:23 lvs1001 pybal[7181]: self.handleResponseEnd()
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 485, in handleResponseEnd
Feb 03 16:48:23 lvs1001 pybal[7181]: self.handleResponse(b)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 71, in handleResponse
Feb 03 16:48:23 lvs1001 pybal[7181]: self.factory.onUpdate(config, self.etcdIndex)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 144, in onUpdate
Feb 03 16:48:23 lvs1001 pybal[7181]: config.update(decode_etcd_data(update))
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 42, in decode_etcd_data
Feb 03 16:48:23 lvs1001 pybal[7181]: key, value = decode_node(node)
Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 28, in decode_node
Feb 03 16:48:23 lvs1001 pybal[7181]: value = json.loads(node['value'])
Feb 03 16:48:23 lvs1001 pybal[7181]: exceptions.KeyError: 'value'

Ok I think I pinned down a case where this can happen:

whenever we delete a directory, what happens is that etcd sends down the wire

{"action":"delete","node":{"key":"/testdir","dir":true,"modifiedIndex":7,"createdIndex":4},"prevNode":{"key":"/testdir","dir":true,"modifiedIndex":4,"createdIndex":4}}

where no 'value' is present, hence the exception.

I think the pretty clever way we are handling the etcd data over the wire is a bit simplistic at times, and I will fix it today.

That happens more in general when we remove a node.

Change 268361 had a related patch set uploaded (by Giuseppe Lavagetto):
Add tests for etcd

https://gerrit.wikimedia.org/r/268361

I just wrote some unit tests and they confirm my suspicion:

we do not manage DELETE in etcd within pybal.

The tests can be seen in the (still failing) PS

Change 268382 had a related patch set uploaded (by Giuseppe Lavagetto):
Make etcd updates more fault-tolerant

https://gerrit.wikimedia.org/r/268382

I think the two PS I just submitted should solve the issue correctly, as they do in my preliminary testing.

What I am going to do now is:

  1. Reproduce the problem on pybal-test2001
  2. Check out the two patches and confirm those fix the problem

Problem reproduced and it's properly fixed when https://gerrit.wikimedia.org/r/268361 is applied.

Change 268361 merged by Giuseppe Lavagetto:
Allow the etcd driver to handle deleted or inactive nodes

https://gerrit.wikimedia.org/r/268361

Change 268382 merged by jenkins-bot:
Make etcd updates more fault-tolerant

https://gerrit.wikimedia.org/r/268382

Joe set Security to None.