P2549 has the log from lvs1004. Same happened on lvs1001 at the same time. Other LVSes (polling different pools) were not affected. Further confctl updates failed to hit pybal->ipvs after this. Restarted pybal on lvs1004 and it picked up the new data on start. Currently lvs1001 is still running with the dead co-routine, in case we need to investigate.
Description
Details
Related Objects
- Mentioned Here
- P2549 pybal error, etcd related?
Event Timeline
So, I guess this has been some sort of weird race condition. Or that etcd responded with stale/spurious data in some way (the way recursive watch works might have caused that).
Just triggered another one directly and immediately.
The sequence in this case is:
- Merged https://gerrit.wikimedia.org/r/#/c/267230/ (removes 1/9 servers from dc=eqiad,cluster=cache_mobile,service=nginx)
- puppet-merge's conftool-merge run says:
Now running conftool-merge to sync any changes to conftool data Running conftool-sync on /etc/conftool/data eqiad: Removing node cp1060.eqiad.wmnet from cluster cache_mobile/nginx Done.
- pybal on lvs1001 immediately does:
Feb 03 16:48:23 lvs1001 pybal[7181]: Unhandled Error Feb 03 16:48:23 lvs1001 pybal[7181]: Traceback (most recent call last): Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/pybal.py", line 638, in main Feb 03 16:48:23 lvs1001 pybal[7181]: reactor.run() Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1192, in run Feb 03 16:48:23 lvs1001 pybal[7181]: self.mainLoop() Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1204, in mainLoop Feb 03 16:48:23 lvs1001 pybal[7181]: self.doIteration(t) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/epollreactor.py", line 396, in doPoll Feb 03 16:48:23 lvs1001 pybal[7181]: log.callWithLogger(selectable, _drdw, selectable, fd, event) Feb 03 16:48:23 lvs1001 pybal[7181]: --- <exception caught here> --- Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 88, in callWithLogger Feb 03 16:48:23 lvs1001 pybal[7181]: return callWithContext({"system": lp}, func, *args, **kw) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 73, in callWithContext Feb 03 16:48:23 lvs1001 pybal[7181]: return context.call({ILogContext: newCtx}, func, *args, **kw) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext Feb 03 16:48:23 lvs1001 pybal[7181]: return self.currentContext().callWithContext(ctx, func, *args, **kw) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext Feb 03 16:48:23 lvs1001 pybal[7181]: return func(*args,**kw) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 627, in _doReadOrWrite Feb 03 16:48:23 lvs1001 pybal[7181]: self._disconnectSelectable(selectable, why, inRead) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 260, in _disconnectSelectable Feb 03 16:48:23 lvs1001 pybal[7181]: selectable.connectionLost(f) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 484, in connectionLost Feb 03 16:48:23 lvs1001 pybal[7181]: self._commonConnection.connectionLost(self, reason) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 298, in connectionLost Feb 03 16:48:23 lvs1001 pybal[7181]: protocol.connectionLost(reason) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 477, in connectionLost Feb 03 16:48:23 lvs1001 pybal[7181]: ProtocolWrapper.connectionLost(self, reason) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/policies.py", line 125, in connectionLost Feb 03 16:48:23 lvs1001 pybal[7181]: self.wrappedProtocol.connectionLost(reason) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 474, in connectionLost Feb 03 16:48:23 lvs1001 pybal[7181]: self.handleResponseEnd() Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 485, in handleResponseEnd Feb 03 16:48:23 lvs1001 pybal[7181]: self.handleResponse(b) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 71, in handleResponse Feb 03 16:48:23 lvs1001 pybal[7181]: self.factory.onUpdate(config, self.etcdIndex) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 144, in onUpdate Feb 03 16:48:23 lvs1001 pybal[7181]: config.update(decode_etcd_data(update)) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 42, in decode_etcd_data Feb 03 16:48:23 lvs1001 pybal[7181]: key, value = decode_node(node) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 28, in decode_node Feb 03 16:48:23 lvs1001 pybal[7181]: value = json.loads(node['value']) Feb 03 16:48:23 lvs1001 pybal[7181]: exceptions.KeyError: 'value' Feb 03 16:48:23 lvs1001 pybal[7181]: Unhandled Error Feb 03 16:48:23 lvs1001 pybal[7181]: Traceback (most recent call last): Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/pybal.py", line 638, in main Feb 03 16:48:23 lvs1001 pybal[7181]: reactor.run() Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1192, in run Feb 03 16:48:23 lvs1001 pybal[7181]: self.mainLoop() Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1204, in mainLoop Feb 03 16:48:23 lvs1001 pybal[7181]: self.doIteration(t) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/epollreactor.py", line 396, in doPoll Feb 03 16:48:23 lvs1001 pybal[7181]: log.callWithLogger(selectable, _drdw, selectable, fd, event) Feb 03 16:48:23 lvs1001 pybal[7181]: --- <exception caught here> --- Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 88, in callWithLogger Feb 03 16:48:23 lvs1001 pybal[7181]: return callWithContext({"system": lp}, func, *args, **kw) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 73, in callWithContext Feb 03 16:48:23 lvs1001 pybal[7181]: return context.call({ILogContext: newCtx}, func, *args, **kw) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext Feb 03 16:48:23 lvs1001 pybal[7181]: return self.currentContext().callWithContext(ctx, func, *args, **kw) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext Feb 03 16:48:23 lvs1001 pybal[7181]: return func(*args,**kw) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 627, in _doReadOrWrite Feb 03 16:48:23 lvs1001 pybal[7181]: self._disconnectSelectable(selectable, why, inRead) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 260, in _disconnectSelectable Feb 03 16:48:23 lvs1001 pybal[7181]: selectable.connectionLost(f) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 484, in connectionLost Feb 03 16:48:23 lvs1001 pybal[7181]: self._commonConnection.connectionLost(self, reason) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 298, in connectionLost Feb 03 16:48:23 lvs1001 pybal[7181]: protocol.connectionLost(reason) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 477, in connectionLost Feb 03 16:48:23 lvs1001 pybal[7181]: ProtocolWrapper.connectionLost(self, reason) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/protocols/policies.py", line 125, in connectionLost Feb 03 16:48:23 lvs1001 pybal[7181]: self.wrappedProtocol.connectionLost(reason) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 474, in connectionLost Feb 03 16:48:23 lvs1001 pybal[7181]: self.handleResponseEnd() Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 485, in handleResponseEnd Feb 03 16:48:23 lvs1001 pybal[7181]: self.handleResponse(b) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 71, in handleResponse Feb 03 16:48:23 lvs1001 pybal[7181]: self.factory.onUpdate(config, self.etcdIndex) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 144, in onUpdate Feb 03 16:48:23 lvs1001 pybal[7181]: config.update(decode_etcd_data(update)) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 42, in decode_etcd_data Feb 03 16:48:23 lvs1001 pybal[7181]: key, value = decode_node(node) Feb 03 16:48:23 lvs1001 pybal[7181]: File "/usr/lib/python2.7/dist-packages/pybal/etcd.py", line 28, in decode_node Feb 03 16:48:23 lvs1001 pybal[7181]: value = json.loads(node['value']) Feb 03 16:48:23 lvs1001 pybal[7181]: exceptions.KeyError: 'value'
Ok I think I pinned down a case where this can happen:
whenever we delete a directory, what happens is that etcd sends down the wire
{"action":"delete","node":{"key":"/testdir","dir":true,"modifiedIndex":7,"createdIndex":4},"prevNode":{"key":"/testdir","dir":true,"modifiedIndex":4,"createdIndex":4}}
where no 'value' is present, hence the exception.
I think the pretty clever way we are handling the etcd data over the wire is a bit simplistic at times, and I will fix it today.
Change 268361 had a related patch set uploaded (by Giuseppe Lavagetto):
Add tests for etcd
I just wrote some unit tests and they confirm my suspicion:
we do not manage DELETE in etcd within pybal.
The tests can be seen in the (still failing) PS
Change 268382 had a related patch set uploaded (by Giuseppe Lavagetto):
Make etcd updates more fault-tolerant
I think the two PS I just submitted should solve the issue correctly, as they do in my preliminary testing.
What I am going to do now is:
- Reproduce the problem on pybal-test2001
- Check out the two patches and confirm those fix the problem
Problem reproduced and it's properly fixed when https://gerrit.wikimedia.org/r/268361 is applied.
Change 268361 merged by Giuseppe Lavagetto:
Allow the etcd driver to handle deleted or inactive nodes