Since we've recently fixed the Event-Platform logs, we're starting to see some timeouts on event delivery from the EventBus extension to the proxy service.
See https://logstash.wikimedia.org/goto/749413b11913eaba9e7a37e106c22fb5
We have 16 CPU cores with HyperThreading enabled on each node, but only 8 EventBus proxy service workers, so one immediate thing that we need to do is to increase the number of workers.
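For reference, the sizing logic in sketch form (Tornado's `HTTPServer.start(0)` forks one worker process per CPU; the `pick_worker_count` helper below is hypothetical, just to make the core/worker mismatch explicit):

```python
# Sketch: sizing the Tornado service to the available cores.
# tornado.httpserver.HTTPServer.start(0) forks one worker per CPU;
# here the count is made explicit so it can be tuned in config.
import multiprocessing


def pick_worker_count(configured=None):
    """Use the configured value if given; otherwise one worker per
    logical CPU (HyperThreading counts each thread as a CPU)."""
    if configured:
        return configured
    return multiprocessing.cpu_count()
```

Note that `cpu_count()` reports logical CPUs, so a fixed configured value of 8 undershoots what the hardware can run in parallel.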
Change 389975 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] [EventBus] Increase number of worker processes to 16
+1 ^
There's also this old unmerged patch:
https://gerrit.wikimedia.org/r/#/c/302372/
However, when I tested at the time, it didn't show any performance improvements.
Change 389975 merged by Ottomata:
[operations/puppet@production] [EventBus] Increase number of worker processes to 16
Mentioned in SAL (#wikimedia-operations) [2017-11-08T14:55:09Z] <otto@tin> Started restart [eventlogging/eventbus@41e3418]: Bumping worker processes to 16: T180017
Mentioned in SAL (#wikimedia-operations) [2017-11-08T15:01:43Z] <otto@tin> Started restart [eventlogging/eventbus@41e3418]: Bumping worker processes to 16 on all targets: T180017
Ah! There might not be immediate, apparent perf improvements, but properly handling async events definitely helps when events are received at a rate that's higher than usual. IMHO, work on it should be continued.
Judging by the logs (on mw-log1001, not currently available in logstash), the timeouts did not disappear. Maybe we could consider increasing the request timeout?
Seems like we don't specify it in mediawiki-config, so we're using the default of 5 seconds.
Change 391532 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Increase request timeout to match kafka produce timeout
Change 391532 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Increase request timeout to match kafka produce timeout
Change 391535 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@wmf/1.31.0-wmf.8] Increase request timeout to match kafka produce timeout
Change 391535 merged by jenkins-bot:
[mediawiki/extensions/EventBus@wmf/1.31.0-wmf.8] Increase request timeout to match kafka produce timeout
Mentioned in SAL (#wikimedia-operations) [2017-11-15T14:12:11Z] <zfilipin@tin> Synchronized php-1.31.0-wmf.8/extensions/EventBus/EventBus.php: SWAT: [[gerrit:391535|Increase request timeout to match kafka produce timeout (T180017)]] (duration: 00m 50s)
I've increased the request timeout in the Event-Platform extension to 10 seconds to match the sync timeout in Kafka, but it did not fix the timeout errors.
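To make the constraint explicit, here's a minimal sketch of the relationship between the two timeouts (values are illustrative and `http_timeout_is_safe` is a hypothetical helper, not real extension code):

```python
# Sketch of the timing constraint: the extension's HTTP request timeout
# must be no shorter than the service's synchronous Kafka produce
# timeout, or the HTTP client gives up while the produce is still
# legitimately in flight. Values here are illustrative.
KAFKA_PRODUCE_TIMEOUT_S = 10   # how long the service waits on the broker
HTTP_REQUEST_TIMEOUT_S = 10    # how long the extension waits on the service


def http_timeout_is_safe(http_timeout, produce_timeout):
    # The HTTP timeout should cover the whole produce wait (a small
    # margin on top for network/queueing overhead would be safer still).
    return http_timeout >= produce_timeout
```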
The weirdest part is that the events service never logs anything for these timeouts. I'm wondering if we could add some logging to the service when this happens.
What do you think @Ottomata ?
https://groups.google.com/forum/#!msg/python-tornado/qqF5JQdP6XU/ZB8bI4AIU5sJ seems relevant to the problem.
That example seems to be more about the Tornado HTTP client than the server, but it might still be generally relevant to Tornado coroutines.
I'm wondering if we could add some logging to the service
I'll see what I can do...
Event production to Kafka times out
I'm pretty sure this will be logged, although it is pretty difficult to test this.
The request is not accepted for > 1 second while waiting for available slots on Tornado (if that last one even makes any sense)
I'm just trying to add logging for the case where the client closes the connection before the server does, and not having much luck. on_connection_close should do it, but I think this might only happen if there are pending async coroutines waiting to finish.
I'm testing this change some more, and am wondering if we should just try it. We could deploy it to a single prod server and let it run for a bit to make sure it doesn't break anything. We'd also set async=True in the kafka producer options.
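As a sketch of what sync vs. async produce means here (using a plain `concurrent.futures.Future` as a stand-in for the future-like object that kafka-python's `KafkaProducer.send()` returns; `produce_sync` and `produce_async` are hypothetical helpers, not eventlogging code):

```python
# Sketch of the sync-vs-async produce pattern. With a sync produce the
# Tornado request handler blocks on the delivery result for up to the
# produce timeout; with async it attaches callbacks and returns
# immediately, so a slow broker no longer stalls the HTTP response.
from concurrent.futures import Future


def produce_sync(delivery_future, timeout):
    # Blocks the calling thread until the broker acks or the timeout hits.
    return delivery_future.result(timeout=timeout)


def produce_async(delivery_future, on_ok, on_error):
    # Returns immediately; the delivery outcome is reported via
    # callbacks (which kafka-python runs on its own network thread).
    delivery_future.add_done_callback(
        lambda f: on_error(f.exception()) if f.exception() else on_ok(f.result()))
```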
Change 391634 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Try async kafka producer to fix http timeouts
Change 391634 merged by Ottomata:
[operations/puppet@production] Try async kafka producer to fix http timeouts
Ok, I've merged https://gerrit.wikimedia.org/r/#/c/302372/ and deployed it to kafka1001. I've also merged https://gerrit.wikimedia.org/r/#/c/391634/, which enables async kafka produce requests on kafka1001 only.
302372 also adds an on_connection_close handler to hopefully log when the HTTP client disconnects itself, but I wasn't able to get that handler to fire in my testing.
Ergh, I was hoping we could use this to see if the timeouts would disappear for requests to eventbus on kafka1001... buuut since the MW log only knows about the LVS service, unless my on_connection_close handler fires we won't be able to tell whether kafka1001 no longer sees timeouts.
I guess we can let this run for a few days to make sure nothing bad happens, and then deploy and enable async=True for the other eventbus instances.
Mentioned in SAL (#wikimedia-operations) [2017-11-16T07:17:34Z] <ppchelko@tin> Started deploy [eventlogging/eventbus@872cfb3]: Revert gerrit 302372 due to AssertionError T180017
Mentioned in SAL (#wikimedia-operations) [2017-11-16T07:17:49Z] <ppchelko@tin> Finished deploy [eventlogging/eventbus@872cfb3]: Revert gerrit 302372 due to AssertionError T180017 (duration: 00m 14s)
@Ottomata I've reverted your change on kafka1001 as there's some AttributeError: 'NoneType' object has no attribute 'append' in the local logs.
Change 393637 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Log generic Exceptions with stack trace in eventlogging/service.py
Change 393637 merged by Ottomata:
[eventlogging@master] Log generic Exceptions with stack trace in eventlogging/service.py
Mentioned in SAL (#wikimedia-operations) [2017-11-27T19:34:06Z] <otto@tin> Started deploy [eventlogging/eventbus@3df06ab]: Temp deploy to kafka1001 only to catch bug: T180017
Mentioned in SAL (#wikimedia-operations) [2017-11-27T19:34:19Z] <otto@tin> Finished deploy [eventlogging/eventbus@3df06ab]: Temp deploy to kafka1001 only to catch bug: T180017 (duration: 00m 12s)
I caught a stacktrace:
Nov 27 20:31:41 kafka1001 eventlogging-service-eventbus[12281]: (MainThread) Failed sending event <Event fa22bfcb-d3b1-11e7-a7f0-90b11c2788f0 of schema (u'mediawiki/recentchange', 1)>. AttributeError: 'NoneType' object has no attribute 'append'
Traceback (most recent call last):
  File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 262, in handle_events
    yield self.process_event(event)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 297, in wrapper
    Runner(result, future, yielded)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 962, in __init__
    if self.handle_yield(first_yielded):
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1104, in handle_yield
    self.future, lambda f: self.run())
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 596, in add_future
    lambda future: self.add_callback(callback, future))
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 267, in add_done_callback
    self._callbacks.append(fn)
AttributeError: 'NoneType' object has no attribute 'append'
Possibly related? Not sure:
https://github.com/tornadoweb/tornado/issues/1769
The Tornado version we have in prod does not have this fix.
Mentioned in SAL (#wikimedia-operations) [2017-11-27T22:21:31Z] <otto@tin> Started deploy [eventlogging/eventbus@e024af3]: deploy revert checked out at 3df06ab (we caught one). T180017
Mentioned in SAL (#wikimedia-operations) [2017-11-27T22:21:42Z] <otto@tin> Finished deploy [eventlogging/eventbus@e024af3]: deploy revert checked out at 3df06ab (we caught one). T180017 (duration: 00m 10s)
An interesting but probably completely useless and unrelated piece of information:
I went through a lot of logs about these timeouts and here's the distribution of job types:
1292 total errors in the log file on mwlog1001
762 cirrusSearchLinksUpdatePrioritized
190 enqueue
162 refreshLinks
44 cirrusSearchIncomingLinkCount
5 resource_change - this is not even from JobQueue, it's from RESTBase update events
So the majority of the timeouts being for cirrusSearchLinksUpdatePrioritized is probably just a coincidence rather than anything job-specific.
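For the record, a tally like the one above takes only a few lines of Python (the `type=` field name is an assumption about the log format on mwlog1001, not the actual layout):

```python
# Sketch: count timeout log lines per job type with collections.Counter.
# Assumes each log line carries a "type=<jobType>" token; adjust the
# parsing to the real log format.
from collections import Counter


def job_type_distribution(log_lines):
    counts = Counter()
    for line in log_lines:
        for token in line.split():
            if token.startswith("type="):
                counts[token[len("type="):]] += 1
    return counts.most_common()
```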
So I've spent some time digging into the error stack trace that we're seeing, and it's indeed a thread-safety issue. The Kafka callback is actually run on a thread other than the MainThread, so the ticket I found seems to be the reason.
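A sketch of the fix pattern: hop from the producer's network thread back onto the event loop before touching the future. Tornado documents `IOLoop.add_callback` as its only thread-safe method; the equivalent stdlib asyncio call, `call_soon_threadsafe`, is used below, with a plain thread standing in for kafka-python's network thread:

```python
# Sketch of the thread-safety fix. kafka-python fires the delivery
# callback on its own network thread, but the event-loop future may
# only be completed on the loop thread, so the callback schedules the
# completion back onto the loop instead of setting the result directly.
import asyncio
import threading


async def produce(loop):
    delivered = loop.create_future()

    def kafka_delivery_callback(result):
        # Runs on the producer's network thread: do NOT touch the
        # future directly; hop back onto the loop thread instead.
        loop.call_soon_threadsafe(delivered.set_result, result)

    # Stand-in for the producer's network thread delivering an ack.
    threading.Thread(target=kafka_delivery_callback, args=("acked",)).start()
    return await delivered
```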
Change 395164 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[eventlogging@master] Make the kafka async deliver callback thread-safe
Change 395164 merged by Ottomata:
[eventlogging@master] Make the kafka async deliver callback thread-safe
Mentioned in SAL (#wikimedia-operations) [2017-12-05T18:24:19Z] <ppchelko@tin> Started deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. Limited to kafka1001. T180017
Mentioned in SAL (#wikimedia-operations) [2017-12-05T18:24:33Z] <ppchelko@tin> Finished deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. Limited to kafka1001. T180017 (duration: 00m 14s)
Change 395593 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Enable async for eventbus kafka producer
Change 395593 merged by Ottomata:
[operations/puppet@production] Enable async for eventbus kafka producer
Mentioned in SAL (#wikimedia-operations) [2017-12-05T19:36:15Z] <ppchelko@tin> Started deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. T180017
Mentioned in SAL (#wikimedia-operations) [2017-12-05T19:37:44Z] <ppchelko@tin> Finished deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. T180017 (duration: 01m 29s)
After deploying the above patches, we haven't seen any AttributeError exceptions, and the rate of Event-Platform timeout logs has drastically decreased to almost zero, but it's still not zero :(
I've been digging into it a bit more to figure out the reason for the remaining timeouts, but I haven't found anything suspicious or anything that provides more insight into the problem. I give up for now.
Most timeouts during normal operation seem to have been handled, but there is still the problem of timeouts during broker restarts that should not happen, e.g.:
Mar 19 14:41:36 kafka1001 eventlogging-service-eventbus[1043]: (eventlogging-0f2b4998-2859-11e8-b2fb-141877335242-kafka1001.eqiad.wmnet.1043-network-thread) Connection attempt to <BrokerConnection host=kafka1002.eqiad.wmnet/10.64.16.41 port=9092> timed out
Mar 19 14:41:36 kafka1001 eventlogging-service-eventbus[1043]: (eventlogging-0f2b4998-2859-11e8-b2fb-141877335242-kafka1001.eqiad.wmnet.1043-network-thread) Node 1002 connection failed -- refreshing metadata
Mar 19 14:41:36 kafka1001 eventlogging-service-eventbus[1043]: (MainThread) Failed sending event <Event 8e9fec78-2b83-11e8-8da2-c6cb766cf290 of schema (u'resource_change', 1)>. KafkaTimeoutError:
Traceback (most recent call last):
  File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 263, in handle_events
    yield self.process_event(event)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1021, in run
    yielded = self.gen.throw(*exc_info)
  File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 242, in process_event
    yield self.send(event)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 792, in callback
    result_list.append(f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
KafkaTimeoutError: KafkaTimeoutError: ('Batch containing %s record(s) expired due to timeout while requesting metadata from brokers for %s', 1, TopicPartition(topic='eqiad.resource_change', partition=0))
Mar 19 14:41:36 kafka1001 eventlogging-service-eventbus[1043]: (MainThread) Failed processing event: Failed sending event <Event 8e9fec78-2b83-11e8-8da2-c6cb766cf290 of schema (u'resource_change', 1)>. KafkaTimeoutError:
Change 420355 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set retries=3 for eventbus kafka producer
Change 420355 merged by Ottomata:
[operations/puppet@production] Set retries=3 for eventbus kafka producer
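For reference, a sketch of what the producer settings would look like after this change (only `retries=3` comes from the patch; the other keys and values are illustrative or kafka-python defaults):

```python
# Sketch of the producer configuration after the retries patch.
# With retries > 0, kafka-python re-attempts a failed produce -- e.g.
# when metadata is briefly stale during a broker restart -- instead of
# surfacing a KafkaTimeoutError on the first failure.
EVENTBUS_PRODUCER_CONFIG = {
    'bootstrap_servers': 'kafka1001.eqiad.wmnet:9092',  # illustrative
    'retries': 3,             # the setting from the patch above
    'retry_backoff_ms': 100,  # kafka-python default backoff between retries
    'acks': 1,                # illustrative
}
```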
We are very actively moving to eventgate, so I don't think any work here is relevant anymore.