
Timeouts on event delivery to EventBus
Closed, Invalid · Public · PRODUCTION ERROR

Description

Since we've recently fixed the Event-Platform logs, we're starting to see some timeouts on event delivery from the EventBus extension to the proxy service.

See https://logstash.wikimedia.org/goto/749413b11913eaba9e7a37e106c22fb5

Event Timeline

Restricted Application added a subscriber: Aklapper.

We have 16 CPU cores with HyperThreading enabled on each node, but only 8 EventBus proxy service workers, so one immediate thing that we need to do is to increase the number of workers.
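For illustration only, here is a minimal sketch (handler name, port, and structure are assumptions, not the actual eventlogging-service code) of how a Tornado service forks one worker process per logical core; the puppet change below just bumps that number from 8 to 16:

import multiprocessing

import tornado.ioloop
import tornado.web
from tornado.httpserver import HTTPServer


class EventsHandler(tornado.web.RequestHandler):
    def post(self):
        # A real handler would validate the events and produce them to Kafka.
        self.set_status(201)


def main(num_processes=None):
    app = tornado.web.Application([(r'/v1/events', EventsHandler)])
    server = HTTPServer(app)
    server.bind(8085)  # port is illustrative
    # start(0) would mean "one process per detected CPU"; passing 16 pins it
    # to the logical core count (16 with HyperThreading on these hosts).
    server.start(num_processes or multiprocessing.cpu_count())
    tornado.ioloop.IOLoop.current().start()


if __name__ == '__main__':
    main(16)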

Change 389975 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] [EventBus] Increase number of worker processes to 16

https://gerrit.wikimedia.org/r/389975

+1 ^

There's also this old unmerged patch:
https://gerrit.wikimedia.org/r/#/c/302372/

However, when I tested at the time, it didn't show any performance improvements.

Change 389975 merged by Ottomata:
[operations/puppet@production] [EventBus] Increase number of worker processes to 16

https://gerrit.wikimedia.org/r/389975

Mentioned in SAL (#wikimedia-operations) [2017-11-08T14:55:09Z] <otto@tin> Started restart [eventlogging/eventbus@41e3418]: Bumping worker processes to 16: T180017

Mentioned in SAL (#wikimedia-operations) [2017-11-08T15:01:43Z] <otto@tin> Started restart [eventlogging/eventbus@41e3418]: Bumping worker processes to 16 on all targets: T180017


> There's also this old unmerged patch:
> https://gerrit.wikimedia.org/r/#/c/302372/
>
> However, when I tested at the time, it didn't show any performance improvements.

Ah! There might not be immediate, apparent perf improvements, but properly handling async events definitely helps when events are received at a rate that's higher than usual. IMHO, work on it should be continued.

Judging by the logs (on mwlog1001, not currently available in logstash), the timeouts did not disappear. Maybe we could consider increasing the request timeout?

> Judging by the logs (on mwlog1001, not currently available in logstash), the timeouts did not disappear. Maybe we could consider increasing the request timeout?

What is the current timeout set for POSTing to EventBus?

Seems like we don't specify it in mediawiki-config, so we're using the default of 5 seconds.
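To illustrate the mismatch (the real caller is the PHP EventBus extension; the endpoint and client below are made up): with a 5-second POST timeout on the MediaWiki side and a 10-second synchronous Kafka produce timeout on the service side, the client can give up and log a timeout while the service is still waiting on Kafka:

import requests

# Hypothetical endpoint; the real client is MediaWiki's HTTP client in PHP.
EVENTBUS_URL = 'http://eventbus.svc.eqiad.wmnet:8085/v1/events'


def post_events(events, timeout=5.0):
    # With the default 5s timeout the client aborts before the service's 10s
    # synchronous Kafka produce timeout has elapsed; raising the client
    # timeout to >= 10s is what the patch below does on the extension side.
    resp = requests.post(EVENTBUS_URL, json=events, timeout=timeout)
    resp.raise_for_status()
    return resp.status_code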

Change 391532 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Increase request timeout to match kafka produce timeout

https://gerrit.wikimedia.org/r/391532

Change 391532 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Increase request timeout to match kafka produce timeout

https://gerrit.wikimedia.org/r/391532

Change 391535 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@wmf/1.31.0-wmf.8] Increase request timeout to match kafka produce timeout

https://gerrit.wikimedia.org/r/391535

Change 391535 merged by jenkins-bot:
[mediawiki/extensions/EventBus@wmf/1.31.0-wmf.8] Increase request timeout to match kafka produce timeout

https://gerrit.wikimedia.org/r/391535

Mentioned in SAL (#wikimedia-operations) [2017-11-15T14:12:11Z] <zfilipin@tin> Synchronized php-1.31.0-wmf.8/extensions/EventBus/EventBus.php: SWAT: [[gerrit:391535|Increase request timeout to match kafka produce timeout (T180017)]] (duration: 00m 50s)

I've increased the request timeout in the Event-Platform extension to 10 seconds to match the sync timeout in Kafka, but it did not fix the timeout errors.

The weirdest part is that the events service never logs anything for these timeouts. I'm wondering if we could add some logging to the service when:

  • Event production to Kafka times out
  • The request is not accepted for > 1 second while waiting for available slots on Tornado (if the last one even makes any sense)

What do you think @Ottomata ?

https://groups.google.com/forum/#!msg/python-tornado/qqF5JQdP6XU/ZB8bI4AIU5sJ seems relevant to the problem.

That example seems to be more about a Tornado HTTP client rather than the server, but it might still be generally relevant to Tornado coroutines.

> I'm wondering if we could add some logging to the service

I'll see what I can do...

> Event production to Kafka times out

I'm pretty sure this will be logged, although it is pretty difficult to test this.

> The request is not accepted for > 1 second while waiting for available slots on Tornado (if the last one even makes any sense)

I'm just trying to add logging where the client closes the connection before the server responds, and not having much luck. on_connection_close should do it, but I think this might only happen if there are pending async coroutines waiting to finish.
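For reference, a minimal sketch of such a hook (handler and logger names are mine, not the eventlogging code); Tornado only calls on_connection_close if the client disconnects while a response is still pending:

import logging
import time

import tornado.web

log = logging.getLogger('eventbus-sketch')


class EventsHandler(tornado.web.RequestHandler):
    def prepare(self):
        self._started_at = time.time()

    def on_connection_close(self):
        # Fires only if the client hangs up while we still owe it a response,
        # e.g. MediaWiki hitting its own HTTP timeout while we wait on Kafka.
        log.warning('Client closed connection after %.2fs without a response (%s)',
                    time.time() - self._started_at, self.request.remote_ip)

    def post(self):
        # The real handler is a coroutine that yields on the Kafka produce;
        # on_connection_close only matters while that yield is still pending.
        self.set_status(201)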

I'm testing this change some more, and am wondering if we should just try it. We could deploy it to a single prod server and let it run for a bit to make sure it doesn't break anything. We'd also set async=True in the kafka producer options.
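Roughly, the idea behind async=True, sketched with plain kafka-python (names and settings here are illustrative, not the eventlogging writer itself): instead of blocking the IOLoop on send(...).get(timeout), register delivery callbacks on the returned future:

import logging

from kafka import KafkaProducer

log = logging.getLogger('eventbus-sketch')

producer = KafkaProducer(
    bootstrap_servers=['kafka1001.eqiad.wmnet:9092'],  # illustrative broker list
    acks='all',
)


def produce_async(topic, value):
    future = producer.send(topic, value)
    # kafka-python runs these callbacks on its own network thread, not on
    # Tornado's MainThread -- which is where the thread-safety bug below
    # eventually came from.
    future.add_callback(
        lambda md: log.debug('Delivered to %s[%d] @ offset %d',
                             md.topic, md.partition, md.offset))
    future.add_errback(
        lambda exc: log.error('Failed producing to %s: %r', topic, exc))
    return future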

Change 391634 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Try async kafka producer to fix http timeouts

https://gerrit.wikimedia.org/r/391634

Change 391634 merged by Ottomata:
[operations/puppet@production] Try async kafka producer to fix http timeouts

https://gerrit.wikimedia.org/r/391634

Ok, I've merged https://gerrit.wikimedia.org/r/#/c/302372/ and deployed it to kafka1001. I've also merged https://gerrit.wikimedia.org/r/#/c/391634/, which enables async kafka produce requests on kafka1001 only.

302372 also adds an on_connection_close handler to hopefully log when the HTTP client disconnects itself, but I wasn't able to get that handler to fire in my testing.

Ergh, I was hoping we could use this to see if the timeouts for requests to eventbus on kafka1001 would disappear... buuut since the MW log only knows about the LVS service, unless my on_connection_close handler fires we won't be able to tell whether kafka1001 no longer sees timeouts.

I guess we can let this run for a few days to make sure nothing bad happens, and then deploy and enable async=True for the other eventbus instances.

Mentioned in SAL (#wikimedia-operations) [2017-11-16T07:17:34Z] <ppchelko@tin> Started deploy [eventlogging/eventbus@872cfb3]: Revert gerrit 302372 due to AssertionError T180017

Mentioned in SAL (#wikimedia-operations) [2017-11-16T07:17:49Z] <ppchelko@tin> Finished deploy [eventlogging/eventbus@872cfb3]: Revert gerrit 302372 due to AssertionError T180017 (duration: 00m 14s)

@Ottomata I've reverted your change on kafka1001 as there's some AttributeError: 'NoneType' object has no attribute 'append' in the local logs.

Ok, good catch, thanks. Will figure that out next week then.

Change 393637 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Log generic Exceptions with stack trace in eventlogging/service.py

https://gerrit.wikimedia.org/r/393637

Change 393637 merged by Ottomata:
[eventlogging@master] Log generic Exceptions with stack trace in eventlogging/service.py

https://gerrit.wikimedia.org/r/393637
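The shape of that change, roughly (an assumed sketch; the real patch lives in eventlogging/service.py): pass exc_info=True so the logger records the full traceback rather than just the exception message:

import logging

log = logging.getLogger('eventlogging.service')


def handle_event(event, process_event):
    try:
        process_event(event)
    except Exception as e:
        # exc_info=True attaches the full traceback to the log record instead
        # of just the exception message, which is what finally exposed the
        # AttributeError below.
        log.error('Failed sending event %s. %s: %s',
                  event, type(e).__name__, e, exc_info=True)
        raise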

Mentioned in SAL (#wikimedia-operations) [2017-11-27T19:34:06Z] <otto@tin> Started deploy [eventlogging/eventbus@3df06ab]: Temp deploy to kafka1001 only to catch bug: T180017

Mentioned in SAL (#wikimedia-operations) [2017-11-27T19:34:19Z] <otto@tin> Finished deploy [eventlogging/eventbus@3df06ab]: Temp deploy to kafka1001 only to catch bug: T180017 (duration: 00m 12s)

I caught a stacktrace:

Nov 27 20:31:41 kafka1001 eventlogging-service-eventbus[12281]: (MainThread) Failed sending event <Event fa22bfcb-d3b1-11e7-a7f0-90b11c2788f0 of schema (u'mediawiki/recentchange', 1)>. AttributeError: 'NoneType' object has no attribute 'append'
Traceback (most recent call last):
  File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 262, in handle_events
    yield self.process_event(event)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 297, in wrapper
    Runner(result, future, yielded)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 962, in __init__
    if self.handle_yield(first_yielded):
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1104, in handle_yield
    self.future, lambda f: self.run())
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 596, in add_future
    lambda future: self.add_callback(callback, future))
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 267, in add_done_callback
    self._callbacks.append(fn)
AttributeError: 'NoneType' object has no attribute 'append'

Possibly related? Not sure:
https://github.com/tornadoweb/tornado/issues/1769

The Tornado version we have in prod does not have this fix.

Mentioned in SAL (#wikimedia-operations) [2017-11-27T22:21:31Z] <otto@tin> Started deploy [eventlogging/eventbus@e024af3]: deploy revert checked out at 3df06ab (we caught one). T180017

Mentioned in SAL (#wikimedia-operations) [2017-11-27T22:21:42Z] <otto@tin> Finished deploy [eventlogging/eventbus@e024af3]: deploy revert checked out at 3df06ab (we caught one). T180017 (duration: 00m 10s)

An interesting, but probably completely useless and unrelated, piece of information: I went through a lot of logs about these timeouts and here's the distribution of job types:

1292 total errors in the log file on mwlog1001
 762 cirrusSearchLinksUpdatePrioritized
 190 enqueue
 162 refreshLinks
  44 cirrusSearchIncomingLinkCount
   5 resource_change - this is not even from JobQueue, it's from RESTBase update events

So the fact that the majority of timeouts are for cirrusSearchLinksUpdatePrioritized is probably just a coincidence and not related.

So I've spent some time digging into the error stack trace that we're seeing, and it's indeed a thread-safety issue. The Kafka callback is actually run on a thread other than MainThread, so the ticket I found seems to be the reason.
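The gist of the fix, sketched under the assumption that the delivery callback completes a Tornado Future: Tornado's IOLoop and Futures are not thread-safe, so the only safe thing the kafka-python network thread can do is schedule work back onto the IOLoop via IOLoop.add_callback, its one documented thread-safe entry point:

from tornado.concurrent import Future
from tornado.ioloop import IOLoop


def produce(producer, topic, value):
    """Send one event and return a Tornado Future a coroutine can yield on."""
    io_loop = IOLoop.current()
    tornado_future = Future()

    def on_delivery(record_metadata):
        # Runs on the kafka-python network thread: hop back to the IOLoop
        # thread before touching the (non-thread-safe) Tornado Future.
        io_loop.add_callback(tornado_future.set_result, record_metadata)

    def on_error(exc):
        io_loop.add_callback(tornado_future.set_exception, exc)

    kafka_future = producer.send(topic, value)
    kafka_future.add_callback(on_delivery)
    kafka_future.add_errback(on_error)
    return tornado_future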

Change 395164 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[eventlogging@master] Make the kafka async deliver callback thread-safe

https://gerrit.wikimedia.org/r/395164

Change 395164 merged by Ottomata:
[eventlogging@master] Make the kafka async deliver callback thread-safe

https://gerrit.wikimedia.org/r/395164

Mentioned in SAL (#wikimedia-operations) [2017-12-05T18:24:19Z] <ppchelko@tin> Started deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. Limited to kafka1001. T180017

Mentioned in SAL (#wikimedia-operations) [2017-12-05T18:24:33Z] <ppchelko@tin> Finished deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. Limited to kafka1001. T180017 (duration: 00m 14s)

Change 395593 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Enable async for eventbus kafka producer

https://gerrit.wikimedia.org/r/395593

Change 395593 merged by Ottomata:
[operations/puppet@production] Enable async for eventbus kafka producer

https://gerrit.wikimedia.org/r/395593

Mentioned in SAL (#wikimedia-operations) [2017-12-05T19:36:15Z] <ppchelko@tin> Started deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. T180017

Mentioned in SAL (#wikimedia-operations) [2017-12-05T19:37:44Z] <ppchelko@tin> Finished deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. T180017 (duration: 01m 29s)

After the deployment of the above patches, we haven't seen any AttributeError exceptions and the rate of Event-Platform timeout logs has drastically decreased to almost zero, but it's still not zero :(

I've been digging into it a bit more to figure out the reason for the remaining timeouts, but I have not found anything suspicious or anything that provides more insight into the problem. I give up for now.

Most timeouts during normal operation seem to have been handled, but there is still the problem of timeouts during broker restarts that should not happen, e.g.:

Mar 19 14:41:36 kafka1001 eventlogging-service-eventbus[1043]: (eventlogging-0f2b4998-2859-11e8-b2fb-141877335242-kafka1001.eqiad.wmnet.1043-network-thread) Connection attempt to <BrokerConnection host=kafka1002.eqiad.wmnet/10.64.16.41 port=9092> timed out
Mar 19 14:41:36 kafka1001 eventlogging-service-eventbus[1043]: (eventlogging-0f2b4998-2859-11e8-b2fb-141877335242-kafka1001.eqiad.wmnet.1043-network-thread) Node 1002 connection failed -- refreshing metadata
Mar 19 14:41:36 kafka1001 eventlogging-service-eventbus[1043]: (MainThread) Failed sending event <Event 8e9fec78-2b83-11e8-8da2-c6cb766cf290 of schema (u'resource_change', 1)>. KafkaTimeoutError:
Traceback (most recent call last):
  File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 263, in handle_events
    yield self.process_event(event)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1021, in run
    yielded = self.gen.throw(*exc_info)
  File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 242, in process_event
    yield self.send(event)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 792, in callback
    result_list.append(f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
KafkaTimeoutError: KafkaTimeoutError: ('Batch containing %s record(s) expired due to timeout while requesting metadata from brokers for %s', 1, TopicPartition(topic='eqiad.resource_change', partition=0))
Mar 19 14:41:36 kafka1001 eventlogging-service-eventbus[1043]: (MainThread) Failed processing event: Failed sending event <Event 8e9fec78-2b83-11e8-8da2-c6cb766cf290 of schema (u'resource_change', 1)>. KafkaTimeoutError:

and https://logstash.wikimedia.org/app/kibana#/dashboard/5587ec70-d421-11e7-a2bf-bb774cde766e?_g=()&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:query,negate:!f,type:custom,value:'%7B%22bool%22:%7B%22should%22:%5B%7B%22match%22:%7B%22type%22:%7B%22query%22:%22cpjobqueue%22%7D%7D%7D,%7B%22match%22:%7B%22channel%22:%7B%22query%22:%22EventBus%22%7D%7D%7D,%7B%22match%22:%7B%22channel%22:%7B%22query%22:%22JobExecutor%22%7D%7D%7D%5D%7D%7D'),query:(bool:(should:!((match:(type:(query:cpjobqueue))),(match:(channel:(query:EventBus))),(match:(channel:(query:JobExecutor))))))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:channel,negate:!f,type:phrase,value:EventBus),query:(match:(channel:(query:EventBus,type:phrase))))),options:(darkTheme:!f),panels:!((col:1,id:Events-Over-Time,panelIndex:1,row:1,size_x:12,size_y:2,type:visualization),(col:1,id:Event-Types,panelIndex:2,row:3,size_x:12,size_y:2,type:visualization),(col:1,columns:!(level,wiki,channel,message),id:Channel-Events,panelIndex:6,row:5,size_x:12,size_y:5,sort:!('@timestamp',desc),type:search)),query:(query_string:(analyze_wildcard:!t,query:'*')),timeRestore:!f,title:'JobQueue%20EventBus',uiState:(P-1:(vis:(legendOpen:!f))),viewMode:view)

Change 420355 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set retries=3 for eventbus kafka producer

https://gerrit.wikimedia.org/r/420355

Change 420355 merged by Ottomata:
[operations/puppet@production] Set retries=3 for eventbus kafka producer

https://gerrit.wikimedia.org/r/420355
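In kafka-python terms, this roughly means the following producer settings (retries=3 is from the patch summary; the broker list and the other values are assumptions): a failed or expired batch gets re-sent a few times, which is usually enough to ride out a broker restart instead of surfacing a KafkaTimeoutError to MediaWiki:

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka1001.eqiad.wmnet:9092',   # broker list is illustrative
                       'kafka1002.eqiad.wmnet:9092'],
    acks='all',
    retries=3,                 # re-send a failed/expired batch up to 3 times
    retry_backoff_ms=200,      # pause between attempts (assumed value)
    request_timeout_ms=10000,  # give a restarting broker time to come back (assumed)
)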

Pchelolo added a subscriber: Krinkle.

Given that it reappeared in T204183, I'm inclined to raise the priority.

We are very actively moving to eventgate, so I don't think any work here is relevant anymore.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:09 PM