Change Details

Recently, a Kafka failure resulted in an API outage when Monolog was unable to make progress (see: {T125084}). With EventBus's service timeout of 5 seconds and a current throughput of ~20 messages/sec, there is probably little danger of exhausting the number of HHVM threads, as happened in T125084. However, throughput is only likely to increase with time, so it's worth thinking again about the failure scenarios and how we might be more resilient to them.

Ideas:

* Lowering the timeout further (at ~20 messages/sec, up to 100 blocked connections can accumulate in 5 seconds, 80 in 4 seconds, etc.).
* Failure detection: track service failures `$somewhere`, and back off when failures prevent making progress anyway.
* Harden eventlogging-service against Kafka failures.

(NOTE) //Note: There were [[https://logstash.wikimedia.org/#dashboard/temp/AVKdfzVRptxhN1Xajg0B|9 timeout errors]] (5 second timeout) between 2016-01-29T02:20:24.000Z and 2016-02-01T02:21:05.000Z.//
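To make the timeout arithmetic and the failure-detection idea a bit more concrete, here is a rough sketch in Python. It is not EventBus code: `blocked_connections`, `CircuitBreaker`, `send_event`, the `post` callable, and the thresholds are all hypothetical names and values chosen for illustration, and a real version would need to keep its failure state somewhere shared between requests.

```python
import time


def blocked_connections(rate_per_sec, timeout_sec):
    """Worst-case connections blocked at once if every request waits out the timeout."""
    return rate_per_sec * timeout_sec


class CircuitBreaker:
    """Hypothetical failure tracker that skips calls after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before backing off
        self.reset_after = reset_after    # seconds to wait before probing again
        self.clock = clock                # injectable so the logic can be tested
        self.failures = 0
        self.opened_at = None             # None means calls are allowed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the back-off window, let calls through again as a probe;
        # a failed probe re-opens the breaker via record_failure().
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def send_event(breaker, post, event):
    """Call post(event) unless the breaker is open; return True on success."""
    if not breaker.allow():
        return False  # fail fast instead of blocking a thread for the full timeout
    try:
        post(event)
    except Exception:
        breaker.record_failure()
        return False
    breaker.record_success()
    return True
```

With the numbers above, `blocked_connections(20, 5)` gives 100. The point of the breaker is that once the service is known to be failing, callers return immediately rather than each holding a thread until the timeout expires.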