Recently, a Kafka failure resulted in an API outage when Monolog was unable to make progress (see: T125084: MediaWiki monolog doesn't handle Kafka failures gracefully). With EventBus's service timeout of 5 seconds and a current throughput of roughly 20 messages/sec, there is probably little danger of exhausting the number of HHVM threads, as happened in T125084. However, throughput is only likely to increase with time, so it's worth thinking again about the failure scenarios and how we might be more resilient to them.
- Lower the timeout further (at ~20 messages/sec, up to 100 blocked connections can accumulate in 5 seconds, 80 in 4 seconds, etc.).
- Failure detection: track service failures $somewhere, and back off when failures would prevent making progress anyway.
- Harden eventlogging-service against Kafka failures.
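
To make the first two options concrete, here is a rough Python sketch. It is not EventBus code: `max_blocked_connections` and `FailureTracker`, along with the threshold and cooldown values, are hypothetical names and numbers chosen for illustration. The first function reproduces the timeout arithmetic above. The class shows one possible shape for failure detection with back-off (a simple circuit breaker). It keeps state in-process only; a real deployment would have to share that state $somewhere across requests.

```python
import time


def max_blocked_connections(rate_per_sec, timeout_sec):
    """Upper bound on connections left waiting if every call runs into the timeout."""
    return rate_per_sec * timeout_sec


class FailureTracker:
    """Hypothetical circuit breaker: stop calling the service after repeated failures."""

    def __init__(self, threshold=3, cooldown_sec=30.0, clock=time.monotonic):
        self.threshold = threshold        # consecutive failures before backing off
        self.cooldown_sec = cooldown_sec  # how long to skip the service once tripped
        self.clock = clock                # injectable so the logic can be tested
        self.failures = 0
        self.opened_at = None             # time the breaker tripped, or None

    def allow_request(self):
        """Return True if the caller should attempt to reach the service."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, attempts are allowed again;
        # another failure re-trips the breaker immediately.
        return self.clock() - self.opened_at >= self.cooldown_sec

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


if __name__ == "__main__":
    for timeout in (5, 4, 1):
        print(timeout, "s timeout:", max_blocked_connections(20, timeout), "blocked")
```

A caller would check `allow_request()` before posting an event, drop or defer the event when it returns `False`, and report the outcome with `record_success()` or `record_failure()`. The trade-off is that events are lost or delayed while the breaker is open, in exchange for not tying up a thread for the full timeout on every request.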