
Ensure that EventBus extension gracefully handles service failures
Open, Normal, Public

Description

Recently, a Kafka failure resulted in an API outage when Monolog was unable to make progress (see T125084: MediaWiki monolog doesn't handle Kafka failures gracefully). With EventBus's service timeout of 5 seconds and a current throughput of ~20 messages/sec, there is probably little danger of exhausting the HHVM threads, as happened in T125084. However, throughput is only likely to increase over time, so it is worth revisiting the failure scenarios and how we might be more resilient to them.
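As a rough illustration of the numbers above, the worst case during an outage can be estimated as arrival rate times timeout. This is only a back-of-the-envelope sketch: it assumes a steady arrival rate and that every request blocks for the full timeout, and the function name is made up for this example.

```python
def blocked_connections(rate_per_sec, timeout_sec):
    """Estimate how many requests are blocked at once during an outage.

    Assumes a steady arrival rate and that every request waits for the
    full timeout before giving up (a simplification).
    """
    return rate_per_sec * timeout_sec


# Current figures from this task: ~20 messages/sec, 5 second timeout.
print(blocked_connections(20, 5))
# The same timeout at a hypothetical 10x throughput.
print(blocked_connections(200, 5))
```

The estimate scales linearly with both throughput and timeout, so a tenfold increase in throughput would need a tenfold shorter timeout to keep the same number of blocked threads.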

Ideas:

  • Lowering the timeout further (e.g. at ~20 messages/sec, up to 100 blocked connections can accumulate with a 5 second timeout, 80 with a 4 second timeout, etc.).
  • Failure detection: track service failures $somewhere, and back off when failures prevent making progress anyway
  • Harden eventlogging-service against Kafka failures

Note: There were 9 timeout errors (5 second timeout) between 2016-01-29T02:20:24.000Z and 2016-02-01T02:21:05.000Z.
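
The failure-detection idea could look something like the circuit breaker sketched below. This is not EventBus code: the class, the `send_event` helper, and the thresholds are hypothetical, and the state is kept in process memory, whereas a real deployment would need to share it across requests (the `$somewhere` above).

```python
import time


class CircuitBreaker:
    """Skip calls to a failing service until a cool-down period has passed."""

    def __init__(self, max_failures=3, cooldown_sec=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.cooldown_sec = cooldown_sec  # how long to back off once tripped
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # time the breaker tripped, or None

    def allow(self):
        """Return True if a request should be attempted."""
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, requests are attempted again;
        # a further failure re-trips the breaker immediately.
        return self.clock() - self.opened_at >= self.cooldown_sec

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def send_event(breaker, post, event):
    """Try to deliver `event` via `post`; skip the call while backing off."""
    if not breaker.allow():
        return False
    try:
        post(event)
    except Exception:
        breaker.record_failure()
        return False
    breaker.record_success()
    return True
```

While the breaker is open, callers return immediately instead of holding a thread for the full timeout, which is the property that matters for the thread-exhaustion scenario. The trade-off is that events are dropped during the back-off period unless they are queued elsewhere.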

Event Timeline

Eevans created this task. Feb 1 2016, 3:30 PM
Eevans claimed this task.
Eevans raised the priority of this task from to Normal.
Eevans updated the task description.
Eevans added projects: EventBus, Analytics, Services.
Eevans added subscribers: Eevans, Ottomata.
Restricted Application added a subscriber: Aklapper. Feb 1 2016, 3:30 PM
Eevans updated the task description. Feb 1 2016, 3:53 PM
Eevans set Security to None.
Eevans updated the task description.
Milimetric moved this task from Analytics Query Service to Radar on the Analytics board.
Pchelolo moved this task from Backlog to later on the Services board. Oct 12 2016, 6:49 PM
Pchelolo edited projects, added Services (later); removed Services.