
Ensure that EventBus extension gracefully handles service failures
Open, Medium, Public

Description

Recently, a Kafka failure resulted in an API outage when Monolog was unable to make progress (see T125084: MediaWiki monolog doesn't handle Kafka failures gracefully). With EventBus's service timeout of 5 seconds and a current throughput of ~20 messages/sec, there is probably little danger of exhausting the number of HHVM threads, as happened in T125084. However, throughput is only likely to increase with time, and it's worth thinking again about the failure scenarios and how we might be more resilient to them.
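
To make the thread-exhaustion concern concrete, here is a rough back-of-the-envelope sketch. It assumes a total outage in which every request blocks for the full timeout, so the number of requests blocked at any moment is roughly rate × timeout. The function name is hypothetical and the model ignores retries and bursts.

```python
def blocked_requests(rate_per_sec: float, timeout_sec: float) -> float:
    """Estimate requests blocked at once when every call waits the full timeout.

    Assumption: steady arrival rate and a complete service outage.
    """
    return rate_per_sec * timeout_sec


if __name__ == "__main__":
    # ~20 messages/sec is the throughput quoted in the description.
    for timeout in (5, 4, 1, 0.5):
        print(f"timeout={timeout}s -> ~{blocked_requests(20, timeout):.0f} blocked")
```

Under the same assumption, a tenfold increase in throughput at the current 5 second timeout would tie up about 1000 threads, which is why the timeout matters more as traffic grows.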

Ideas:

  • Lowering the timeout further (at ~20 messages/sec, up to 100 blocked connections can accumulate in 5 seconds, 80 in 4 seconds, etc.).
  • Failure detection: track service failures $somewhere, and back off when failures prevent making progress anyway
  • Harden eventlogging-service against Kafka failures
Note: There were 9 timeout errors (5 second timeout) between 2016-01-29T02:20:24.000Z and 2016-02-01T02:21:05.000Z.
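
One possible shape for the failure-detection idea is a simple circuit breaker, sketched here in Python for illustration only. All names (`CircuitBreaker`, `send_event`, `max_failures`, `reset_after`) are hypothetical and not part of EventBus. The sketch keeps state in memory; a real MediaWiki deployment would need the counters stored somewhere shared across requests, which is the open "$somewhere" question above.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Hypothetical failure tracker: stop calling a service that keeps failing."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a request may be attempted right now."""
        if self.opened_at is None:
            return True
        # After the cool-down, let trial requests through again.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        """Close the breaker and forget past failures."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker at the threshold."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def send_event(breaker: CircuitBreaker, post: Callable[[Any], None],
               event: Any) -> bool:
    """Try to deliver one event; skip the call entirely while the breaker is open."""
    if not breaker.allow():
        return False  # back off: no thread is tied up waiting on a timeout
    try:
        post(event)
    except Exception:  # broad on purpose in this sketch
        breaker.record_failure()
        return False
    breaker.record_success()
    return True
```

While the breaker is open, callers return immediately instead of each waiting out the full timeout, so a sustained outage costs a handful of blocked requests per cool-down period rather than rate × timeout. What to do with skipped events (drop, log, or queue) is left open here.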

Event Timeline

Eevans created this task. · Feb 1 2016, 3:30 PM
Eevans claimed this task.
Eevans raised the priority of this task from to Medium.
Eevans updated the task description.
Eevans added subscribers: Eevans, Ottomata.
Restricted Application added a subscriber: Aklapper. · Feb 1 2016, 3:30 PM
Eevans updated the task description. · Feb 1 2016, 3:53 PM
Eevans set Security to None.
Eevans updated the task description.
Milimetric moved this task from Analytics Query Service to Radar on the Analytics board.
Pchelolo moved this task from Backlog to later on the Services board. · Oct 12 2016, 6:49 PM
Pchelolo edited projects, added Services (later); removed Services.
Ottomata moved this task from Backlog to Radar on the Event-Platform board. · Apr 14 2020, 1:26 PM
Aklapper edited projects, added Analytics-Radar; removed Analytics. · Jun 10 2020, 6:33 AM
Restricted Application edited projects, added Analytics; removed Analytics-Radar. · Jun 10 2020, 6:33 AM
Aklapper edited projects, added Analytics-Radar; removed Analytics. · Jun 10 2020, 6:36 AM
Restricted Application edited projects, added Analytics; removed Analytics-Radar. · Jun 10 2020, 6:36 AM
Aklapper edited projects, added Analytics-Radar; removed Analytics. · Jun 10 2020, 6:41 AM
Aklapper removed Eevans as the assignee of this task. · Jun 19 2020, 4:23 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)