2016-10-17 API cluster overload
Closed, Resolved, Public


This task tracks the investigation of a partial outage of the MediaWiki API cluster, which started at 15:18 on 2016-10-17 and ended at XXX.

What happened:

  • Requests piled up on individual app servers.
  • App servers eventually became unresponsive and were depooled.
  • The number of pooled API backends shrank until it could no longer meet demand.
  • Users were served 5xx errors.
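The sequence above is a depool cascade: each depooled server pushes its share of traffic onto the remaining ones. Below is a minimal toy sketch of that feedback loop in Python. Everything in it is an assumption made for illustration: the function name `simulate_depool_cascade`, the even split of demand across pooled servers, the backlog limit, and all numbers. None of it is taken from the actual incident data or from the real load balancer configuration.

```python
def simulate_depool_cascade(capacities, demand, backlog_limit, max_ticks=50):
    """Toy model: return the number of pooled servers after each tick.

    capacities    -- requests each server can complete per tick (hypothetical)
    demand        -- total incoming requests per tick, split evenly (assumption)
    backlog_limit -- backlog above which a server counts as unresponsive
                     and is depooled (hypothetical threshold)
    """
    # Each pooled server is tracked as (capacity, current backlog).
    pool = [(cap, 0.0) for cap in capacities]
    history = []
    for _ in range(max_ticks):
        if not pool:
            break  # nothing left to serve traffic: every request fails
        share = demand / len(pool)
        # Requests pile up wherever the share exceeds what a server can finish.
        updated = [(cap, max(0.0, backlog + share - cap)) for cap, backlog in pool]
        # Servers whose backlog is over the limit are depooled and never return.
        pool = [(cap, backlog) for cap, backlog in updated if backlog <= backlog_limit]
        history.append(len(pool))
    return history


# One slow server out of five is enough to tip the rest over in this toy setup.
overloaded = simulate_depool_cascade([100, 100, 100, 100, 50], demand=450, backlog_limit=100)
# With more headroom the pool stays intact.
healthy = simulate_depool_cascade([100] * 5, demand=400, backlog_limit=100)
print(overloaded)
print(healthy)
```

In the overloaded run the slow server is depooled first, after which the per-server share exceeds the capacity of the remaining servers and the pool empties. In this toy model, servers are never repooled, so the pool size can only shrink.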

Event Timeline

Backtrace from mw1194: P4264

elukey triaged this task as High priority. Oct 20 2016, 1:30 PM

So, this never happened (this == the investigation). It is now almost 4 months later, with a new year in between. There was no incident report written and nothing obvious from SAL:

This task was referenced in T151702#2828577.

I can't view that paste :/

@fgiunchedi thoughts on whether we should keep this task open any longer? Anything you think we can still do at this point?

fgiunchedi claimed this task.

@greg I think it is safe to close: we've mitigated the issue by having a separate cluster for async processing in T151702 and related tasks.