2016-10-17 API cluster overload
Closed, Resolved (Public)

Description

This task tracks the investigation of a partial outage of the MediaWiki API cluster, which started at 15:18 on 2016-10-17 and ended at XXX.

What happened:

  • Requests piled up on individual app servers.
  • App servers eventually became unresponsive and were depooled.
  • The number of pooled API backends shrank until it could no longer meet demand.
  • Users were served 5xx errors.
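
The sequence above can be illustrated with a toy model. This is only a sketch under made-up assumptions, not a reconstruction of the actual incident: the function name, server counts, capacities, and demand figures are all hypothetical, and it assumes traffic is spread evenly across pooled servers and that each overloaded step costs exactly one server.

```python
def simulate_cascade(servers, capacity_per_server, demand, initially_depooled=0):
    """Return the pooled-server count at each step of a toy depool cascade.

    All inputs are hypothetical illustration values, not incident data.
    """
    pooled = servers - initially_depooled
    history = [pooled]
    while pooled > 0:
        # Assumption: the load balancer spreads demand evenly over pooled servers.
        load_per_server = demand / pooled
        if load_per_server <= capacity_per_server:
            break  # remaining servers can absorb the traffic; the pool is stable
        # Assumption: an overloaded pool loses one server per step
        # (it becomes unresponsive and is depooled).
        pooled -= 1
        history.append(pooled)
    return history


# Losing one of ten servers leaves enough headroom in this example.
print(simulate_cascade(10, 100, 850, initially_depooled=1))
# Losing two pushes every remaining server over capacity, and the pool drains.
print(simulate_cascade(10, 100, 850, initially_depooled=2))
```

In this model the pool either settles immediately or drains completely, which matches the shape of the failure described: once the remaining backends cannot meet demand, each further depool makes the overload worse for the rest.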

Event Timeline

Backtrace from mw1194: P4264

elukey triaged this task as High priority. Oct 20 2016, 1:30 PM

So, this (the investigation) never happened. It is now almost 4 months later, with a new year in between. No incident report was written, and there is nothing obvious in SAL: https://tools.wmflabs.org/sal/production?p=0&q=&d=2016-10-17

This task was referenced in T151702#2828577.

> Backtrace from mw1194: P4264

I can't view that paste :/

@fgiunchedi thoughts on whether we should keep this task open any longer? Anything you think we can still do at this point?

fgiunchedi claimed this task.

@greg I think it is safe to close. We've mitigated the issue by setting up a separate cluster for async processing in T151702 and related tasks.