14 March 2021 Wikimedia API Outage
Open, Medium, Public

Description

Follow-up task to the API server outage on 14 March 2021, between approximately 17:00 and 17:26 UTC.

High rates of reads on s4 (commonswiki) caused db1144 to fall over and the API servers to run out of php-fpm workers.

This also brought down any third-party wiki using InstantCommons, or just the ones with rubbish timeouts set.

Event Timeline

This also brought down any third party wiki using Instant Commons.

The wikis actually went down? Or just couldn't serve new images or...?

Completely down. Requests timed out at first, and then for us everything went down because the timeouts upset our health checks.

The database host that got overloaded was db1144, which serves the following groups:

  • contributions
  • recentchanges
  • recentchangeslinked
  • watchlist
  • logpager

Traffic: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=37&orgId=1&var-server=db1144&var-port=13314&from=1615738747603&to=1615745074190

Captura de pantalla 2021-03-14 a las 19.42.31.png (1×2 px, 327 KB)

Can you file a separate task for this? I would have expected InstantCommons to have reasonable timeouts if it can't reach Commons for whatever reason.
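To illustrate what "reasonable timeouts" could mean for a client of a remote file repository, here is a minimal sketch. It is not InstantCommons code (that lives in MediaWiki and is written in PHP); the names `CircuitBreaker` and `fetch_remote_info`, and the default values, are hypothetical. The idea is that every remote call gets a short timeout, and after a few consecutive failures the client stops calling the remote for a cooldown period, so local page views fail open instead of tying up workers.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Skip a failing remote for `cooldown_s` seconds after
    `max_failures` consecutive errors (hypothetical helper)."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: let calls through again, but a single
            # further failure reopens the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def fetch_remote_info(fetch: Callable[[str, float], Any], title: str,
                      breaker: CircuitBreaker,
                      timeout_s: float = 2.0) -> Optional[Any]:
    """Return remote file info, or None if the remote is slow or down.

    `fetch(title, timeout_s)` is whatever performs the HTTP request; it is
    expected to raise OSError (which includes TimeoutError) on failure.
    """
    if not breaker.allow():
        return None  # fail open: render the page without the remote file
    try:
        result = fetch(title, timeout_s)
    except OSError:
        breaker.record(False)
        return None
    breaker.record(True)
    return result
```

With something along these lines, a remote outage costs each request at most `timeout_s` until the breaker opens, and nothing at all during the cooldown.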

jijiki triaged this task as Medium priority. Mar 29 2021, 5:27 PM
Marostegui added a subscriber: LSobanski.

I am removing the DBA tag from this task, as there's not much else for us to do here.
Please ping me or @LSobanski if anything else is needed from Data Persistence.