
sleeper database connection surges during outage
Closed, Resolved · Public

Description

During the outage the database servers saw:

  • Abrupt drop in load
  • Abrupt spike in load
  • A few boxes hit max_connections

The first two are just effects of the outage, with stuff all over the place and memcached issues pushing load back on the databases. For the most part, the databases were heavily loaded but ok.

However, the connection issues were due to stale, sleeping connections using up slots, not due to active legitimate load. We saw:

  • A bunch of stale connections dating from the initial switch outage time.
  • Waves of new but instantly sleeping connections as Ops rebooted things.
  • An instant recovery to normal connection levels when the logstash problem was addressed.

Evidently whatever was causing the clients to wait on logstash was also causing the DB connections to sleep and mount up. Fair enough; it's an outage and unpredictable, so we probably need to respond to this scenario on the DB side somehow. During the outage I used pt-kill to repeatedly flush out the sleeping connections in order to maintain headroom for the active ones.
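
The exact pt-kill invocation isn't recorded here, so the following is only a sketch of that kind of repeated flush; the host, user, and thresholds are illustrative assumptions, not the values used during the outage:

```
# Hedged sketch: repeatedly kill idle connections while leaving active queries alone.
#   --match-command Sleep   only target connections in the Sleep state
#   --idle-time 60          only if they have been idle for at least 60 seconds
#   --victims all           kill every matching connection, not just the oldest
#   --interval 10           re-check the processlist every 10 seconds
#   --print --kill          log each KILL statement and actually execute it
pt-kill --host db1001.example --user root --ask-pass \
        --match-command Sleep --idle-time 60 --victims all \
        --interval 10 --print --kill
```

Running the same command with --print but without --kill is a safe dry run that only reports what would be killed.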

Some areas for improvement:

  • The MariaDB 10 boxes with thread pool all handled the outage better than the remaining MariaDB 5.5 boxes. The thread pool kept system load reasonable and the higher max_connections setting took longer to exhaust. We should definitely push ahead and finish the MariaDB 10 migration.
  • The slaves all have events[1] running to watch for sleeping connections at 300s, but that delay was too long in this case. The automated killing could be enhanced to be more brutal when sleepers appear en masse or above some arbitrary rate (a sketch of the idea follows the footnote below). Needs testing.
  • Looks like we could safely increase max_connections on the MariaDB 10 boxes still further without causing problems. With threads capped it's just a matter of file handles, and it would be handy to have more headroom.

[1] https://git.wikimedia.org/blob/operations%2Fsoftware/76ebb8ad3e1f9c5cd7879d654d19d96b33b8f33b/dbtools%2Fevents_coredb_slave.sql
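
A minimal sketch of the surge-detection idea from the second bullet above, assuming illustrative thresholds (more than 500 sleepers in total, 60 seconds idle) that would still need the testing mentioned there. It only generates the KILL statements; the existing event at [1] does the actual killing on a schedule, with the fixed 300s threshold:

```sql
-- Emit KILL statements for sleeping connections idle for more than 60 seconds,
-- but only once sleepers have piled up en masse (more than 500 of them).
-- Both thresholds are illustrative assumptions, not tested values.
SELECT CONCAT('KILL ', id, ';') AS kill_stmt
FROM information_schema.processlist
WHERE command = 'Sleep'
  AND time > 60
  AND (SELECT COUNT(*)
       FROM information_schema.processlist
       WHERE command = 'Sleep') > 500;
```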

Event Timeline

Springle created this task. Feb 6 2015, 2:07 AM
Springle claimed this task.
Springle raised the priority of this task to Normal.
Springle updated the task description. (Show Details)
Springle added a subscriber: Springle.
Restricted Application added a subscriber: Aklapper. Feb 6 2015, 2:07 AM
Springle renamed this task from "Stale database connections during outage" to "sleeper database connection surges during outage". Feb 6 2015, 2:08 AM
Springle set Security to None.
Springle added subscribers: faidon, akosiaris, Joe.
RandomDSdevel added a subscriber: RandomDSdevel.
Restricted Application added a subscriber: Matanya. Aug 28 2015, 4:28 AM
Krenair added a subscriber: Krenair.
Krinkle moved this task from Triage to Backlog on the DBA board. Sep 23 2015, 4:27 AM
Krinkle moved this task from Backlog to Triage on the DBA board. Sep 23 2015, 7:07 AM
jcrespo closed this task as Resolved. Jul 29 2016, 6:45 AM

This is resolved:

  1. All important boxes are using MariaDB 10 and the thread pool (pool-of-threads)
  2. There are some watchdogs in place that kill long-running connections
  3. max_connections was increased to 10000 not long ago

All of these will have to be improved and refined even more, but the immediate issues were solved a long time ago. As evidence of that: thanks to the measures in place, almost nobody noticed the issues caused by T140108.
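
For reference, a minimal my.cnf sketch of the settings this relies on. The variable names are real MariaDB options, but apart from max_connections = 10000 (mentioned above) the values are illustrative assumptions, not WMF's production configuration:

```
[mysqld]
# MariaDB 10 thread pool instead of one thread per connection
thread_handling  = pool-of-threads
# roughly one thread group per CPU core (illustrative)
thread_pool_size = 32
# raised connection headroom for sleeper surges (value from the comment above)
max_connections  = 10000
# with threads capped, extra connections mostly cost file handles (illustrative value)
open_files_limit = 200000
```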