During the outage the database servers saw:
- Abrupt drop in load
- Abrupt spike in load
- A few boxes hit `max_connections`
The first two are just side effects of the outage, with services disrupted across the board and memcached issues pushing load back onto the databases. For the most part, the databases were heavily loaded but OK.
However, the connection issues were caused by stale, sleeping connections using up slots, not by legitimate active load. We saw:
- A bunch of stale connections dating from the initial switch outage time.
- Waves of new but instantly sleeping connections as Ops rebooted things.
- An instant recovery to normal connection levels when the logstash problem was addressed.
Evidently whatever was causing the clients to wait on logstash was also causing the DB connections to sleep and mount up. Fair enough; it's an outage and unpredictable, so we probably need to respond to this scenario on the DB side somehow. During the outage I used pt-kill to repeatedly flush out the sleeping connections in order to maintain headroom for the active ones.
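For illustration, the manual flushing amounts to something like the sketch below. It is not how pt-kill is implemented; the function name, the 60-second cutoff and the sample rows are hypothetical, and the rows only mimic the `Id`, `Command` and `Time` columns of `SHOW PROCESSLIST`.

```python
def sleepers_to_kill(processlist, min_idle_seconds):
    """Return KILL statements for connections asleep at least min_idle_seconds.

    Each row is a dict shaped like a SHOW PROCESSLIST row; only the
    Id, Command and Time columns are consulted.
    """
    statements = []
    for row in processlist:
        # Active queries are left alone; only idle (Sleep) threads are candidates.
        if row["Command"] == "Sleep" and row["Time"] >= min_idle_seconds:
            statements.append(f"KILL {int(row['Id'])};")
    return statements


# Hypothetical snapshot: one busy query, one fresh sleeper, one stale sleeper.
sample = [
    {"Id": 10, "Command": "Query", "Time": 120},
    {"Id": 11, "Command": "Sleep", "Time": 5},
    {"Id": 12, "Command": "Sleep", "Time": 900},
]

for statement in sleepers_to_kill(sample, min_idle_seconds=60):
    print(statement)
```

Running this repeatedly against a fresh snapshot frees slots held by sleepers without touching connections that are doing real work.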
Some areas for improvement:
- The MariaDB 10 boxes with thread pool all handled the outage better than the remaining MariaDB 5.5 boxes. The thread pool kept system load reasonable and the higher `max_connections` setting took longer to exhaust. We should definitely push ahead and finish the MariaDB 10 migration.
- The slaves all have events[1] running that kill connections sleeping for longer than 300s, but that delay was too long in this case. The automated killing could be made more aggressive when sleepers appear en masse or arrive faster than some threshold rate. Needs testing.
- Looks like we could safely increase `max_connections` on the MariaDB 10 boxes still further without causing problems. With threads capped it's just a matter of file handles, and it would be handy to have more headroom.
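One possible shape for the more aggressive sleeper killing is an idle threshold that shrinks as sleepers eat into `max_connections`. This is a minimal sketch of that idea, not the logic of the existing events: only the 300s base comes from the current setup, while the function name, the 10s floor and the 50% pressure point are assumptions that would need testing.

```python
def idle_threshold(sleeping, max_connections,
                   base_seconds=300, floor_seconds=10, pressure_start=0.5):
    """Return how long (seconds) a connection may sleep before being killed.

    Below pressure_start (fraction of max_connections held by sleepers)
    the normal base_seconds applies. Above it, the threshold falls
    linearly, reaching floor_seconds when sleepers fill every slot.
    """
    fraction = min(sleeping / max_connections, 1.0)
    if fraction <= pressure_start:
        return base_seconds
    # 0.0 at the pressure point, 1.0 when all slots are sleeping.
    pressure = (fraction - pressure_start) / (1.0 - pressure_start)
    return int(round(base_seconds - pressure * (base_seconds - floor_seconds)))


# Hypothetical box with max_connections = 1000.
for sleeping in (100, 500, 750, 1000):
    print(sleeping, idle_threshold(sleeping, 1000))
```

A linear ramp is only one choice; a rate-based trigger (new sleepers per second) would react faster to waves of reconnects but needs state between checks.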
[1] https://git.wikimedia.org/blob/operations%2Fsoftware/76ebb8ad3e1f9c5cd7879d654d19d96b33b8f33b/dbtools%2Fevents_coredb_slave.sql