Apparently, the last deployment of Parsoid caused a database outage on the enwiki API database servers:
<logmsgbot> !log arlolra@tin Starting deploy [parsoid/deploy@0df8628]: Updating Parsoid to 6719e240
<stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
<grrrit-wm> (PS2) Giuseppe Lavagetto: api: add TLS termination in eqiad [puppet] - https://gerrit.wikimedia.org/r/327541
<grrrit-wm> (CR) Jcrespo: [C: 2] mariadb: Move role::mariadb::client to a separate file [puppet] - https://gerrit.wikimedia.org/r/327556 (https://phabricator.wikimedia.org/T150850) (owner: Jcrespo)
<icinga-wm> RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
<icinga-wm> PROBLEM - MariaDB Slave IO: s1 on db1065 is CRITICAL: CRITICAL slave_io_state could not connect
<jynus> that looks bad
<robh> =[
<icinga-wm> RECOVERY - MariaDB Slave IO: s1 on db1065 is OK: OK slave_io_state Slave_IO_Running: Yes
<logmsgbot> !log arlolra@tin Finished deploy [parsoid/deploy@0df8628]: Updating Parsoid to 6719e240 (duration: 07m 56s)
<stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
<arlolra> !log Updated Parsoid to 6719e240 (T96555)
<stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
<stashbot> T96555: Bug tokenizing commented <ref> - https://phabricator.wikimedia.org/T96555
https://logstash.wikimedia.org/goto/c80ee605942700af7442656e9b051296
Most of the API queries at that time seem to be contributions-related:
Hits  Tmax  Tavg  Tsum    Hosts            Users     Schemas
406   323   81    33,076  db1065, db1066   wikiuser  enwiki

SELECT /* ApiQueryContributors::execute */
    rev_page AS `page`,
    rev_user AS `user`,
    MAX(rev_user_text) AS `username`
FROM `revision`
WHERE rev_page = '9228'
    AND (rev_user != 0)
    AND ((rev_deleted & 4) = 0)
GROUP BY rev_page, rev_user
ORDER BY rev_user
LIMIT 501
/* 844ceca7dab690eb120535782678485f db1066 enwiki 23s */
The API database servers at the time seem to have been stuck executing >5000 queries simultaneously:
https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1065&from=1481825322195&to=1481828922196
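For future incidents, this kind of pile-up can also be confirmed live on the server itself; a minimal sketch using standard MySQL/MariaDB statements (not the exact commands run during this incident):

```sql
-- Number of threads actively executing a statement; spikes during a pile-up
SHOW GLOBAL STATUS LIKE 'Threads_running';

-- List every open connection and the statement it is stuck on,
-- with the full (untruncated) query text
SHOW FULL PROCESSLIST;
```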
When MySQL reaches its max_connections limit, all new connections are rejected, which is effectively an outage for that server.
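The configured ceiling and how close a server is to it can be inspected with standard MySQL/MariaDB statements; a sketch (the actual limit configured on db1065/db1066 is not stated in this report):

```sql
-- The configured connection ceiling
SHOW GLOBAL VARIABLES LIKE 'max_connections';

-- Connections currently open, and the peak seen since server start
SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_connected', 'Max_used_connections');
```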
Bringing down a database server is something that should not be taken lightly, and a much worse outage could have happened if both servers had been saturated at the same time rather than only one.