dbproxy1005 at 5:07 marked db1009 as down as a result of a huge load spike:
05:07 < icinga-wm> PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1
As db1009 looked up, I reloaded dbproxy1009 and it started to get heavily pounded.
db1009 (m5 master) is being overloaded since 06:10AM (as soon as HAproxy was reloaded as it marked db1009 as down, probably because of the same overload)
This is getting the server to reach "Too many connections". I am not sure if this is a cause or a consecuence.
Most of the connections are coming to the nova database and they are sleeping
root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show processlist;" | grep nova | grep -i sleep | wc -l 233 root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show global variables like 'max_connections'" +-----------------+-------+ | Variable_name | Value | +-----------------+-------+ | max_connections | 500 | +-----------------+-------+
Most of the queries are coming from:
- labcontrol1001
- labnet1001
HW doesn't look broken:
root@db1009:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0 Size : 1.633 TB Sector Size : 512 Mirror Data : 1.633 TB State : Optimal Strip Size : 256 KB Number Of Drives per span:2 Span Depth : 6 Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU root@db1009:~# megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up
There are some disks with errors, but those could be old
root@db1009:~# megacli -PDList -aALL |grep "Media" Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 16 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 85 Media Type: Hard Disk Device Media Error Count: 3 Media Type: Hard Disk Device Media Error Count: 18 Media Type: Hard Disk Device Media Error Count: 92 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 111 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 120 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device
page that @madhuvishy and then @chasemp responded to:
https://lists.wikimedia.org/pipermail/cloud-admin-feed/2018-March/000007.html
https://lists.wikimedia.org/pipermail/cloud-admin-feed/2018-March/000008.html