dbproxy1005 at 5:07 marked db1009 as down as a result of a huge load spike:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519865294031&to=1519886894031&panelId=7&fullscreen
```
05:07 < icinga-wm> PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1
```
As db1009 looked up, I reloaded dbproxy1009 and it started to get heavily pounded.
db1009 (m5 master) is being overloaded since 06:10AM (as soon as HAproxy was reloaded as it marked db1009 as down, probably because of the same overload)
CPU:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864556811&to=1519886156812&panelId=7&fullscreen
Connections:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864572622&to=1519886172622&panelId=15&fullscreen
Disk utilization:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864586113&to=1519886186113&panelId=19&fullscreen
This is getting the server to reach "Too many connections". I am not sure if this is a cause or a consecuence.
Most of the connections are coming to the nova database and they are sleeping
```
root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show processlist;" | grep nova | grep -i sleep | wc -l
233
root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show global variables like 'max_connections'"
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 500 |
+-----------------+-------+
```
Most of the queries are coming from:
- labcontrol1001
- labnet1001
HW doesn't look broken:
```
root@db1009:~# megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 1.633 TB
Sector Size : 512
Mirror Data : 1.633 TB
State : Optimal
Strip Size : 256 KB
Number Of Drives per span:2
Span Depth : 6
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
root@db1009:~# megacli -PDList -aALL |grep "Firmware state:"
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
```
There are some disks with errors, but those could be old
```
root@db1009:~# megacli -PDList -aALL |grep "Media"
Media Error Count: 0
Media Type: Hard Disk Device
Media Error Count: 16
Media Type: Hard Disk Device
Media Error Count: 0
Media Type: Hard Disk Device
Media Error Count: 85
Media Type: Hard Disk Device
Media Error Count: 3
Media Type: Hard Disk Device
Media Error Count: 18
Media Type: Hard Disk Device
Media Error Count: 92
Media Type: Hard Disk Device
Media Error Count: 0
Media Type: Hard Disk Device
Media Error Count: 111
Media Type: Hard Disk Device
Media Error Count: 0
Media Type: Hard Disk Device
Media Error Count: 120
Media Type: Hard Disk Device
Media Error Count: 0
Media Type: Hard Disk Device
```