Change Details

labsdb1009 (m5 master) is being overloaded since 06:10AM. CPU: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864556811&to=1519886156812&panelId=7&fullscreen Connections: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864572622&to=1519886172622&panelId=15&fullscreen Disk utilization: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864586113&to=1519886186113&panelId=19&fullscreen This is getting the server to reach "Too many connections". I am not sure if this is a cause or a consecuence. Most of the connections are coming to the nova database and they are sleeping ``` root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show processlist;" | grep nova | grep -i sleep | wc -l 233 root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show global variables like 'max_connections'" +-----------------+-------+ | Variable_name | Value | +-----------------+-------+ | max_connections | 500 | +-----------------+-------+ ``` Most of the queries are coming from: - labcontrol1001 - labnet1001 HW doesn't look broken: ``` root@db1009:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0 Size : 1.633 TB Sector Size : 512 Mirror Data : 1.633 TB State : Optimal Strip Size : 256 KB Number Of Drives per span:2 Span Depth : 6 Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU root@db1009:~# megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up ``` There are some disks with errors, but those could be old ``` root@db1009:~# megacli -PDList -aALL |grep "Media" Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 16 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 85 Media Type: Hard Disk Device Media Error Count: 3 Media Type: Hard Disk Device Media Error Count: 18 Media Type: Hard Disk Device Media Error Count: 92 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 111 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 120 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device ```

dbproxy1005 at 5:07 marked db1009 as down as a result of a huge load spike: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519865294031&to=1519886894031&panelId=7&fullscreen ``` 05:07 < icinga-wm> PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1 ``` As db1009 looked up, I reloaded dbproxy1009 and it started to get heavily pounded. db1009 (m5 master) is being overloaded since 06:10AM (as soon as HAproxy was reloaded as it marked db1009 as down, probably because of the same overload) CPU: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864556811&to=1519886156812&panelId=7&fullscreen Connections: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864572622&to=1519886172622&panelId=15&fullscreen Disk utilization: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864586113&to=1519886186113&panelId=19&fullscreen This is getting the server to reach "Too many connections". I am not sure if this is a cause or a consecuence. Most of the connections are coming to the nova database and they are sleeping ``` root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show processlist;" | grep nova | grep -i sleep | wc -l 233 root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show global variables like 'max_connections'" +-----------------+-------+ | Variable_name | Value | +-----------------+-------+ | max_connections | 500 | +-----------------+-------+ ``` Most of the queries are coming from: - labcontrol1001 - labnet1001 HW doesn't look broken: ``` root@db1009:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0 Size : 1.633 TB Sector Size : 512 Mirror Data : 1.633 TB State : Optimal Strip Size : 256 KB Number Of Drives per span:2 Span Depth : 6 Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU root@db1009:~# megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up ``` There are some disks with errors, but those could be old ``` root@db1009:~# megacli -PDList -aALL |grep "Media" Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 16 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 85 Media Type: Hard Disk Device Media Error Count: 3 Media Type: Hard Disk Device Media Error Count: 18 Media Type: Hard Disk Device Media Error Count: 92 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 111 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 120 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device ```

labsdbproxy1005 at 5:07 marked db1009 as down as a result of a huge load spike: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519865294031&to=1519886894031&panelId=7&fullscreen ``` 05:07 < icinga-wm> PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1 ``` As db1009 looked up, I reloaded dbproxy1009 and it started to get heavily pounded. db1009 (m5 master) is being overloaded since 06:10AM. (as soon as HAproxy was reloaded as it marked db1009 as down, probably because of the same overload) CPU: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864556811&to=1519886156812&panelId=7&fullscreen Connections: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864572622&to=1519886172622&panelId=15&fullscreen Disk utilization: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864586113&to=1519886186113&panelId=19&fullscreen This is getting the server to reach "Too many connections". I am not sure if this is a cause or a consecuence. Most of the connections are coming to the nova database and they are sleeping ``` root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show processlist;" | grep nova | grep -i sleep | wc -l 233 root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show global variables like 'max_connections'" +-----------------+-------+ | Variable_name | Value | +-----------------+-------+ | max_connections | 500 | +-----------------+-------+ ``` Most of the queries are coming from: - labcontrol1001 - labnet1001 HW doesn't look broken: ``` root@db1009:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0 Size : 1.633 TB Sector Size : 512 Mirror Data : 1.633 TB State : Optimal Strip Size : 256 KB Number Of Drives per span:2 Span Depth : 6 Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU root@db1009:~# megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up ``` There are some disks with errors, but those could be old ``` root@db1009:~# megacli -PDList -aALL |grep "Media" Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 16 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 85 Media Type: Hard Disk Device Media Error Count: 3 Media Type: Hard Disk Device Media Error Count: 18 Media Type: Hard Disk Device Media Error Count: 92 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 111 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device Media Error Count: 120 Media Type: Hard Disk Device Media Error Count: 0 Media Type: Hard Disk Device ```