A bit after midnight, the dbmonitor1001 HTTPS alert for dbtree fired:
[00:10:30] <icinga-wm> PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[00:19:40] <mutante> dbtree works for me. expecting recovery.
But it did not recover: several hours later, the alert was still firing.
Checking what exactly the alert does (which took a bit to figure out):
[09:45] <jynus> the check does from icinga host: sudo -u nagios /usr/lib/nagios/plugins/check_http -H dbtree.wikimedia.org --ssl --sni -I dbmonitor1001.wikimedia.org -u https://dbtree.wikimedia.org
[09:45] <jynus> and curl tells me it takes just over 10 seconds
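The mismatch between "works for me" and a CRITICAL alert follows from the numbers above: a page that loads in just over 10 seconds still renders in a browser, but exceeds the 10 second timeout reported in the alert. A minimal Python sketch of that comparison is below. It is an illustration only, not the Icinga check: the names `classify` and `time_request` are hypothetical, and unlike the real check it does not pin the connection to dbmonitor1001 (the `-I` option).

```python
import time
import urllib.request

CHECK_TIMEOUT_S = 10.0  # timeout quoted in the alert text


def classify(elapsed_s, timeout_s=CHECK_TIMEOUT_S):
    """Return the state a timeout-only check would report for this latency."""
    return "CRITICAL" if elapsed_s >= timeout_s else "OK"


def time_request(url, socket_timeout_s=60.0):
    """Fetch url and return the total wall-clock seconds for the full response.

    The socket timeout is deliberately generous so that a slow page is
    measured instead of aborted.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=socket_timeout_s) as resp:
        resp.read()
    return time.monotonic() - start
```

Calling `classify(time_request("https://dbtree.wikimedia.org"))` would report CRITICAL for a response of 10 seconds or more, which matches the behaviour seen here.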
Checking graphs for dbmonitor1001, I saw some strange network behaviour around 00:00, which led me to check the database host db1115. There is a clear pattern change after 00:00: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1115&var-port=9104&from=1597868777860&to=1597911977860
After following a few false leads (cron/event schedulers), I noticed a query misbehaving on tendril. It was taking an average of 11 seconds, which is suspiciously similar to the web request latency, and it started around the same time:
Hits Tmax Tavg Tsum Hosts Users Schemas

880 152 11 10,246 db1115 tendril_web tendril
select srv.id, floor((max(value)-min(value))/(unix_timestamp(max(stamp))-unix_timestamp(min(stamp)))) as qps from tendril.global_status_log gsl join tendril.strings str on gsl.name_id = str.id join tendril.servers srv on gsl.server_id = srv.id where srv.id in (1227, 1228, 1229, 1230, 1231, 1232, 1233, 1234, 1235, 1270, 1271, 1272, 1273, 1274, 1275, 1276, 1277, 1278, 1281, 1282, 1283, 1303, 1305, 1306, 1307, 1312, 1313, 1314, 1317, 1318, 1342, 1344, 1345, 1352, 1358, 1366, 1370, 1373, 1374, 1375, 1379, 1389, 1390, 1395, 1397, 1400, 1408, 1409, 1423, 1434, 1438, 1440, 1453, 1461, 1468, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, 1531, 1534, 1535, 1558, 1559, 1560, 1561, 1562, 1563, 1565, 1567, 1569, 1570, 1595, 1596, 1597, 1599, 1602, 1605, 1611, 1612, 1614, 1616, 1617, 1618, 1620, 1622, 1623, 1629, 1630, 1631, 1632, 1633, 1634, 1636, 1645, 1650, 1651, 1662, 1675, 1676, 1677, 1679, 1680, 1681, 1684, 1687, 1688, 1690, 1691, 1692, 1693, 1694, 1695, 1696, 1697, 1698, 1699, 1700, 1705, 1706, 1707, 1708, 1710, 1711, 1712, 1713, 1714, 1715, 1716, 1718, 1719, 1720, 1721, 1722, 1723, 1724, 1725, 1726, 1727, 1728, 1729, 1730, 1731, 1732, 1733, 1734, 1735, 1736, 1737, 1740, 1741, 1742, 1743, 1744, 1745, 1748, 1749, 1750, 1754, 1755, 1756, 1757, 1758, 1759, 1763, 1765, 1766, 1767, 1768, 1770, 1773, 1774, 1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1789, 1790, 1791, 1792, 1793, 1794, 1795, 1796, 1797, 1799, 1800, 1801, 1802, 1803, 1804, 1806, 1807, 1808, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831) and str.string = 'questions' and gsl.stamp > now() - interval 10 minute group by `server_id` order by null /* d1c34202ab7572eaf6b93373c221e9c8 db1115 tendril 41s */

33 12 6 216 db1115 tendril_web tendril
select srv.id, floor((max(value)-min(value))/(unix_timestamp(max(stamp))-unix_timestamp(min(stamp)))) as qps from tendril.global_status_log gsl join tendril.strings str on gsl.name_id = str.id join tendril.servers srv on gsl.server_id = srv.id where str.string = 'questions' and gsl.stamp > now() - interval 10 minute and `server_id` in (1281, 1283, 1282, 1693, 1677, 1313, 1740, 1312, 1305, 1306, 1807, 1318, 1314, 1303, 1796, 1307, 1763, 1765, 1744, 1827, 1317, 1793, 1700, 1675, 1684, 1679, 1680, 1681, 1830, 1831, 1366, 1408, 1423, 1801, 1749, 1370, 1708, 1707, 1373, 1817, 1822, 1823, 1409, 1828, 1676, 1434, 1766, 1767, 1699, 1440, 1650, 1651, 1789, 1790, 1791, 1792, 1783, 1824, 1814, 1461, 1453, 1468, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1618, 1750, 1623, 1611, 1612, 1616, 1826, 1629, 1617, 1802, 1622, 1694, 1614, 1569, 1570, 1741, 1742, 1721, 1722, 1723, 1727, 1728, 1799, 1800, 1731, 1732, 1735, 1736, 1724, 1745, 1342, 1345, 1344, 1778, 1352, 1358, 1713, 1714, 1715, 1716, 1374, 1375, 1815, 1379, 1820, 1759, 1687, 1688, 1757, 1758, 1710, 1395, 1389, 1390, 1400, 1397, 1438, 1770, 1754, 1696, 1501, 1502, 1503, 1504, 1497, 1498, 1499, 1500, 1803, 1559, 1558, 1560, 1563, 1561, 1562, 1567, 1565, 1795, 1705, 1595, 1596, 1597, 1818, 1599, 1776, 1692, 1602, 1777, 1645, 1605, 1690, 1695, 1780, 1819, 1620, 1781, 1797, 1691, 1768, 1630, 1706, 1632, 1633, 1636, 1631, 1634, 1782, 1748, 1779, 1794, 1825, 1829, 1720, 1725, 1726, 1729, 1730, 1733, 1734, 1737, 1751, 1752, 1753, 1760, 1762, 1761, 1784, 1785, 1786, 1788, 1227, 1229, 1230, 1228, 1233, 1232, 1231, 1235, 1234, 1816, 1808, 1813, 1804, 1712, 1774, 1276, 1277, 1278, 1270, 1273, 1271, 1274, 1272, 1275, 1806, 1662, 1773, 1756, 1755, 1711, 1331, 1332, 1746, 1557, 1719, 1698, 1743, 1535, 1531, 1697, 1534, 1718) group by `server_id` order by null /* 50f4b6f6ad4b50cbe56e82c25d8c2476 db1115 tendril 5s */
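For readers unfamiliar with the query, the `qps` expression divides the growth of each server's `questions` counter over the last 10 minutes by the time between the oldest and newest sample in that window. The Python sketch below reproduces that arithmetic for a single server. It is an illustration under assumptions, not tendril code: the function name `qps` and the sample values are made up, and returning `None` for a zero-length window is a choice made here to avoid dividing by zero.

```python
import math
from datetime import datetime, timedelta


def qps(samples):
    """Compute queries per second from (stamp, value) counter samples.

    Mirrors floor((max(value) - min(value)) /
                  (unix_timestamp(max(stamp)) - unix_timestamp(min(stamp))))
    for the rows of one server.
    """
    if not samples:
        return None
    stamps = [stamp for stamp, _ in samples]
    values = [value for _, value in samples]
    elapsed_s = (max(stamps) - min(stamps)).total_seconds()
    if elapsed_s == 0:
        return None  # a single sample (or identical stamps) gives no rate
    return math.floor((max(values) - min(values)) / elapsed_s)


# Hypothetical samples: the counter grows by 300,000 over 600 seconds.
t0 = datetime(2020, 8, 20, 0, 0, 0)
example = [
    (t0, 1_000_000),
    (t0 + timedelta(seconds=300), 1_150_000),
    (t0 + timedelta(seconds=600), 1_300_000),
]
print(qps(example))  # 500
```

The arithmetic itself is cheap; the query has to find the matching rows in `global_status_log` for every listed server before it can aggregate them.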