
dbtree slowdown 2020-08-20
Closed, Resolved · Public

Description

A bit after midnight, the dbmonitor1001 HTTPS check for dbtree sent an alert:

[00:10:30] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[00:19:40] <mutante>	 dbtree works for me. expecting recovery.

But it didn't recover; after several hours, the alert was still ongoing.

Checking what the alert actually checks (which took a bit to figure out):

[09:45] <jynus> the check does from icinga host: sudo -u nagios /usr/lib/nagios/plugins/check_http -H dbtree.wikimedia.org --ssl --sni -I dbmonitor1001.wikimedia.org -u https://dbtree.wikimedia.org
[09:45] <jynus> and curl tells me it takes just over 10 seconds

Checking graphs for dbmonitor1001, I saw some strange network behaviour around 0h, which led me to check the database, db1115. There is a clear pattern change after 0h: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1115&var-port=9104&from=1597868777860&to=1597911977860

After following a few false leads (cron/event schedulers), I noticed a query misbehaving on tendril, taking an average of 11 seconds, which is suspiciously similar to the web request latency, and starting around the same time frame:

tendril_slowdown.png (376×1 px, 37 KB)

Hits 	Tmax (s) 	Tavg (s) 	Tsum (s) 	Hosts 	Users 	Schemas
880	152	11	10,246	db1115	tendril_web	tendril
select srv.id, floor((max(value)-min(value))/(unix_timestamp(max(stamp))-unix_timestamp(min(stamp)))) as qps from tendril.global_status_log gsl join tendril.strings str on gsl.name_id = str.id join tendril.servers srv on gsl.server_id = srv.id where srv.id in (1227, 1228, 1229, 1230, 1231, 1232, 1233, 1234, 1235, 1270, 1271, 1272, 1273, 1274, 1275, 1276, 1277, 1278, 1281, 1282, 1283, 1303, 1305, 1306, 1307, 1312, 1313, 1314, 1317, 1318, 1342, 1344, 1345, 1352, 1358, 1366, 1370, 1373, 1374, 1375, 1379, 1389, 1390, 1395, 1397, 1400, 1408, 1409, 1423, 1434, 1438, 1440, 1453, 1461, 1468, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, 1531, 1534, 1535, 1558, 1559, 1560, 1561, 1562, 1563, 1565, 1567, 1569, 1570, 1595, 1596, 1597, 1599, 1602, 1605, 1611, 1612, 1614, 1616, 1617, 1618, 1620, 1622, 1623, 1629, 1630, 1631, 1632, 1633, 1634, 1636, 1645, 1650, 1651, 1662, 1675, 1676, 1677, 1679, 1680, 1681, 1684, 1687, 1688, 1690, 1691, 1692, 1693, 1694, 1695, 1696, 1697, 1698, 1699, 1700, 1705, 1706, 1707, 1708, 1710, 1711, 1712, 1713, 1714, 1715, 1716, 1718, 1719, 1720, 1721, 1722, 1723, 1724, 1725, 1726, 1727, 1728, 1729, 1730, 1731, 1732, 1733, 1734, 1735, 1736, 1737, 1740, 1741, 1742, 1743, 1744, 1745, 1748, 1749, 1750, 1754, 1755, 1756, 1757, 1758, 1759, 1763, 1765, 1766, 1767, 1768, 1770, 1773, 1774, 1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1789, 1790, 1791, 1792, 1793, 1794, 1795, 1796, 1797, 1799, 1800, 1801, 1802, 1803, 1804, 1806, 1807, 1808, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831) and str.string = 'questions' and gsl.stamp > now() - interval 10 minute group by `server_id` order by null /* d1c34202ab7572eaf6b93373c221e9c8 db1115 tendril 41s */
33	12	6	216	db1115	tendril_web	tendril
select srv.id, floor((max(value)-min(value))/(unix_timestamp(max(stamp))-unix_timestamp(min(stamp)))) as qps from tendril.global_status_log gsl join tendril.strings str on gsl.name_id = str.id join tendril.servers srv on gsl.server_id = srv.id where str.string = 'questions' and gsl.stamp > now() - interval 10 minute and `server_id` in (1281, 1283, 1282, 1693, 1677, 1313, 1740, 1312, 1305, 1306, 1807, 1318, 1314, 1303, 1796, 1307, 1763, 1765, 1744, 1827, 1317, 1793, 1700, 1675, 1684, 1679, 1680, 1681, 1830, 1831, 1366, 1408, 1423, 1801, 1749, 1370, 1708, 1707, 1373, 1817, 1822, 1823, 1409, 1828, 1676, 1434, 1766, 1767, 1699, 1440, 1650, 1651, 1789, 1790, 1791, 1792, 1783, 1824, 1814, 1461, 1453, 1468, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1618, 1750, 1623, 1611, 1612, 1616, 1826, 1629, 1617, 1802, 1622, 1694, 1614, 1569, 1570, 1741, 1742, 1721, 1722, 1723, 1727, 1728, 1799, 1800, 1731, 1732, 1735, 1736, 1724, 1745, 1342, 1345, 1344, 1778, 1352, 1358, 1713, 1714, 1715, 1716, 1374, 1375, 1815, 1379, 1820, 1759, 1687, 1688, 1757, 1758, 1710, 1395, 1389, 1390, 1400, 1397, 1438, 1770, 1754, 1696, 1501, 1502, 1503, 1504, 1497, 1498, 1499, 1500, 1803, 1559, 1558, 1560, 1563, 1561, 1562, 1567, 1565, 1795, 1705, 1595, 1596, 1597, 1818, 1599, 1776, 1692, 1602, 1777, 1645, 1605, 1690, 1695, 1780, 1819, 1620, 1781, 1797, 1691, 1768, 1630, 1706, 1632, 1633, 1636, 1631, 1634, 1782, 1748, 1779, 1794, 1825, 1829, 1720, 1725, 1726, 1729, 1730, 1733, 1734, 1737, 1751, 1752, 1753, 1760, 1762, 1761, 1784, 1785, 1786, 1788, 1227, 1229, 1230, 1228, 1233, 1232, 1231, 1235, 1234, 1816, 1808, 1813, 1804, 1712, 1774, 1276, 1277, 1278, 1270, 1273, 1271, 1274, 1272, 1275, 1806, 1662, 1773, 1756, 1755, 1711, 1331, 1332, 1746, 1557, 1719, 1698, 1743, 1535, 1531, 1697, 1534, 1718) group by `server_id` order by null /* 50f4b6f6ad4b50cbe56e82c25d8c2476 db1115 tendril 5s */

Event Timeline

Query plan:
root@db1115[tendril]> EXPLAIN select srv.id ...
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: str
         type: const
possible_keys: PRIMARY,name
          key: name
      key_len: 66
          ref: const
         rows: 1
        Extra: Using index; Using temporary
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: srv
         type: index
possible_keys: PRIMARY
          key: m_server_id
      key_len: 5
          ref: NULL
         rows: 242
        Extra: Using where; Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: gsl
         type: range
possible_keys: PRIMARY,stamp,server_stamp
          key: PRIMARY
      key_len: 16
          ref: NULL
         rows: 5596
        Extra: Using where; Using index; Using join buffer (flat, BNL join)
Execution latency and stats:
root@db1115[tendril]> flush status;
Query OK, 0 rows affected (0.02 sec)

root@db1115[tendril]> select srv.id, ...
228 rows in set (11.13 sec)

root@db1115[tendril]> show status like 'Hand%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 2     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 1     |
| Handler_read_key           | 5230  |
| Handler_read_last          | 0     |
| Handler_read_next          | 4787  |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 229   |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 4317  |
| Handler_tmp_write          | 228   |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
26 rows in set (0.00 sec)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T08:44:52Z] <jynus> running analyze table on db1115's tendril.global_status_log, may case some stalls on tendril/dbtree T260876

I believe this is what happened, after further investigation:

  1. 2160c63c8bd7f66c73d639 (for T252331) and related changes were deployed, increasing the purge rate of the global_status_log table
  2. Three months later, after enough rows had been purged and InnoDB stats were automatically refreshed, the QPS query at the top of this task changed plan and stopped using the stamp index, going from 1+ second to 10+ seconds of execution time (see the sketch after this list)
  3. The slow query affects dbtree (on both the dbtree and tendril domains) as well as /hosts and other views, pushing requests over the 10 second timeout
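
As a side note on step 2, a minimal sketch (standard MariaDB statements, not taken from tendril code) of how those statistics can be refreshed and inspected. ANALYZE TABLE is what was later run on db1115 per the SAL entry above, and SHOW INDEX exposes the cardinality estimates the optimizer bases its plan choice on:

-- refresh the table statistics (this is what the "analyze table" SAL entry above refers to)
analyze table tendril.global_status_log;

-- inspect the per-index cardinality estimates that drive the optimizer's choice
-- between the PRIMARY key and the stamp/server_stamp indexes
show index from tendril.global_status_log;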

There are two options to fix this:

  • Add an index hint, something that is not possible with the current query abstraction API and would have to be added (see the hinted query sketched below)
  • Convince the optimizer that the stamp index is the preferred choice (histograms?)

Neither is ideal in terms of effort-to-cleanliness ratio, given the current deprecation state of the tool.
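
For reference, a minimal sketch of what the hinted QPS query would look like (standard MariaDB USE INDEX syntax; the index name comes from the faster plan shown below, and the long server id list from the original query is truncated here):

select srv.id,
       floor((max(value)-min(value))/(unix_timestamp(max(stamp))-unix_timestamp(min(stamp)))) as qps
from tendril.global_status_log gsl use index (stamp)  -- force the stamp index over the PRIMARY key
join tendril.strings str on gsl.name_id = str.id
join tendril.servers srv on gsl.server_id = srv.id
where str.string = 'questions'
  and gsl.stamp > now() - interval 10 minute
  and srv.id in (1227, 1228, 1229 /* ...full id list as in the original query */)
group by `server_id` order by null;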

Current plan:

+------+-------------+-------+-------+----------------------------+-------------+---------+-------+------+--------------
| id   | select_type | table | type  | possible_keys              | key         | key_len | ref   | rows | Extra        
+------+-------------+-------+-------+----------------------------+-------------+---------+-------+------+--------------
|    1 | SIMPLE      | str   | const | PRIMARY,name               | name        | 66      | const |    1 | Using index; 
|    1 | SIMPLE      | srv   | index | PRIMARY                    | m_server_id | 5       | NULL  |  242 | Using where; 
|    1 | SIMPLE      | gsl   | range | PRIMARY,stamp,server_stamp | PRIMARY     | 16      | NULL  | 4869 | Using where; 
+------+-------------+-------+-------+----------------------------+-------------+---------+-------+------+--------------
3 rows in set (0.04 sec)

+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 2     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 1     |
| Handler_read_key           | 5245  |
| Handler_read_last          | 0     |
| Handler_read_next          | 4802  |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 229   |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 4332  |
| Handler_tmp_write          | 228   |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+

Faster plan (despite explain showing more rows):

+------+-------------+-------+--------+---------------+---------+---------+-----------------------+---------+-----------
| id   | select_type | table | type   | possible_keys | key     | key_len | ref                   | rows    | Extra     
+------+-------------+-------+--------+---------------+---------+---------+-----------------------+---------+-----------
|    1 | SIMPLE      | str   | const  | PRIMARY,name  | name    | 66      | const                 |       1 | Using inde
|    1 | SIMPLE      | gsl   | range  | stamp         | stamp   | 8       | NULL                  | 1009715 | Using wher
|    1 | SIMPLE      | srv   | eq_ref | PRIMARY       | PRIMARY | 4       | tendril.gsl.server_id |       1 | Using inde
+------+-------------+-------+--------+---------------+---------+---------+-----------------------+---------+-----------
3 rows in set (0.00 sec)

+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 2       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 0       |
| Handler_icp_attempts       | 0       |
| Handler_icp_match          | 0       |
| Handler_mrr_init           | 0       |
| Handler_mrr_key_refills    | 0       |
| Handler_mrr_rowid_refills  | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 0       |
| Handler_read_key           | 9108    |
| Handler_read_last          | 0       |
| Handler_read_next          | 2643223 |
| Handler_read_prev          | 0       |
| Handler_read_retry         | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_deleted   | 0       |
| Handler_read_rnd_next      | 229     |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_tmp_update         | 4324    |
| Handler_tmp_write          | 228     |
| Handler_update             | 0       |
| Handler_write              | 0       |
+----------------------------+---------+

My guess is that MySQL might be right to choose this plan on its own on a faster storage medium or on a non-TokuDB table?

Change 621529 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/dbtree@master] dbtree: Implement use_index parsing and apply it to QPS query

https://gerrit.wikimedia.org/r/621529

jcrespo triaged this task as Medium priority. Aug 20 2020, 3:00 PM

The above patch should fix the alert; please have a look for obvious security bugs. A similar patch has to be ported to tendril, for both the tree and the other views.

Change 621529 merged by Jcrespo:
[operations/software/dbtree@master] dbtree: Implement use_index parsing and apply it to QPS query

https://gerrit.wikimedia.org/r/621529

The above patch "fixed" the alert, or at least it made the web request roughly 40x faster (from 12 seconds to 0.272 s).

Now we need to apply an equivalent patch to tendril, as I can see that both the tree and host views are slow, too.

Change 621625 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/tendril@master] Port use_index implementation from dbtree to tendril and apply to tree

https://gerrit.wikimedia.org/r/621625

Manually applying this fix to tendril for /host and /tree too seems to get rid of the long-running queries on db1115:

Screenshot_20200821_085445.png (1×2 px, 197 KB)

The graphs are slow, but I am not sure if they have always been like that.

I've just realized that the QPS column shows 0 for all hosts, so this is not really a fix so much as making the query fail immediately. Debugging.

Change 621677 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/dbtree@master] use_index: Fix missing leading whitespace before USE

https://gerrit.wikimedia.org/r/621677

I had a few formatting errors, fixed in the latest upload.
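
For illustration only, a hypothetical reconstruction based on the patch titles (not the actual dbtree code): without a leading space before USE, the table alias and the hint get glued together, producing invalid SQL, which would be consistent with the query failing immediately and QPS showing 0:

-- broken: missing leading whitespace, alias and hint run together (query fragment, invalid SQL)
... from tendril.global_status_log gslUSE INDEX (stamp) join ...

-- fixed: leading whitespace restored before USE
... from tendril.global_status_log gsl USE INDEX (stamp) join ...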

Change 621677 merged by Jcrespo:
[operations/software/dbtree@master] use_index: Fix missing leading whitespace before USE and wrong argument

https://gerrit.wikimedia.org/r/621677

Change 621625 merged by Jcrespo:
[operations/software/tendril@master] Port use_index implementation from dbtree to tendril and apply to tree

https://gerrit.wikimedia.org/r/621625

@Kormat @Marostegui Please check tendril works normally (for some meaning of "normal"), and I will close this once the workaround has been deployed.

jcrespo claimed this task.

Thank you guys for debugging and fixing!