
Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table
Closed, Declined · Public

Description

As seen on T231185, the general_log_sampled table is now growing at around 1TB per week. It causes stalls on Tendril, so we should create a cronjob or an event that either truncates it or deletes rows every day.
Deleting rows is problematic given how big the table is and how many writes it gets.

However, truncating it means we'll no longer have slow query reports... we need to find a balance on how many days we want to be able to check on Tendril.
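
Something along these lines could be a starting point (a minimal sketch only: the event name, the 7-day retention and the 1M-row batch are assumptions to be discussed, and it needs a DELIMITER wrapper when created from the mysql client):

create event tendril_purge_general_log_sampled
  on schedule every 30 minute starts date(now()) + interval 1 hour
  do begin
    if (get_lock('tendril_purge_general_log_sampled', 1) = 1) then
      -- keep roughly one week of samples; batch the delete to limit lock time
      delete from general_log_sampled
        where event_time < now() - interval 7 day
        limit 1000000;
      do release_lock('tendril_purge_general_log_sampled');
    end if;
  end

Purging in batches from an event, rather than a single daily TRUNCATE, keeps some history while bounding how long each delete holds locks.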

Event Timeline

Marostegui moved this task from Triage to Refine on the DBA board.

I am doing some archeology and it looks like this only affects: https://tendril.wikimedia.org/report/sampled_queries and NOT https://tendril.wikimedia.org/report/slow_queries?host=%5Edb&user=wikiuser&schema=wik&qmode=eq&query=&hours=24 which is what we normally use (or at least I do).

The event present on each server is:

create event ${server}_sampled
  on schedule every 10 second starts date(now()) + interval floor(rand() * 9) second
  do begin

    if (get_lock('${server}_sampled', 1) = 0) then
      signal sqlstate value '45000' set message_text = 'get_lock';
    end if;

    select @server_id := id, @enabled := enabled from servers where host = '${host}' and port = ${port};

    if (@enabled = 1) then

      create temporary table t1 as
        select event_time, user_host, thread_id, server_id, command_type, argument
          from ${server}_general_log_sampled;

      insert into general_log_sampled
        (server_id, event_time, user_host, thread_id, m_server_id, command_type, argument, checksum)
          select @server_id, event_time, user_host, thread_id, server_id, command_type, argument, md5(argument)
            from t1;

I could try disabling the per-host event, and also disabling it whenever a new host is created, so we can avoid this (or simply moving the general_log_sampled table to the BLACKHOLE engine or similar).
Going to keep investigating whether it only affects that report and not the general processlist aggregation, which is the one we normally use and is controlled by a different event.

https://tendril.wikimedia.org/report/slow_queries doesn't seem to be using that table for anything:

| 64231005 | tendril_web     | 208.80.155.104:43546 | tendril            | Query       |    13 | statistics                                                            | select  pql.checksum, count(*) as hits, max(pql.time) as max_time, avg(pql.time) as avg_time, sum(pql.time) as sum_time, group_concat(distinct pql.user) as users, group_concat(distinct pql.db order by pql.db) as dbs, group_concat(distinct pql.server_id order by srv.host) as servers, pql.info as sample, pql.db as sample_db, pql.time as sample_time, pql.server_id as sample_server_id from tendril.processlist_query_log pql left join tendril.servers srv on pql.server_id = srv.id where pql.checksum is not null and pql.stamp > now() - interval 72 hour and pql.time > 1 and lower(trim(pql.info)) not like 'show%' and lower(trim(pql.info)) not like 'select master_pos_wait%' and pql.server_id in (1281,1313,1740,1305,1306,1318,1303,1684,1679,1680,1681,1830,1831,1366,1408,1423,1957,1801,1749,1370,1708,1707,1373,1817,1822,1823,1409,1828,1676,1434,1766,1767,1699,1440,1650,1651,1789,1790,1791,1792,1783,1824,1814,1461,1453,1468,1954,1955,1618,1750,1832,1611,1612,1842,1826,1833,1617,1802,1622,1694,1614,1569,1570,1741,1742,1721,1722,1723,1727,1728,1799,1800,1731,1732,1735,1736,1724,1848,1849,1878,1879,1881,1882,1885,1883,1884,1886,1888,1887,1889,1956,1898,1952,1909,1890,1922,1901,1902,1907,1924,1893,1958,1904,1891,1899,1900,1894,1895,1903,1897,1896,1892,1928,1929,1927,1925,1930,1953,1926,1745,1939,1941,1945,1778,1948,1949,1713,1714,1715,1716,1374,1375,1815,1950,1820,1759,1687,1688,1757,1758,1710,1395,1389,1390,1400,1397,1438,1770,1754,1696,1931,1932,1933,1934,1935,1936,1937,1938,1803,1559,1558,1560,1563,1561,1562,1567,1565,1923,1795,1705,1595,1596,1943,1818,1599,1776,1692,1602,1777,1645,1605,1690,1695,1780,1819,1620,1781,1797,1691,1768,1630,1706,1632,1940,1944,1946,1634,1782,1748,1779,1794,1825,1829,1720,1725,1726,1729,1730,1947,1733,1737,1835,1836,1875,1876,1877,1908,1910,1905,1919,1920,1921,1906,1751,1752,1753,1760,1762,1761,1784,1785,1786,1788) and pql.db regexp 'wik' and pql.user regexp 'wikiuser' group by pql.checksum  order by `sum_time` desc, `max_time` desc limit 50

And neither does https://tendril.wikimedia.org/report/slow_queries seem to use it:

| 64261394 | tendril_web     | 208.80.155.104:43672 | tendril            | Query       |     1 | Queried about 30000 rows                                              | select  q.footprint, count(*) as hits, group_concat(distinct qsl.server_id order by srv.host) as servers, q.template as template, q.content as sample, qsl.server_id as sample_server_id from tendril.queries_seen_log qsl left join tendril.servers srv on qsl.server_id = srv.id join tendril.queries q on qsl.checksum = q.checksum where q.footprint is not null and qsl.stamp > now() - interval 1000 hour and qsl.server_id in (1281,1313,1740,1305,1306,1318,1303,1684,1679,1680,1681,1830,1831,1366,1408,1423,1957,1801,1749,1370,1708,1707,1373,1817,1822,1823,1409,1828,1676,1434,1766,1767,1699,1440,1650,1651,1789,1790,1791,1792,1783,1824,1814,1461,1453,1468,1954,1955,1618,1750,1832,1611,1612,1842,1826,1833,1617,1802,1622,1694,1614,1569,1570,1741,1742,1721,1722,1723,1727,1728,1799,1800,1731,1732,1735,1736,1724,1848,1849,1878,1879,1881,1882,1885,1883,1884,1886,1888,1887,1889,1956,1898,1952,1909,1890,1922,1901,1902,1907,1924,1893,1958,1904,1891,1899,1900,1894,1895,1903,1897,1896,1892,1928,1929,1927,1925,1930,1953,1926,1745,1939,1941,1945,1778,1948,1949,1713,1714,1715,1716,1374,1375,1815,1950,1820,1759,1687,1688,1757,1758,1710,1395,1389,1390,1400,1397,1438,1770,1754,1696,1931,1932,1933,1934,1935,1936,1937,1938,1803,1559,1558,1560,1563,1561,1562,1567,1565,1923,1795,1705,1595,1596,1943,1818,1599,1776,1692,1602,1777,1645,1605,1690,1695,1780,1819,1620,1781,1797,1691,1768,1630,1706,1632,1940,1944,1946,1634,1782,1748,1779,1794,1825,1829,1720,1725,1726,1729,1730,1947,1733,1737,1835,1836,1875,1876,1877,1908,1910,1905,1919,1920,1921,1906,1751,1752,1753,1760,1762,1761,1784,1785,1786,1788) and lower(q.content) regexp cast(X'5e5b5b3a73706163653a5d5d2a2873656c6563747c696e736572747c7570646174657c64656c65746529' as char) group by q.footprint  order by `hits` desc limit 50 |    0.000 |

Tendril is stuck again, and this is what I have seen:

| 66202925 | root            | 10.64.32.25          | tendril            | Connect     |   606 | Sending data                                                          | insert into general_log_sampled

Mentioned in SAL (#wikimedia-operations) [2021-04-29T20:16:17Z] <marostegui> Restart tendril database - T281486

For now I have:

mysql:root@localhost [tendril]> alter table general_log_sampled ENGINE = BLACKHOLE;
Query OK, 0 rows affected (0.36 sec)
Records: 0  Duplicates: 0  Warnings: 0

I haven't found what general_log_sampled is really used for, as this keeps working: https://tendril.wikimedia.org/report/slow_queries?host=%5Edb&user=wikiuser&schema=wik&qmode=eq&query=&hours=48

As I am sure the cleanup of the global_status_log* tables isn't helping, I have disabled it for now - those events probably need changing to delete fewer rows at a time.
Right now, every 5 minutes it tries to delete 3M rows (which is pretty much the number of rows that get written to those tables in that time).
I will change it tomorrow to do 1M instead and see if it works better than 3M, although it means the global_status_log* tables will keep growing.
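
For reference, the change I have in mind is roughly the following (a sketch of the idea only; the real event body is different and the stamp-based condition shown here is an assumption):

-- lower the per-run batch so each delete holds locks for less time,
-- accepting that the purge may fall behind the write rate
delete from global_status_log
  where stamp < now() - interval 7 day
  limit 1000000; -- the current runs delete ~3M rows every 5 minutes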

mysql:root@localhost [tendril]> alter event tendril_purge_global_status_log disable;
Query OK, 0 rows affected (0.10 sec)

mysql:root@localhost [tendril]> alter event tendril_purge_global_status_log_5m disable;
Query OK, 0 rows affected (0.00 sec)

I also noticed a query from tendril_web which was:

| 64261394 | tendril_web | 208.80.155.104:43672 | tendril            | Killed      | 19733 | removing tmp table                                                    | select  q.footprint, count(*) as hits, group_concat(distinct qsl.server_i

That is 5 hours... maybe this has caused the whole server to stall too.
I ended up restarting mysql, which took quite a while; however, as soon as it came back it went crazy again, with 1.5k connections and writes getting stuck on several tables.

I am starting to believe that this is really just a contention problem, as we have 3k events that run pretty much every second or minute, all trying to write aggregated data to the same tables.
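
A rough way to see the breakdown by event type (grouping on the event name suffix is just an assumption about the naming scheme, but it gives an idea of the volume):

select substring_index(event_name, '_', -1) as kind,
       count(*) as total,
       sum(status = 'ENABLED') as enabled
  from information_schema.events
 where event_schema = 'tendril'
 group by kind
 order by total desc;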

For instance, the events that grab all the variables:

root@db1115:/srv/sqldata/tendril# mysql -A tendril -e "show events" | grep variables | wc -l
298

300 events running every minute for something we don't really use.
I have disabled all of them:

for i in `mysql -A tendril -e "show events" | grep variables | awk -F " " '{print $2}'`; do mysql -A tendril -e "alter event $i disable" ; done

I have also disabled the sampled events, which run every few seconds for 300 hosts.

root@db1115:/srv/sqldata/tendril# for i in `mysql -A tendril -e "show events" | grep sampled | awk -F " " '{print $2}'`; do mysql -A tendril -e "alter event $i disable" ; done
root@db1115:/srv/sqldata/tendril#

Also noticed:

[Thu Apr 29 18:18:08 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:10 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:12 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:14 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:16 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:17 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:18 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:20 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:21 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:22 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:24 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:25 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:26 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)
[Thu Apr 29 18:18:27 2021] XFS: mysqld(30381) possible memory allocation deadlock size 524304 in kmem_realloc (mode:0x2400240)

The server still goes mad after enabling the event_scheduler:

20:39:11 up 7 days,  7:43,  3 users,  load average: 453.35, 332.61, 254.61

I have rebooted the host and will investigate whether those memory errors are the source of the hangs or a consequence of them.

And after the reboot and the mysql start:

root@db1115:~# w
 20:47:03 up 4 min,  2 users,  load average: 264.16, 65.70, 22.02

The host is really unusable after a few minutes. I am going to hard reset it, leave mysql stopped, and run some XFS fragmentation reports.

-bash-4.4$ w
 20:53:26 up 10 min,  3 users,  load average: 455.90, 362.49, 179.98

Mentioned in SAL (#wikimedia-operations) [2021-04-29T20:54:34Z] <marostegui> Stop mysql on tendril for the UTC night, dbtree and tendrill will remain down for a few hours T281486

root@db1115:~# xfs_db -c frag -r /dev/md2
actual 17139, ideal 8175, fragmentation factor 52.30%

Tendril and dbtree will remain stopped for the next few hours. I am defragmenting XFS.

The defragmentation finished:

root@db1115:~# xfs_db -c frag -r /dev/md2
actual 8276, ideal 8176, fragmentation factor 1.21%

Starting MySQL now without the event scheduler to check the general status of all the big tables.

As soon as we start the events, the server goes crazy again.

Changing the isolation level to 'READ-UNCOMMITTED' seems to have worked, and the server is now under control after restarting the event_scheduler.

Also, I have truncated processlist_query_log, which confirms that it is the table used by https://tendril.wikimedia.org/report/slow_queries?host=^db&user=wikiuser&schema=wik&hours=1, and not general_log_sampled, so I am leaving that event disabled to avoid extra load.
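
For the record, the equivalent statements were roughly the following (the exact commands weren't pasted here, and the global isolation change only applies to new connections):

set global tx_isolation = 'READ-UNCOMMITTED';
truncate table processlist_query_log;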

Unfortunately, the *variables events are used to provide the version, RAM, etc. of the hosts, so those need to be re-enabled, as otherwise we don't get that information in the host view.

Changed back to READ-COMMITTED, as otherwise the host view isn't really useful: it doesn't show the replication status and lag for most of the hosts.

The load and the generally bad status can be seen in this image:

Captura de pantalla 2021-04-30 a las 5.56.10.png (806×1 px, 223 KB)

And the host is stuck again with 1000 connections.

I have increased the InnoDB buffer pool to 50GB (we were using 20GB); those big tables are InnoDB, so let's see if this helps.
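
Roughly the equivalent change (hypothetical: if this MariaDB version doesn't support an online resize, it means setting innodb_buffer_pool_size = 50G in the config and restarting instead):

set global innodb_buffer_pool_size = 53687091200; -- 50GB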

The server seems "stable" at around 600 connections, which is way higher than it normally is, but so far it is not increasing.
I really think we've reached the limits of what we can do with so many events on this host.

Disabled the *schema events (one set runs every minute and the other once a day); we don't use them for anything (a sketch of how they can be disabled in bulk follows the event definitions below):

CREATE DEFINER=`root`@`localhost` EVENT `pc1010_eqiad_wmnet_3306_schema` ON SCHEDULE EVERY 1 MINUTE STARTS '2020-05-08 00:00:00' ON COMPLETION NOT PRESERVE DISABLE DO begin

    if (get_lock('pc1010_eqiad_wmnet_3306_schema', 1) = 0) then
      signal sqlstate value '45000' set message_text = 'get_lock';
    end if;

    select @server_id := id, @enabled := enabled from servers where host = 'pc1010.eqiad.wmnet' and port = 3306;

    if (@enabled = 1) then



      begin

        declare db_done int default 0;
        declare db_name varchar(100);

        declare db_names cursor for
          select schema_name from schemata_check where server_id = @server_id limit 10;

        declare continue handler for not found set db_done = 1;

        set db_done = 0;
        open db_names;

        repeat fetch db_names into db_name;

          if (db_done = 0 and db_name is not null) then

            delete from schemata where server_id = @server_id and schema_name = db_name;

            create temporary table new_tables as
              select table_name from pc1010_eqiad_wmnet_3306_tablenames where table_schema = db_name;

            begin

              declare tbl_done int default 0;
              declare tbl_name varchar(100);

              declare tbl_names cursor for
                select table_name from new_tables;

              declare continue handler for not found set tbl_done = 1;

              set tbl_done = 0;
              open tbl_names;

              repeat fetch tbl_names into tbl_name;

                if (tbl_done = 0 and tbl_name is not null) then



                  create temporary table new_table as
                    select * from pc1010_eqiad_wmnet_3306_tables where table_schema = db_name and table_name = tbl_name;

                  select @fields := group_concat(column_name) from information_schema.columns
                    where table_schema = database() and table_name = 'pc1010_eqiad_wmnet_3306_tables';

                  delete from tables where server_id = @server_id and table_schema = db_name and table_name = tbl_name;

                  set @sql := concat(
                    'insert into tables (server_id, ', @fields, ') select @server_id, ', @fields, ' from new_table'
                  );

                  prepare stmt from @sql; execute stmt; deallocate prepare stmt;

                  drop temporary table new_table;



                  create temporary table new_columns as
                    select * from pc1010_eqiad_wmnet_3306_columns where table_schema = db_name and table_name = tbl_name;

                  select @fields := group_concat(column_name) from information_schema.columns
                    where table_schema = database() and table_name = 'pc1010_eqiad_wmnet_3306_columns';

                  delete from columns where server_id = @server_id
                    and table_schema = db_name and table_name = tbl_name;

                  set @sql := concat(
                    'insert into columns (server_id, ', @fields, ') select @server_id, ', @fields, ' from new_columns'
                  );

                  prepare stmt from @sql; execute stmt; deallocate prepare stmt;

                  drop temporary table new_columns;



                  create temporary table new_partitions as
                    select * from pc1010_eqiad_wmnet_3306_partitions where table_schema = db_name and table_name = tbl_name;

                  select @fields := group_concat(column_name) from information_schema.columns
                    where table_schema = database() and table_name = 'pc1010_eqiad_wmnet_3306_partitions';

                  delete from partitions where server_id = @server_id
                    and table_schema = db_name and table_name = tbl_name;

                  set @sql := concat(
                    'insert into partitions (server_id, ', @fields, ') select @server_id, ', @fields, ' from new_partitions'
                  );

                  prepare stmt from @sql; execute stmt; deallocate prepare stmt;

                  drop temporary table new_partitions;

                end if;

                until tbl_done
              end repeat;

              close tbl_names;

            end;



            create temporary table new_triggers as
              select * from pc1010_eqiad_wmnet_3306_triggers where trigger_schema = db_name;

            select @fields := group_concat(column_name) from information_schema.columns
              where table_schema = database() and table_name = 'pc1010_eqiad_wmnet_3306_triggers';

            delete from triggers where server_id = @server_id and trigger_schema = db_name;

            set @sql := concat(
              'insert into triggers (server_id, ', @fields, ') select @server_id, ', @fields, ' from new_triggers'
            );

            prepare stmt from @sql; execute stmt; deallocate prepare stmt;



            delete from tables where server_id = @server_id and table_schema = db_name and table_name not in (
              select table_name from new_tables
            );

            delete from triggers where server_id = @server_id and trigger_schema = db_name and trigger_name not in (
              select trigger_name from new_triggers
            );

            drop temporary table new_tables;
            drop temporary table new_triggers;

            delete from schemata_check where server_id = @server_id and schema_name = db_name;
            insert into schemata select @server_id, t.* from pc1010_eqiad_wmnet_3306_schemata t where t.schema_name = db_name;

          end if;

          until db_done
        end repeat;

        close db_names;

      end;

      update servers set event_schema = now() where id = @server_id;

    end if;

    do release_lock('pc1010_eqiad_wmnet_3306_schema');
CREATE DEFINER=`root`@`localhost` EVENT `pc1010_eqiad_wmnet_3306_schema_prep` ON SCHEDULE EVERY 1 DAY STARTS '2020-05-08 05:00:00' ON COMPLETION NOT PRESERVE DISABLE DO begin

    select @server_id := id, @enabled := enabled from servers where host = 'pc1010.eqiad.wmnet' and port = 3306;

    delete from schemata_check where server_id = @server_id;
    insert into schemata_check select @server_id, t.schema_name from pc1010_eqiad_wmnet_3306_schemata t;

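The disabling can be done with a loop like the ones above; alternatively, the statements can be generated from information_schema (a sketch, assuming these are the only events whose names end in _schema or _schema_prep):

select concat('alter event `', event_name, '` disable;')
  from information_schema.events
 where event_schema = 'tendril'
   and (event_name like '%\_schema' or event_name like '%\_schema\_prep');
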
The host sits at around 400-600 connections all the time. I am not going to make more changes; it is slow but functional at this point.

Closing this for now - I don't think we are doing this anytime soon.