
db1115's (tendril) disk filling up
Closed, ResolvedPublic

Description

Looks like we need to purge some tables:

root@db1115:/srv# df -hT /srv/
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/md2       xfs   3.6T  3.3T  399G  90% /srv
root@db1115:/srv# du -sh *
3.3T	sqldata

-rw-rw---- 1 mysql mysql 3.1G Sep 14 05:35 global_status.ibd
-rw-rw---- 1 mysql mysql 2.6T Sep 14 05:35 general_log_sampled.ibd
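The df/du inspection above can also be scripted when hunting for the file that is eating the disk. A minimal Python sketch (the function name and path are illustrative, not part of the original investigation) that surfaces the biggest files under a datadir:

```python
import os

def largest_files(root, n=5):
    """Return the n largest files under root as (size_bytes, path) pairs."""
    sizes = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                pass  # file vanished or unreadable; skip it
    return sorted(sizes, reverse=True)[:n]

# e.g. largest_files("/srv/sqldata") would surface general_log_sampled.ibd
```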

Event Timeline

Restricted Application added a subscriber: Aklapper. Sep 14 2020, 5:35 AM
Marostegui triaged this task as High priority.
Marostegui moved this task from Triage to In progress on the DBA board.

Something happened on the 26th of August, when disk usage on that table started to grow significantly on this host

Most of the logged queries are coming from db1138 from what I can see.

Mentioned in SAL (#wikimedia-operations) [2020-09-14T05:54:25Z] <marostegui> Truncate tendril.general_log_sampled on db1115 - T262782

Marostegui lowered the priority of this task from High to Medium.Sep 14 2020, 6:02 AM

I have truncated the table, but after a few minutes it again has almost 1M rows, and this is interesting:

mysql:root@localhost [tendril]> select distinct(m_server_id) from general_log_sampled;
+-------------+
| m_server_id |
+-------------+
|   171978876 |
+-------------+
1 row in set (0.57 sec)

mysql:root@localhost [tendril]> select INET_NTOA(171978876) from general_log_sampled limit 1;
+----------------------+
| INET_NTOA(171978876) |
+----------------------+
| 10.64.48.124         |
+----------------------+
1 row in set (0.00 sec)
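MySQL's INET_NTOA() simply unpacks a 32-bit integer into dotted-quad notation, so the conversion above can be double-checked outside MySQL, for instance with Python's standard ipaddress module:

```python
import ipaddress

# MySQL's INET_NTOA(n) interprets n as a big-endian 32-bit IPv4 address.
server_id = 171978876
print(ipaddress.IPv4Address(server_id))  # 10.64.48.124
```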

That is db1138. I will check what's going on with that host or that event, but the initial issue is mitigated:

root@db1115:/srv/sqldata/tendril# df -hT /srv
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/md2       xfs   3.6T  644G  3.0T  18% /srv
Marostegui closed this task as Resolved.Sep 14 2020, 7:44 AM

Mystery solved.
Looks like the general_log table on db1138 contained lots of queries (248k) from the 26th of August (the general log was probably enabled by me for MCR query capture).
This was being imported into tendril every minute via:

| db1138_eqiad_wmnet_3306_general_log_sampled | CREATE TABLE `db1138_eqiad_wmnet_3306_general_log_sampled` (
  `event_time` timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6),
  `user_host` mediumtext NOT NULL,
  `thread_id` int(11) NOT NULL,
  `server_id` int(10) unsigned NOT NULL,
  `command_type` varchar(64) NOT NULL,
  `argument` mediumtext NOT NULL
) ENGINE=FEDERATED DEFAULT CHARSET=utf8 CONNECTION='mysql://xxx:xxxxxxx@db1138.eqiad.wmnet:3306/mysql/general_log'

Then, the general_log_sample event was importing those rows every minute, hence the linear growth.
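As a back-of-the-envelope illustration (the exact sampling logic of the event is not shown here, so this assumes each run re-imports the full stuck log), re-reading a 248k-row general_log once a minute produces linear growth at a dramatic rate:

```python
# Illustration only: if every run of the per-minute event re-imports the
# same ~248k rows stuck in db1138's general_log since Aug 26, the sampled
# table grows linearly at roughly:
rows_per_run = 248_000
runs_per_day = 60 * 24
rows_per_day = rows_per_run * runs_per_day
print(rows_per_day)  # 357120000 rows per day
```

That order of magnitude is consistent with the table reaching almost 1M rows within a few minutes of being truncated.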

I have truncated general_log on db1138.