beta cluster databases have almost full disks
Closed, Resolved (Public)

Description

deployment-db03 and deployment-db04 have almost full disks. They are built on instances that provide ~111 GBytes of disk, and each is left with 5 GBytes free.

There are 25 files of 1.1 GBytes each, like /srv/sqldata/deployment-db03-bin.000021, that might be the cause.

Event Timeline

hashar added a subscriber: demon.

@demon connected to each of the beta databases on May 23 for:

2017-05-23 16:55 <RainbowSprinkles> dropped flow_ext_ref from commonswiki on beta. schema migration is busted, going to let it recreate table

And now they look fine:

deployment-db03:~$ df -h /srv
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/vd-second--local--disk  111G   75G   31G  72% /srv

deployment-db04:~$ df -h /srv
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/vd-second--local--disk  111G   38G   68G  36% /srv

For the DBAs, there are still a lot of 1 GB bin files on the instance:

deployment-db03:~$ ls -hl /srv/sqldata/*-bin.*|cut -d\  -f5-
1.1G May 19 18:02 /srv/sqldata/deployment-db03-bin.000014
1.1G May 19 20:56 /srv/sqldata/deployment-db03-bin.000015
1.1G May 19 23:44 /srv/sqldata/deployment-db03-bin.000016
1.1G May 20 02:26 /srv/sqldata/deployment-db03-bin.000017
1.1G May 20 05:05 /srv/sqldata/deployment-db03-bin.000018
1.1G May 20 08:01 /srv/sqldata/deployment-db03-bin.000019
1.1G May 20 10:51 /srv/sqldata/deployment-db03-bin.000020
1.1G May 20 13:44 /srv/sqldata/deployment-db03-bin.000021
1.1G May 20 16:45 /srv/sqldata/deployment-db03-bin.000022
1.1G May 20 19:38 /srv/sqldata/deployment-db03-bin.000023
1.1G May 20 22:21 /srv/sqldata/deployment-db03-bin.000024
1.1G May 21 00:57 /srv/sqldata/deployment-db03-bin.000025
1.1G May 21 03:35 /srv/sqldata/deployment-db03-bin.000026
1.1G May 21 06:21 /srv/sqldata/deployment-db03-bin.000027
1.1G May 21 09:11 /srv/sqldata/deployment-db03-bin.000028
1.1G May 21 11:57 /srv/sqldata/deployment-db03-bin.000029
1.1G May 21 14:49 /srv/sqldata/deployment-db03-bin.000030
1.1G May 21 17:42 /srv/sqldata/deployment-db03-bin.000031
1.1G May 21 20:25 /srv/sqldata/deployment-db03-bin.000032
1.1G May 21 23:11 /srv/sqldata/deployment-db03-bin.000033
1.1G May 22 01:55 /srv/sqldata/deployment-db03-bin.000034
1.1G May 22 04:36 /srv/sqldata/deployment-db03-bin.000035
1.1G May 22 07:17 /srv/sqldata/deployment-db03-bin.000036
1.1G May 22 09:53 /srv/sqldata/deployment-db03-bin.000037
1.1G May 22 12:32 /srv/sqldata/deployment-db03-bin.000038
1.1G May 22 15:11 /srv/sqldata/deployment-db03-bin.000039
1.1G May 22 17:54 /srv/sqldata/deployment-db03-bin.000040
1.1G May 22 20:31 /srv/sqldata/deployment-db03-bin.000041
1.1G May 22 23:06 /srv/sqldata/deployment-db03-bin.000042
1.1G May 23 01:41 /srv/sqldata/deployment-db03-bin.000043
1.1G May 23 04:19 /srv/sqldata/deployment-db03-bin.000044
1.1G May 23 07:01 /srv/sqldata/deployment-db03-bin.000045
1.1G May 23 09:47 /srv/sqldata/deployment-db03-bin.000046
1.1G May 23 12:36 /srv/sqldata/deployment-db03-bin.000047
1.1G May 23 15:19 /srv/sqldata/deployment-db03-bin.000048
1.1G May 23 17:43 /srv/sqldata/deployment-db03-bin.000049
1.1G May 23 19:33 /srv/sqldata/deployment-db03-bin.000050
1.1G May 26 00:36 /srv/sqldata/deployment-db03-bin.000051
1.1G May 26 19:54 /srv/sqldata/deployment-db03-bin.000052
615M May 31 07:31 /srv/sqldata/deployment-db03-bin.000053
1.2K May 26 19:54 /srv/sqldata/deployment-db03-bin.index
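
To put a single number on the listing above, the sizes can be summed instead of read by eye. This is a minimal sketch, not something that was run on the task: the function name `binlog_total_bytes` is hypothetical, it assumes GNU find (for `-printf`), and the pattern would also match relay logs named `*-relay-bin.NNNNNN` if any sit in the same directory.

```shell
# Sum the size in bytes of all binary log files directly under a data
# directory. The numeric-suffix pattern skips the *-bin.index file.
# The default path is the one shown in the listing above.
binlog_total_bytes() {
    datadir=${1:-/srv/sqldata}
    find "$datadir" -maxdepth 1 -type f -name '*-bin.[0-9]*' -printf '%s\n' \
        | awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Usage, converting to GiB for readability:
#   binlog_total_bytes /srv/sqldata \
#       | awk '{ printf "%.1f GiB\n", $1 / (1024 * 1024 * 1024) }'
```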

deployment-db04:

root@BETA[(none)]> SHOW PROCESSLIST \G
*************************** 1. row ***************************
      Id: 3
    User: system user
    Host: 
      db: NULL
 Command: Connect
    Time: 579425
   State: Waiting for master to send event
    Info: NULL
Progress: 0.000
*************************** 2. row ***************************
      Id: 4
    User: system user
    Host: 
      db: NULL
 Command: Connect
    Time: 69
   State: Slave has read all relay log; waiting for the slave I/O thread to update it
    Info: NULL
Progress: 0.000
*************************** 3. row ***************************
      Id: 960022
    User: root
    Host: localhost
      db: NULL
 Command: Query
    Time: 0
   State: init
    Info: SHOW PROCESSLIST
Progress: 0.000
3 rows in set (0.00 sec)

root@BETA[(none)]> SHOW SLAVE STATUS \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: deployment-db03.eqiad.wmflabs
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: deployment-db03-bin.000053
          Read_Master_Log_Pos: 643935603
               Relay_Log_File: deployment-db04-relay-bin.000178
                Relay_Log_Pos: 643935900
        Relay_Master_Log_File: deployment-db03-bin.000053
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 643935603
              Relay_Log_Space: 643936260
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 172234526
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 
1 row in set (0.00 sec)

Mentioned in SAL (#wikimedia-releng) [2017-05-31T07:49:29Z] <hashar> deployment-db03: mysql> set global expire_logs_days = 7 - to expire bin logs faster (instead of 30 days) - T166060

Change 356337 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] beta: keep less mysql bin logs

https://gerrit.wikimedia.org/r/356337

Mentioned in SAL (#wikimedia-releng) [2017-05-31T07:50:59Z] <hashar> deployment-db04: mysql> set global expire_logs_days = 7 - to expire bin logs faster (instead of 30 days) - T166060
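
The two log entries above change the setting at runtime only. A value set with SET GLOBAL lasts until the server restarts, which is presumably why a puppet change accompanies it. A small sketch of the runtime step with a read-back, where the helper name `expire_logs_sql` is hypothetical and a local `mysql` client with sufficient privileges is assumed:

```shell
# Print the SQL that lowers binary log retention at runtime and then
# reads the value back. expire_logs_days is the variable named in the
# log entries above. Rejects anything that is not a plain integer.
expire_logs_sql() {
    days=$1
    case $days in
        ''|*[!0-9]*) echo "usage: expire_logs_sql DAYS" >&2; return 1 ;;
    esac
    printf 'SET GLOBAL expire_logs_days = %s;\n' "$days"
    printf "SHOW GLOBAL VARIABLES LIKE 'expire_logs_days';\n"
}

# Usage (assumption: run as a user allowed to change global variables):
#   expire_logs_sql 7 | mysql
```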

Change 356337 merged by Marostegui:
[operations/puppet@production] beta: keep less mysql bin logs

https://gerrit.wikimedia.org/r/356337

Aced by Manuel in less than a minute. The root cause is that the master was expiring the bin logs after 30 days, which we cannot afford on beta. Lowered that to 7 days, which reduces the disk usage. \O/
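
A rough back-of-the-envelope check of why 30 days does not fit. In the db03 listing earlier in this task, files 000014 to 000050 span May 19 18:02 to May 23 19:33, that is 36 rotations in about 97 hours, or roughly nine 1.1G files a day. That burst coincides with the broken schema migration and may not be representative, so treat the inputs as assumptions. The helper name `binlog_retained_gb` is hypothetical.

```shell
# Estimate the disk held by binary logs:
#   files written per day * size per file (GB) * retention in days.
# Prints the result rounded to a whole number of GB.
binlog_retained_gb() {
    files_per_day=$1
    gb_per_file=$2
    days=$3
    awk -v f="$files_per_day" -v s="$gb_per_file" -v d="$days" \
        'BEGIN { printf "%.0f\n", f * s * d }'
}

# Usage, with the rough figures read from the listing:
#   binlog_retained_gb 9 1.1 30   # 30-day retention
#   binlog_retained_gb 9 1.1 7    # 7-day retention
```

At that rate the 30-day figure comes out at several times the 111G volume, while the 7-day figure stays below it.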

For the record, following Manuel's instructions:

On the master (db03) I copied the bin files 000 to 040 to /srv/sqldata/T166060, then on the master ran:

purge binary logs to 'deployment-db03-bin.000041';

That got rid of the files.
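
The copy-then-purge steps above can be sketched as a small shell helper. This is an illustration under assumptions, not the commands that were actually typed: the function name `backup_binlogs_before` and its arguments are hypothetical, and it only copies files and prints the statement, leaving the purge itself to be run by hand on the master. PURGE BINARY LOGS TO removes every log before the named file and keeps the named file. It is worth confirming first that the replica has read past the logs being purged (Master_Log_File in SHOW SLAVE STATUS).

```shell
# Copy binary logs whose sequence number is below CUTOFF into a backup
# directory, then print the PURGE statement to run on the master.
# Nothing is deleted by this function.
backup_binlogs_before() {
    datadir=$1      # e.g. /srv/sqldata
    backupdir=$2    # e.g. /srv/sqldata/T166060
    prefix=$3       # e.g. deployment-db03-bin
    cutoff=$4       # first file to keep, e.g. 000041

    mkdir -p "$backupdir" || return 1
    for f in "$datadir/$prefix".[0-9]*; do
        [ -f "$f" ] || continue          # unmatched glob or not a file
        seq=${f##*.}                     # numeric suffix, e.g. 000014
        if [ "$seq" -lt "$cutoff" ]; then
            cp -p "$f" "$backupdir/" || return 1
        fi
    done
    printf "PURGE BINARY LOGS TO '%s.%s';\n" "$prefix" "$cutoff"
}

# Usage:
#   backup_binlogs_before /srv/sqldata /srv/sqldata/T166060 \
#       deployment-db03-bin 000041
```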

You should be fine to delete those backed-up logs in a few days, once you are sure everything is fine (which it should be).

Mentioned in SAL (#wikimedia-releng) [2017-06-01T08:03:48Z] <hashar> Purged all mysql bin files from deployment-db03 ( rm -fR /srv/sqldata/T166060 ) - T166060
