
parsercache used disk space increase
Closed, ResolvedPublic

Description

It looks like yesterday at around 14:23 our parsercache hosts started to experience an increase in used disk space.

https://grafana.wikimedia.org/render/dashboard-solo/file/server-board.json?refresh=1m&panelId=17&orgId=1&from=1539149866002&to=1539236266002&var-server=1005&var-network=eth0&width=1000&height=500&tz=UTC%2B02%3A00

After we settle back to normal operation:

  • The binlog purger should be stopped on pc1004**
  • The binlog purger should be stopped on pc1005**
  • The binlog purger should be stopped on pc1006**
  • The binlog purger should be stopped on pc2004**
  • The binlog purger should be stopped on pc2005**
  • The binlog purger should be stopped on pc2006**
  • The tables in 'parsercache' schema should be truncated on pc2004
  • The tables in 'parsercache' schema should be truncated on pc2005
  • The tables in 'parsercache' schema should be truncated on pc2006
  • max_binlog_size should be set back to 1048576000*
*     a lot of data was purged already, but InnoDB won't free up the space automatically
**   they're running inside a screen named 'purge' (screen -r purge)
*** or change that in the config too
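
For reference, restoring the runtime value (before also changing it back in the config, per the note above) could look like this minimal sketch, assuming root MySQL access on each pc host:

```shell
# Hedged sketch: restore max_binlog_size at runtime (1048576000 bytes = 1000 MB);
# the my.cnf value should be changed back as well so the setting survives a restart.
sudo mysql --skip-ssl -e "SET GLOBAL max_binlog_size = 1048576000"
# verify the new value
sudo mysql --skip-ssl -BN -e "SELECT @@GLOBAL.max_binlog_size"
```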

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 11 2018, 5:42 AM
Marostegui triaged this task as Normal priority.Oct 11 2018, 5:43 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2018-10-11T05:43:30Z] <marostegui> Purge binary logs on pc2005 due to disk space issues - T206740

Marostegui updated the task description. (Show Details)Oct 11 2018, 5:48 AM

Edited the task description to remove some graphs that could be misleading (they were server reboots due to upgrades)

Marostegui raised the priority of this task from Normal to Unbreak Now!.Oct 11 2018, 7:50 AM

This is increasing quite rapidly

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptOct 11 2018, 7:50 AM
Marostegui mentioned this in Unknown Object (Task).Oct 11 2018, 8:41 AM
Banyek added a subscriber: Banyek.Oct 11 2018, 9:19 AM

The binary logs were purged on pc1004, pc1005 and pc1006.
Also, max_binlog_size was set to 10M and the hosts now have this running in a screen:

# loop forever: every minute, drop binlogs older than one hour
while true; do
  echo "purging binlogs";
  sudo mysql --skip-ssl -e "PURGE BINARY LOGS BEFORE (NOW() - INTERVAL 1 HOUR)";
  sleep 60;
done
Banyek updated the task description. (Show Details)Oct 11 2018, 9:26 AM
Banyek updated the task description. (Show Details)
Banyek updated the task description. (Show Details)
akosiaris updated the task description. (Show Details)Oct 11 2018, 9:34 AM
Marostegui updated the task description. (Show Details)Oct 11 2018, 10:02 AM
Marostegui lowered the priority of this task from Unbreak Now! to High.Oct 11 2018, 2:23 PM

This is no longer "unbreak now"

pc1004
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc1004--vg-srv xfs   2.2T  1.6T  644G  71% /srv

pc1005
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc1005--vg-srv xfs   2.2T  1.5T  695G  69% /srv

pc1006
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc1006--vg-srv xfs   2.2T  1.5T  651G  71% /srv

We need to start purging binlogs in codfw I would say (eqiad caught up with codfw):

pc2004
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc2004--vg-srv xfs   2.2T  1.9T  332G  85% /srv

pc2005
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc2005--vg-srv xfs   2.2T  1.9T  334G  85% /srv

pc2006
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc2006--vg-srv xfs   2.2T  1.9T  299G  87% /srv

Forgetting the codfw -> eqiad replication was the most likely cause of overload on the application servers (and on External storage hosts).

I think we should track this (and the respective actionables) in a different task.

Agreed - we put it on top of this task just to avoid forgetting it, but we can move it to a different one.

The MediaWiki eqiad-appserver cluster gasping for air,

as each server wakes up from a four-week switchover dream,

to an ice-cold parser cache.

Restricted Application added a project: Performance-Team. · View Herald TranscriptOct 12 2018, 2:36 AM

The binlog purging on codfw was started yesterday (sorry I didn't log it here), and it has been running since; replication works, and the disk space is under control.

Marostegui added a comment.EditedOct 15 2018, 5:38 AM

@Banyek please double-check that the key purge has finished on mwmaint1002 and keep going with the rest of the pending things to do here.
After re-enabling the normal config options, you might want to check whether the table rebuild is worth it by doing it on codfw (make sure not to replicate the changes) and comparing before/after space, so we can evaluate whether it is worth the time or not (as we are in good shape now).

On checking the screen '91243.parsercache' on mwmaint1002 I can confirm that the key purge has finished - I'll proceed.

Banyek claimed this task.Oct 15 2018, 7:59 AM
Banyek added a project: User-Banyek.

Sounds good!
Also, as per T206740#4659202, let's create a separate task for the replication check addition, so we can just focus on the immediate triage in this task I would say.

Mentioned in SAL (#wikimedia-operations) [2018-10-15T09:41:35Z] <banyek> max_binlog_size is set back to 1048576000 on ParseCache hosts (T206740)

Banyek updated the task description. (Show Details)Oct 15 2018, 9:42 AM

The next step will be to stop replication of pc1004 from pc2004 and then run the following in a screen on pc2004:

# rebuild every parsercache table in place to reclaim free space;
# sql_log_bin=0 keeps the ALTERs out of the binlog so they don't replicate
for TABLE in $(mysql --skip-ssl -BN -e "show tables" parsercache); do
    mysql --skip-ssl -e "SET SESSION sql_log_bin=0; ALTER TABLE parsercache.${TABLE} ENGINE=InnoDB"
    sleep 1;
done

After this we'll know exactly how much disk space we can reclaim, and whether it is worth doing or not.

Mentioned in SAL (#wikimedia-operations) [2018-10-16T08:56:38Z] <banyek> stopping pc2004 -> pc1004 replication (T206740)

On pc1004 the slave status before 'reset slave all':

                  Slave_IO_State:
                  Master_Host: pc2004.codfw.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: pc2004-bin.154391
          Read_Master_Log_Pos: 494855359
               Relay_Log_File: pc1004-relay-bin.154913
                Relay_Log_Pos: 113542
        Relay_Master_Log_File: pc2004-bin.154391
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 494855359
              Relay_Log_Space: 113884
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 180359338
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:

According to information_schema there will not be much data to reclaim:

MariaDB [(none)]> select SUM(DATA_LENGTH)/1024/1024/1024  from information_schema.tables where TABLE_SCHEMA='parsercache';
+---------------------------------+
| SUM(DATA_LENGTH)/1024/1024/1024 |
+---------------------------------+
|               1581.942367553711 |
+---------------------------------+
1 row in set (0.01 sec)

The data length is ~1.5T according to information_schema, and the free space is ~12G:

MariaDB [(none)]> select SUM(DATA_FREE)/1024/1024/1024  from information_schema.tables where TABLE_SCHEMA='parsercache';
+-------------------------------+
| SUM(DATA_FREE)/1024/1024/1024 |
+-------------------------------+
|               11.985351562500 |
+-------------------------------+
1 row in set (0.00 sec)

I verified that the DATA_LENGTH value seems normal:

root@pc2004:/srv/sqldata-cache/parsercache# ls -al /srv/sqldata-cache/parsercache/pc*.ibd | awk '{sum +=$5} END {print sum}'
1.75614e+12
root@pc2004:/srv/sqldata-cache/parsercache# du -sk
1714984316	.
root@pc2004:/srv/sqldata-cache/parsercache# du -sh
1.6T	.

Pick some of the biggest tables and try to alter them to see how much you get, and then we can probably extrapolate.
If it is not worth running on eqiad, that is good to know. I would suggest codfw still be truncated after your tests.

Luckily all the tables there are around 6.5 GB, so yeah, I guess picking one is good enough.
I'll do it now, then provide the data here.

They are expected to have more or less the same size.

The result is 5.9 GB, which is more than expected but still not too much (less than 10%); with the whole operation we could free up around 124 GB.

Considering we are at around 70% at eqiad, and should remain like that, I don't think it is worth the hassle and the risk of depooling hosts at this point. So probably just truncating codfw tables and leaving replication codfw -> eqiad disconnected is the way to go here.

OK. I'll truncate the tables on pc2004 today, pc2005 tomorrow, and pc2006 the day after tomorrow.
I'll also stop the binlog purgers in codfw to see the rate at which the binlogs take up disk space, so we can tell whether we can stop them in eqiad or not.

You can probably disconnect pc1005 and pc1006 today already, but keep executing the truncation without binlog, just in case.

Marostegui updated the task description. (Show Details)Oct 16 2018, 10:12 AM

ok, good idea

pc1005 replication before reset slave all:

Master_Host: pc2005.codfw.wmnet
Master_Log_File: pc2005-bin.153328
Exec_Master_Log_Pos: 31559437

Mentioned in SAL (#wikimedia-operations) [2018-10-16T10:37:11Z] <banyek> stopping pc2006 -> pc1006 replication (T206740)

Mentioned in SAL (#wikimedia-operations) [2018-10-16T10:37:21Z] <banyek> stopping pc2005 -> pc1005 replication (T206740)

pc1006 replication before reset slave all:

Master_Host: pc2006.codfw.wmnet
Master_Log_File: pc2006-bin.154452
Exec_Master_Log_Pos: 470004334

I'll start truncating the tables on pc2004 with:

# empty every parsercache table; sql_log_bin=0 keeps the TRUNCATEs
# out of the binlog so they don't replicate
for TABLE in $(mysql --skip-ssl -BN -e "show tables" parsercache); do
    mysql --skip-ssl -e "SET SESSION sql_log_bin=0; TRUNCATE TABLE parsercache.${TABLE}"
done
Banyek updated the task description. (Show Details)Oct 16 2018, 10:54 AM
Banyek updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2018-10-16T11:02:11Z] <banyek> truncating tables in parsecache@pc2004 (T206740)

Mentioned in SAL (#wikimedia-operations) [2018-10-16T11:32:56Z] <banyek> the binlog purging stopped on pc2004 (T206740)

Banyek updated the task description. (Show Details)Oct 16 2018, 11:33 AM
Banyek moved this task from Backlog to In progress on the User-Banyek board.Oct 16 2018, 1:21 PM
Krinkle edited projects, added MediaWiki-Cache; removed Wikimedia-Rdbms.
Krinkle moved this task from Untriaged to ParserCache on the MediaWiki-Cache board.

After cleaning up the parsercache data yesterday on pc2004, I checked the disk usage of the 'parsercache' database and the binlogs.
The database is now 176 GB and the binlogs were 155 GB - this is a lot - the disks in the PC hosts are around 2200 GB, and the not-yet-cleaned db is around 1600 GB, which means we have space for less than 4 days of binlogs.
On the other hand, the 'expire_logs_days' variable is set to 1, so we should be OK.
I think the best would be to keep the binlog cleaners running on the eqiad hosts, and keep freeing up space on the codfw hosts - after codfw is all right, next Monday we can stop the purgers safely on eqiad too.
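
As a sketch of how the binlog footprint can be checked on a host (assuming root MySQL access; the per-file sizes come from the File_size column of SHOW BINARY LOGS):

```shell
# Sum the File_size column of SHOW BINARY LOGS to get the total binlog footprint.
sudo mysql --skip-ssl -BN -e "SHOW BINARY LOGS" \
  | awk '{sum += $2} END {printf "%.1f GB\n", sum/1024/1024/1024}'
```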

Banyek updated the task description. (Show Details)Oct 17 2018, 7:51 AM

Keep in mind that the disk space is always constant on the pc hosts: https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=17&fullscreen&orgId=1&var-server=pc1005&var-network=eth0&from=1532045819383&to=1535494432566
If it grows then that is normally an indication of an issue.

eqiad is now at 70% (1.6TB used), which is the normal space it should have (check the above graph). That shouldn't grow much; how many old binlogs do we have there?

The 'parsercache' database takes around 1.5T of data on pc1004.
The binlogs are 7 GB on pc1004, but the binlogs are cleaned up every hour.
On the pc2004 host the binlogs take 155 GB now, and the expire_logs_days variable is set to 1, so if we stop the purger on pc1004, the same amount of logs will accumulate there. So let's say 1.5T + 0.2T = 1.7T on a 2.2T disk (I rounded up the numbers for safety).
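
The back-of-envelope capacity check above can be reproduced with a one-liner (values taken from the comment, rounded up):

```shell
# projected usage = (data + retained binlogs) / disk size
awk 'BEGIN { printf "%.0f%%\n", (1.5 + 0.2) / 2.2 * 100 }'
# → 77%
```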

That sounds about right to me, as 1.7TB would be around 77% of usage.
I suggest stopping the purgers on one host in eqiad, keeping an eye on it for the next 24h, and then we can evaluate. But I think we should be fine.

OK, I'll stop the purgers there too.

Mentioned in SAL (#wikimedia-operations) [2018-10-17T08:00:36Z] <banyek> truncating parsercache tables on pc2005 (T206740)

I double-checked: pc1005 is NOT replicating from pc2005 (show slave hosts is empty on pc2005, and show slave status is empty on pc1005), so I'll proceed with truncating the tables on pc2005.
The truncation will be done in a screen named 'truncate' with the following script:

# empty every parsercache table; sql_log_bin=0 keeps the TRUNCATEs
# out of the binlog so they don't replicate
for TABLE in $(mysql --skip-ssl -BN -e "show tables" parsercache); do
    mysql --skip-ssl -e "SET SESSION sql_log_bin=0; TRUNCATE TABLE parsercache.${TABLE}"
done

Mentioned in SAL (#wikimedia-operations) [2018-10-17T08:09:22Z] <banyek> stopping binlog purgers on the parsercache hosts (the binlogs will be kept for 24hrs) - T206740

Banyek updated the task description. (Show Details)Oct 17 2018, 8:11 AM

Mentioned in SAL (#wikimedia-operations) [2018-10-17T09:30:27Z] <banyek> truncating parsercache tables on pc2006 (T206740)

Banyek updated the task description. (Show Details)Oct 17 2018, 9:35 AM

All done: the parsercache tables are truncated on all the codfw hosts, and the disk usage of the binlogs seems to have normalized in eqiad.
The next step will be to close this task, but I'll keep it open until Friday EOD, then close it if nothing breaks.

Banyek lowered the priority of this task from High to Lowest.Oct 17 2018, 11:26 AM

As the ticket is finished, and just kept open for a few days, there is no need to keep it at high priority.

Banyek moved this task from In progress to On hold on the User-Banyek board.Oct 17 2018, 12:29 PM
Marostegui closed this task as Resolved.Oct 17 2018, 1:18 PM
Marostegui raised the priority of this task from Lowest to Normal.

Let's close it and if something breaks we can reopen.
Good job!