
parsercache used disk space increase
Closed, ResolvedPublic

Description

It looks like yesterday at around 14:23 our parsercache hosts started to experience an increase in used disk space.

https://grafana.wikimedia.org/render/dashboard-solo/file/server-board.json?refresh=1m&panelId=17&orgId=1&from=1539149866002&to=1539236266002&var-server=1005&var-network=eth0&width=1000&height=500&tz=UTC%2B02%3A00

After we settle back to normal operation:

  • The binlog purger should be stopped on pc1004**
  • The binlog purger should be stopped on pc1005**
  • The binlog purger should be stopped on pc1006**
  • The binlog purger should be stopped on pc2004**
  • The binlog purger should be stopped on pc2005**
  • The binlog purger should be stopped on pc2006**
  • The tables in 'parsercache' schema should be truncated on pc2004
  • The tables in 'parsercache' schema should be truncated on pc2005
  • The tables in 'parsercache' schema should be truncated on pc2006
  • max_binlog_size should be set back to 1048576000*
*     a lot of data was purged already, but InnoDB won't free up the space automatically
**   they're running inside a screen named 'purge' (screen -r purge)
*** or change that in the config too
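
For reference, restoring the runtime value (before also changing it back in the config, per the note above) could look like this minimal sketch, assuming root MySQL access on each pc host:

```shell
# Hedged sketch: restore max_binlog_size at runtime (1048576000 bytes = 1000 MB);
# the my.cnf value should be changed back as well so the setting survives a restart.
sudo mysql --skip-ssl -e "SET GLOBAL max_binlog_size = 1048576000"
# verify the new value
sudo mysql --skip-ssl -BN -e "SELECT @@GLOBAL.max_binlog_size"
```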

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 11 2018, 5:42 AM
Marostegui triaged this task as Normal priority.Oct 11 2018, 5:43 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2018-10-11T05:43:30Z] <marostegui> Purge binary logs on pc2005 due to disk space issues - T206740

Marostegui updated the task description. (Show Details)Oct 11 2018, 5:48 AM

Edited the task description to remove some graphs that could be misleading (they were server reboots due to upgrades)

Marostegui raised the priority of this task from Normal to Unbreak Now!.Oct 11 2018, 7:50 AM

This is increasing quite rapidly

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptOct 11 2018, 7:50 AM
Marostegui mentioned this in Unknown Object (Task).Oct 11 2018, 8:41 AM
Banyek added a subscriber: Banyek.Oct 11 2018, 9:19 AM

The binary logs were purged on pc1004, pc1005 and pc1006.
Also, max_binlog_size was set to 10M and the hosts now have this running in a screen:

# loop forever: every minute, drop binlogs older than one hour
while true; do
  echo "purging binlogs";
  sudo mysql --skip-ssl -e "PURGE BINARY LOGS BEFORE (NOW() - INTERVAL 1 HOUR)";
  sleep 60;
done
Banyek updated the task description. (Show Details)Oct 11 2018, 9:26 AM
Banyek updated the task description. (Show Details)
Banyek updated the task description. (Show Details)
akosiaris updated the task description. (Show Details)Oct 11 2018, 9:34 AM
Marostegui updated the task description. (Show Details)Oct 11 2018, 10:02 AM
Marostegui lowered the priority of this task from Unbreak Now! to High.Oct 11 2018, 2:23 PM

This is no longer "unbreak now"

pc1004
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc1004--vg-srv xfs   2.2T  1.6T  644G  71% /srv

pc1005
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc1005--vg-srv xfs   2.2T  1.5T  695G  69% /srv

pc1006
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc1006--vg-srv xfs   2.2T  1.5T  651G  71% /srv

We need to start purging binlogs in codfw I would say (eqiad caught up with codfw):

pc2004
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc2004--vg-srv xfs   2.2T  1.9T  332G  85% /srv

pc2005
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc2005--vg-srv xfs   2.2T  1.9T  334G  85% /srv

pc2006
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc2006--vg-srv xfs   2.2T  1.9T  299G  87% /srv

Forgetting the codfw -> eqiad replication was the most likely cause of overload on the application servers (and on External storage hosts).

I think we should track this (and the respective actionables) in a different task.

Agreed - we put it on top of this task just to avoid forgetting it, but we can move it to a different one.

The MediaWiki eqiad-appserver cluster gasping for air,

as each server wakes up from a four-week switchover dream,

to an ice-cold parser cache.

Restricted Application added a project: Performance-Team. · View Herald TranscriptOct 12 2018, 2:36 AM

The binlog purging on codfw was started yesterday (sorry I didn't log it here), and it has been running since; replication works, and the disk space is under control.

Marostegui added a comment.EditedOct 15 2018, 5:38 AM

@Banyek please double-check that the key purge has finished on mwmaint1002 and keep going with the rest of the pending things to do here.
After re-enabling the normal config options, you might want to check whether the table rebuild is worth it by doing it on codfw (make sure not to replicate the changes) and comparing before/after space, so we can evaluate whether it is worth the time or not (as we are in good shape now).

On checking the screen '91243.parsercache' on mwmaint1002 I can confirm that the key purge has finished - I'll proceed.

Banyek claimed this task.Oct 15 2018, 7:59 AM
Banyek added a project: User-Banyek.

Sounds good!
Also, as per T206740#4659202, let's create a separate task for the replication check addition, so we can just focus on the immediate triage in this task I would say.

Mentioned in SAL (#wikimedia-operations) [2018-10-15T09:41:35Z] <banyek> max_binlog_size is set back to 1048576000 on ParseCache hosts (T206740)

Banyek updated the task description. (Show Details)Oct 15 2018, 9:42 AM

The next step will be to stop replication of pc1004 from pc2004 and then run the following in a screen on pc2004:

# rebuild every parsercache table in place to reclaim free space;
# sql_log_bin=0 keeps the ALTERs out of the binlog so they don't replicate
for TABLE in $(mysql --skip-ssl -BN -e "show tables" parsercache); do
    mysql --skip-ssl -e "SET SESSION sql_log_bin=0; ALTER TABLE parsercache.${TABLE} ENGINE=InnoDB"
    sleep 1;
done

After this we'll know exactly how much disk space we can reclaim, and whether it is worth doing or not.

Mentioned in SAL (#wikimedia-operations) [2018-10-16T08:56:38Z] <banyek> stopping pc2004 -> pc1004 replication (T206740)

On pc1004 the slave status before 'reset slave all':

                  Slave_IO_State:
                  Master_Host: pc2004.codfw.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: pc2004-bin.154391
          Read_Master_Log_Pos: 494855359
               Relay_Log_File: pc1004-relay-bin.154913
                Relay_Log_Pos: 113542
        Relay_Master_Log_File: pc2004-bin.154391
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 494855359
              Relay_Log_Space: 113884
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 180359338
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:

According to information_schema there will not be much data to reclaim:

MariaDB [(none)]> select SUM(DATA_LENGTH)/1024/1024/1024  from information_schema.tables where TABLE_SCHEMA='parsercache';
+---------------------------------+
| SUM(DATA_LENGTH)/1024/1024/1024 |
+---------------------------------+
|               1581.942367553711 |
+---------------------------------+
1 row in set (0.01 sec)

The data length is ~1.5T according to information_schema, and the free space is ~12G:

MariaDB [(none)]> select SUM(DATA_FREE)/1024/1024/1024  from information_schema.tables where TABLE_SCHEMA='parsercache';
+-------------------------------+
| SUM(DATA_FREE)/1024/1024/1024 |
+-------------------------------+
|               11.985351562500 |
+-------------------------------+
1 row in set (0.00 sec)

I verified that the DATA_LENGTH value seems normal:

root@pc2004:/srv/sqldata-cache/parsercache# ls -al /srv/sqldata-cache/parsercache/pc*.ibd | awk '{sum +=$5} END {print sum}'
1.75614e+12
root@pc2004:/srv/sqldata-cache/parsercache# du -sk
1714984316	.
root@pc2004:/srv/sqldata-cache/parsercache# du -sh
1.6T	.

Pick some of the biggest tables and try to alter them to see how much you get, and then we can probably extrapolate.
If it is not worth running on eqiad, that is good to know. I would suggest codfw still be truncated after your tests.

Luckily all the tables there are around 6.5 GB, so yeah, I guess picking one is good enough.
I'll do it now, then provide the data here.

They are expected to have more or less the same size.

The result is 5.9 GB, which is more than expected but still not too much (less than 10%); with the whole operation we could free up around 124 GB.

Considering we are at around 70% at eqiad, and should remain like that, I don't think it is worth the hassle and the risk of depooling hosts at this point. So probably just truncating codfw tables and leaving replication codfw -> eqiad disconnected is the way to go here.

OK. I'll truncate the tables on pc2004 today, pc2005 tomorrow, and pc2006 the day after tomorrow.
I'll also stop the binlog purgers in codfw to see the rate at which the binlogs take up disk space, so we can tell whether we can stop them in eqiad or not.

You can probably disconnect pc1005 and pc1006 today already, but keep executing the truncation without binlog, just in case.

Marostegui updated the task description. (Show Details)Oct 16 2018, 10:12 AM

ok, good idea

pc1005 replication before reset slave all:

Master_Host: pc2005.codfw.wmnet
Master_Log_File: pc2005-bin.153328
Exec_Master_Log_Pos: 31559437

Mentioned in SAL (#wikimedia-operations) [2018-10-16T10:37:11Z] <banyek> stopping pc2006 -> pc1006 replication (T206740)

Mentioned in SAL (#wikimedia-operations) [2018-10-16T10:37:21Z] <banyek> stopping pc2005 -> pc1005 replication (T206740)

pc1006 replication before reset slave all:

Master_Host: pc2006.codfw.wmnet
Master_Log_File: pc2006-bin.154452
Exec_Master_Log_Pos: 470004334

I'll start truncating the tables on pc2004 with:

# empty every parsercache table; sql_log_bin=0 keeps the TRUNCATEs
# out of the binlog so they don't replicate
for TABLE in $(mysql --skip-ssl -BN -e "show tables" parsercache); do
    mysql --skip-ssl -e "SET SESSION sql_log_bin=0; TRUNCATE TABLE parsercache.${TABLE}"
done
Banyek updated the task description. (Show Details)Oct 16 2018, 10:54 AM
Banyek updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2018-10-16T11:02:11Z] <banyek> truncating tables in parsecache@pc2004 (T206740)

Mentioned in SAL (#wikimedia-operations) [2018-10-16T11:32:56Z] <banyek> the binlog purging stopped on pc2004 (T206740)

Banyek updated the task description. (Show Details)Oct 16 2018, 11:33 AM
Banyek moved this task from Backlog to In progress on the User-Banyek board.Oct 16 2018, 1:21 PM
Krinkle edited projects, added MediaWiki-Cache; removed Wikimedia-Rdbms.
Krinkle moved this task from Untriaged to ParserCache on the MediaWiki-Cache board.

After cleaning up the parsercache data yesterday on pc2004, I checked the disk usage of the 'parsercache' database and the binlogs.
The database is now 176 GB and the binlogs were 155 GB - this is a lot - the disks in the PC hosts are around 2200 GB, and the not-yet-cleaned db is around 1600 GB, which means we have space for less than 4 days of binlogs.
On the other hand, the 'expire_logs_days' variable is set to 1, so we should be OK.
I think the best would be to keep the binlog cleaners running on the eqiad hosts, and keep freeing up space on the codfw hosts - after codfw is all right, next Monday we can stop the purgers safely on eqiad too.
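
As a sketch of how the binlog footprint can be checked on a host (assuming root MySQL access; the per-file sizes come from the File_size column of SHOW BINARY LOGS):

```shell
# Sum the File_size column of SHOW BINARY LOGS to get the total binlog footprint.
sudo mysql --skip-ssl -BN -e "SHOW BINARY LOGS" \
  | awk '{sum += $2} END {printf "%.1f GB\n", sum/1024/1024/1024}'
```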

Banyek updated the task description. (Show Details)Oct 17 2018, 7:51 AM

Keep in mind that the disk space is always constant on the pc hosts: https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=17&fullscreen&orgId=1&var-server=pc1005&var-network=eth0&from=1532045819383&to=1535494432566
If it grows then that is normally an indication of an issue.

eqiad is now at 70% (1.6TB used), which is the normal space it should have (check the above graph). That shouldn't grow much; how many old binlogs do we have there?

The 'parsercache' database takes around 1.5T of data on pc1004.
The binlogs are 7 GB on pc1004, but the binlogs are cleaned up every hour.
On the pc2004 host the binlogs take 155 GB now, and the expire_logs_days variable is set to 1, so if we stop the purger on pc1004, the same amount of logs will accumulate there. So let's say 1.5T + 0.2T = 1.7T on a 2.2T disk (I rounded up the numbers for safety).
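
The back-of-envelope capacity check above can be reproduced with a one-liner (values taken from the comment, rounded up):

```shell
# projected usage = (data + retained binlogs) / disk size
awk 'BEGIN { printf "%.0f%%\n", (1.5 + 0.2) / 2.2 * 100 }'
# → 77%
```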

That sounds about right to me, as 1.7TB would be around 77% of usage.
I suggest stopping the purgers on one host in eqiad, keeping an eye on it for the next 24h, and then we can evaluate. But I think we should be fine.

OK, I'll stop the purgers there too.

Mentioned in SAL (#wikimedia-operations) [2018-10-17T08:00:36Z] <banyek> truncating parsercache tables on pc2005 (T206740)

I double-checked: pc1005 is NOT replicating from pc2005 (show slave hosts is empty on pc2005, and show slave status is empty on pc1005), so I'll proceed with truncating the tables on pc2005.
The truncation will be done in a screen named 'truncate' with the following script:

# empty every parsercache table; sql_log_bin=0 keeps the TRUNCATEs
# out of the binlog so they don't replicate
for TABLE in $(mysql --skip-ssl -BN -e "show tables" parsercache); do
    mysql --skip-ssl -e "SET SESSION sql_log_bin=0; TRUNCATE TABLE parsercache.${TABLE}"
done

Mentioned in SAL (#wikimedia-operations) [2018-10-17T08:09:22Z] <banyek> stopping binlog purgers on the parsercache hosts (the binlogs will be kept for 24hrs) - T206740

Banyek updated the task description. (Show Details)Oct 17 2018, 8:11 AM

Mentioned in SAL (#wikimedia-operations) [2018-10-17T09:30:27Z] <banyek> truncating parsercache tables on pc2006 (T206740)

Banyek updated the task description. (Show Details)Oct 17 2018, 9:35 AM

All done: the parsercache tables are truncated on all the codfw hosts, and the disk usage of the binlogs seems to have normalized in eqiad.
The next step will be to close this task, but I'll keep it open until Friday EOD, then close it if nothing breaks.

Banyek lowered the priority of this task from High to Lowest.Oct 17 2018, 11:26 AM

As the ticket is finished, and just kept open for a few days, there is no need to keep it at high priority.

Banyek moved this task from In progress to On hold on the User-Banyek board.Oct 17 2018, 12:29 PM
Marostegui closed this task as Resolved.Oct 17 2018, 1:18 PM
Marostegui raised the priority of this task from Lowest to Normal.

Let's close it and if something breaks we can reopen.
Good job!