dbstore1002 /srv filling up
Open, High · Public

Description

Hi,

dbstore1002 is slowly filling up:

root@dbstore1002:/srv/sqldata# df -hT /srv/
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   6.3T  5.6T  682G  90% /srv

It has been alerting over the last few months and I have been slowly increasing its volume, as there is still some space left:

root@dbstore1002:/srv/sqldata# pvs
  PV         VG   Fmt  Attr PSize PFree
  /dev/sda3  tank lvm2 a--  6.50t 245.64g
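
For reference, growing the volume is the usual LVM + XFS routine. A minimal sketch, assuming the tank-data logical volume and the /srv mount shown above (the actual increment is whatever the DBA picks at the time, e.g. +50G):

# extend the LV out of the remaining PV free space, then grow XFS online
lvextend -L +50G /dev/mapper/tank-data
xfs_growfs /srv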

However, while we still have some time, it would be nice to come up with a better solution (maybe drop old unused tables?).

elukey triaged this task as Normal priority. Jun 19 2017, 2:38 PM
elukey added a project: User-Elukey.
elukey raised the priority of this task from Normal to High. Jun 22 2017, 11:07 AM
elukey edited projects, added Analytics-Kanban; removed Analytics.
elukey added a subscriber: elukey. Jun 23 2017, 2:12 PM

IIUC, du -hsc _log_* should list the space consumption of the log database tables, which is around 1.3T. This slave will be a target of T156933, so the eventlogging_cleaner script will likely help a lot.
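
As a cross-check of the du numbers, a rough per-table breakdown can also be pulled from information_schema (a sketch only; the reported sizes will not match on-disk usage exactly because of compression and storage-engine differences):

mysql -e "SELECT table_name,
                 ROUND((data_length + index_length)/1024/1024/1024, 1) AS size_gb
          FROM information_schema.tables
          WHERE table_schema = 'log'
          ORDER BY (data_length + index_length) DESC
          LIMIT 20;"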

elukey claimed this task. Jul 7 2017, 1:45 PM

There is one table on dbstore1002 that holds ~370M rows (~170GB in size): _Edit_11448630_old. This table is not present on db104[67], so it might be a good candidate to drop. I sent an email to analytics-internal and the data analysts to get some feedback before proceeding.
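
A sketch of the kind of check and drop involved, assuming the table lives in the log schema and spelling out db104[67] as db1046/db1047:

# confirm the table really is absent on the other two hosts
mysql -h db1046 -e "SHOW TABLES IN log LIKE '\_Edit\_11448630\_old';"
mysql -h db1047 -e "SHOW TABLES IN log LIKE '\_Edit\_11448630\_old';"
# only after sign-off from the analysts:
mysql -e "DROP TABLE log._Edit_11448630_old;"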

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Jul 10 2017, 1:01 PM

Table dropped after a consultation with data analysts and the analytics team.

Awesome!
Once you're ready to start purging old rows, we can try to optimize a couple of tables and see what happens with the claimed disk space.

Mentioned in SAL (#wikimedia-operations) [2017-07-14T11:41:16Z] <marostegui> Add 50G to /srv/ on dbstore1002 - T168303

This host reached 93% disk usage today:

Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   6.3T  5.8T  477G  93% /srv

I have added 50G more to the volume to avoid it paging during the weekend:

root@dbstore1002:~# df -hT /srv/
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   6.4T  5.8T  536G  92% /srv

Let's run the purge script as soon as possible next week and optimize the tables to see how much we gain.

From the last crash this host had (T170308), it looks like there are two temporary files that were generated during the ALTER and never cleaned up, and they are eating up around 45GB in total:

root@dbstore1002:/srv/tmp# ls -lhS
total 44G
-rw-rw---- 1 mysql mysql  25G Jul 14 12:04 #sql_5e0b_0.MAI
-rw-rw---- 1 mysql mysql  21G Jul 14 12:04 #sql_5e0b_0.MAD

They are still in use, so I reckon we would need to restart MariaDB on this host for them to stop being held open, and then we could delete them:

root@dbstore1002:/srv/tmp# fuser "#sql_5e0b_0.MAI"
/srv/tmp/#sql_5e0b_0.MAI: 24075
root@dbstore1002:/srv/tmp# fuser "#sql_5e0b_0.MAD"
/srv/tmp/#sql_5e0b_0.MAD: 24075
root@dbstore1002:~# cat /srv/sqldata/dbstore1002.pid
24075

Actually, those are in use by:

      Id: 3616956
...
      db: log
 Command: Query
    Time: 6746
   State: Queried about 106270000 rows
    Info: select id,uuid,userAgent,count(*) as events from QuickSurveyInitiation_15278946 where timestamp like '201706%' group by userAgent,event_eventName,event_surveyCodeName,event_surveyInstanceToken,event_surveySessionToken having events > 1
Progress: 0.000

That query is responsible for the space shortage.
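
For the record, a sketch of how a thread like that can be inspected once its id is known (3616956 from the output above), and killed if it ever came to that:

mysql -e "SELECT id, time, state, info
          FROM information_schema.processlist
          WHERE id = 3616956\G"
# only if we decided the query had to go:
# mysql -e "KILL 3616956;"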

elukey moved this task from Ops Backlog to In Progress on the User-Elukey board. Jul 18 2017, 9:26 AM

Started the eventlogging cleaner script for 2014 data on dbstore1002; we'll run OPTIMIZE queries afterwards to see how much space we gain back from the log database.
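
The follow-up OPTIMIZE runs are plain table rebuilds, which is what actually hands the purged space back to the filesystem; a sketch using one of the tables optimized in the SAL entry below:

mysql log -e "OPTIMIZE TABLE EditConflict_8860941_15423246;"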

Mentioned in SAL (#wikimedia-operations) [2017-07-19T13:24:29Z] <marostegui> Optimize EditConflict_8860941_15423246 and Echo_7731316 on dbstore1002 - T168303

This is the graph for the last 30 days of disk usage:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=dbstore1002&var-network=bond0&from=1497879482633&to=1500471482634

It has consumed around 300G in the last month.
So far we have available:

root@dbstore1002:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   6.4T  5.9T  537G  92% /srv
root@dbstore1002:~# pvs
  PV         VG   Fmt  Attr PSize PFree
  /dev/sda3  tank lvm2 a--  6.50t 135.64g

That leaves only about two months of disk space, so maybe we really need to go and check for unused tables that can be removed, as a more aggressive way to get disk space back.
I have also been checking which InnoDB tables aren't compressed on dbstore1002. There are a few wikis without InnoDB compression there, so maybe we can compress those as well; we won't get a huge amount of space back, but at least something.

Mentioned in SAL (#wikimedia-operations) [2017-07-19T13:46:17Z] <marostegui> Compress database rowiki on dbstore1002 - T168303

The following wikis are not InnoDB compressed:

66G	viwiki
97G	ukwiki
32G	kowiki
82G	huwiki
43G	hewiki
48G	frwiktionary
69G	fawiki
131G	eswiki
52G	cawiki
145G	arwiki

So I will compress them.
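
A sketch of what the per-table compression amounts to, using viwiki as the example schema (this assumes innodb_file_per_table and the Barracuda file format; the real runs are done with the DBAs' own tooling):

# compress every uncompressed InnoDB table in the schema, one ALTER at a time
for t in $(mysql -NBe "SELECT table_name FROM information_schema.tables
                       WHERE table_schema = 'viwiki' AND engine = 'InnoDB'
                         AND row_format <> 'Compressed'"); do
    mysql viwiki -e "ALTER TABLE $t ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;"
done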

rowiki has gone from 45G to around 17G, so that is nice. I am going to compress all the wikis in T168303#3452667 now so we can get more space back, especially from the big ones.

Mentioned in SAL (#wikimedia-operations) [2017-07-19T16:16:18Z] <marostegui> Compressing innodb on dbstore1002 for the following wikis: viwiki ukwiki kowiki huwiki hewiki frwiktionary fawiki eswiki cawiki arwiki - T168303

If my calculations are correct, the top 20 tables of the log database on this store/slave weigh ~2.6TB, so even cutting their size in half would be a massive win for us. I just sent an email to our data analysts to get their feedback/opinion.

In the meantime the eventlogging cleaner script is still running on 2014 and older data, but since most of the biggest tables are whitelisted we might not get a huge amount of space back.

Tbayer added a subscriber: Tbayer. Jul 20 2017, 10:09 PM

The compression for those uncompressed wikis finished and we've got some more room to breathe now:

root@dbstore1002:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   6.4T  5.5T  970G  86% /srv
root@dbstore1002:~# pvs
  PV         VG   Fmt  Attr PSize PFree
  /dev/sda3  tank lvm2 a--  6.50t 135.64g

However, we _really_ need to keep auditing and deleting that old data and nuking unused tables where possible.
https://grafana-admin.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&from=1500101868124&to=1500706668124&var-server=dbstore1002&var-network=bond0

It is back to its usual pace of ~30G consumed per day:

[05:51:20] marostegui@dbstore1002:~$ df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   6.4T  5.5T  937G  86% /srv
elukey added a comment. Aug 1 2017, 3:53 PM

In https://phabricator.wikimedia.org/T170720 we are planning to move one big table (~500GB) to HDFS and drop it from mysql.

That would be great!!! :-)
Thanks

We are discussing moving ~1TB of data to HDFS in https://phabricator.wikimedia.org/T172322, which is more great news :)

elukey moved this task from In Progress to Stalled on the User-Elukey board. Aug 9 2017, 10:47 AM

I will be removing around 120G from m3 instance from dbstore1002 (T156758#3514245)

Done, and around 100G are back:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=dbstore1002&var-network=bond0&from=1502370489229&to=1502371282750

Current state:

root@dbstore1002:/home/marostegui# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   6.4T  5.6T  869G  87% /srv

And here we are again:

root@dbstore1002:/srv# df -hT /srv/
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   6.4T  5.9T  556G  92% /srv

Mentioned in SAL (#wikimedia-operations) [2017-08-29T08:06:37Z] <elukey> drop log.MobileWebUIClickTracking_10742159_15423246 from dbstore1002 to free space (table archived on HDFS) - T172322 T168303

Back to acceptable levels:

/dev/mapper/tank-data  6.4T  5.5T  946G  86% /srv
phuedx added a subscriber: phuedx. Wed, Aug 30, 2:04 PM

@Marostegui: In T172291#3564966, you reported that disk usage on dbstore1002 had increased to 91%. That increase took a little over 16 hours. Do we know what caused it?

At this point we believe it was caused by a temporary table being used by a query.
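
For future reference, a couple of quick checks to catch that pattern earlier (assuming tmpdir is /srv/tmp, as seen earlier in this task):

# big on-disk temporary tables show up as #sql_* files under tmpdir
ls -lhS /srv/tmp/ | head
# and the owning thread is usually sitting in a "tmp table" state
mysql -e "SHOW FULL PROCESSLIST;" | grep -i "tmp table"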