
Check data currently stored on thorium and drop what is not needed anymore
Closed, Resolved · Public

Description

In the parent task SRE is asking us what hardware we need for the replacement of thorium, since the host is out of warranty and needs to be refreshed. Thorium has a ton of disk space that we don't really use:

elukey@thorium:~$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
[..]
/dev/md0                       92G  3.3G   84G   4% /
[..]
/dev/mapper/thorium--vg-data  7.2T  1.2T  5.7T  17% /srv
[..]

The 1.2T used for /srv doesn't allow us to use misc hardware configurations (480GB SSD disks in RAID 1, for example), so I am wondering if all the data stored in there is needed:

elukey@thorium:/srv$ sudo du -hs *
727G	analytics.wikimedia.org
68G	backup_wikistats_1
8.0K	deployment
4.0K	log
16K	lost+found
3.5G	org.wikimedia.community-analytics
11G	published-rsynced
85M	src
68G	stats.wikimedia.org
156G	wikistats

The wikistats dir has only a backup dir inside it, so those 156G can probably be dropped? Same for backup_wikistats_1. The biggest dir is of course analytics.wikimedia.org, so reducing that one would be nice as well.

If we can't reduce the disk usage it is not a big deal, we can order more disks etc.; I just want to make sure that if we do, it's because we need it :)

Event Timeline

elukey triaged this task as High priority. Oct 20 2020, 7:12 AM
elukey created this task.
elukey created this object in space Restricted Space.
Restricted Application added a subscriber: Aklapper.
elukey mentioned this in Unknown Object (Task). Oct 20 2020, 7:12 AM
elukey shifted this object from the Restricted Space space to the S1 Public space. Oct 22 2020, 4:27 PM

To do: back up the archive directory to HDFS and delete it from this node.

@Milimetric the data to review (if you have time) is what's under https://analytics.wikimedia.org/published/datasets/archive/public-datasets/, especially:

root@thorium:/srv/analytics.wikimedia.org/published/datasets/archive/public-datasets# du -hs * | sort -h
[..]
57G	analytics
281G	enwiki
287G	all

Mentioned in SAL (#wikimedia-analytics) [2020-11-03T16:52:41Z] <elukey> mv /srv/analytics.wikimedia.org/published/datasets/archive/public-datasets to /srv/backup/public-datasets on thorium - T265971

This is the current status:

root@thorium:/srv# du -hs * | sort -h
4.0K    log
8.0K    deployment
16K     lost+found
85M     src
3.5G    org.wikimedia.community-analytics
11G     published-rsynced
68G     stats.wikimedia.org
95G     analytics.wikimedia.org
849G    backup

root@thorium:/srv# ls backup/
backup_wikistats_1  public-datasets  wikistats

If this is ok we can create a tarball and save the backup dir on HDFS. Previously Fran and I checked whether anything accesses public-datasets in httpd, and we found only bots.
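For reference, a sketch of the kind of httpd access-log check mentioned above (the log path and bot patterns are assumptions, adjust for the actual vhost config):

# Look for non-bot requests to public-datasets in the Apache access logs.
zgrep 'public-datasets' /var/log/apache2/*access*.log* \
  | grep -viE 'bot|crawler|spider' \
  | head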

I started a copy of /srv/backup to stat1004 via transfer.py, so we'll be able to push the backup to hdfs.

elukey@stat1004:~$ sudo du -hs  /srv/thorium_backup/backup/
849G	/srv/thorium_backup/backup/

Something weird happens on thorium: analytics.wikimedia.org keeps hard links to files that I moved to the backup, so we end up with something like:

elukey@thorium:/srv$ sudo du -hs *
721G	analytics.wikimedia.org  <===
224G	backup
8.0K	deployment
4.0K	log
16K	lost+found
3.5G	org.wikimedia.community-analytics
11G	published-rsynced
99M	src
68G	stats.wikimedia.org

elukey@thorium:/srv$ cd backup/
elukey@thorium:/srv/backup$ sudo du -hs *
68G	backup_wikistats_1
626G	public-datasets  <====
156G	wikistats

At first the above is super confusing, but I checked: backup and analytics.wikimedia.org are sharing hard links (see the sketch below the list). I simply moved public-datasets to /srv/backup, but the recurring script that manages symlinks for analytics.wikimedia.org seems to have restored the previous state. The copy of the data to stat1004 wasn't affected, so I guess we can simply drop the data after:

  1. checking the data on stat1004
  2. uploading to hdfs
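As mentioned above, the hard-link sharing can be confirmed by comparing inode numbers (a sketch, assuming GNU find):

# List files with more than one hard link, printing inode and path;
# the same inode showing up under both trees confirms the sharing.
sudo find /srv/backup /srv/analytics.wikimedia.org -type f -links +1 -printf '%i %p\n' \
  | sort -n | less
# Note: within a single du invocation, shared blocks are counted only once,
# under whichever directory is scanned first.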

tl;dr: did some vetting of the data and it seems fine. Some minor differences, but probably due to the way I measured them.


I ran tree on both thorium:/srv/backup and stat1004:/srv/thorium_backup/backup, and then diffed the results.
There were a couple of differences, all due to non-matching alphabetical ordering of paths around hyphens (-) and underscores (_), e.g.:

14188d14187
< │       │   ├── CategoryOverview_ZH_MIN_NAN_Complete.htm
14189a14189
> │       │   ├── CategoryOverview_ZH_MIN_NAN_Complete.htm

Maybe due to different versions of tree:

mforns@stat1004:~$ tree --version
tree v1.8.0 (c) 1996 - 2018
---
mforns@thorium:~$ tree --version
tree v1.7.0 (c) 1996 - 2014
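Ordering differences like this can come from locale collation as well as tree versions; a sketch for producing version- and locale-independent listings to diff:

# LC_ALL=C forces a byte-wise sort, so '-' vs '_' ordering is stable across hosts.
# On thorium:
(cd /srv/backup && sudo find . | LC_ALL=C sort) > /tmp/thorium_listing.txt
# On stat1004:
(cd /srv/thorium_backup/backup && sudo find . | LC_ALL=C sort) > /tmp/stat1004_listing.txt
# Copy both files to one host, then:
diff /tmp/thorium_listing.txt /tmp/stat1004_listing.txt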

I checked all paths, and all of them were present. However, the last line of the diff didn't match:

< 10292 directories, 239984 files
---
> 10291 directories, 239985 files

It seems that one path was interpreted as a directory on thorium, while on stat1004 it was interpreted as a file... but I couldn't find which one.
Strange, but it doesn't seem critical to me.
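One way to hunt down that path would be to diff directory-only listings, same approach as above (a sketch):

# Only directories; the extra entry on thorium should show up in the diff.
# On thorium:
(cd /srv/backup && sudo find . -type d | LC_ALL=C sort) > /tmp/thorium_dirs.txt
# On stat1004:
(cd /srv/thorium_backup/backup && sudo find . -type d | LC_ALL=C sort) > /tmp/stat1004_dirs.txt
diff /tmp/thorium_dirs.txt /tmp/stat1004_dirs.txt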


Did a similar thing with du -h and could only find 2 differences:

  1. The folder ./public-datasets/all/multimedia has 352K on thorium and 356K on stat1004. Couldn't find the cause, though.
  2. Lots of files on thorium show multiple hard links, while the files on stat1004 have just 1 hard link. @elukey, I guess this is expected?
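Re point 2, link counts can be inspected directly with stat (a sketch; SOME_FILE is a placeholder path). A count of 1 on stat1004 is plausible if the transfer copied file contents without preserving hard links:

# %h is the hard-link count in GNU stat.
stat -c '%h %n' /srv/backup/public-datasets/SOME_FILE                  # on thorium: > 1
stat -c '%h %n' /srv/thorium_backup/backup/public-datasets/SOME_FILE   # on stat1004: 1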

Backup on HDFS finally completed:

elukey@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -ls /wmf/data/archive/backup/misc/thorium
Found 3 items
drwxr-x---   - hdfs hadoop          0 2021-01-13 16:58 /wmf/data/archive/backup/misc/thorium/backup_wikistats_1
drwxr-x---   - hdfs hadoop          0 2021-01-14 05:49 /wmf/data/archive/backup/misc/thorium/public-datasets
drwxr-x---   - hdfs hadoop          0 2021-01-14 05:49 /wmf/data/archive/backup/misc/thorium/wikistats

elukey@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -du -h /wmf/data/archive/backup/misc/thorium
67.3 G   201.8 G  /wmf/data/archive/backup/misc/thorium/backup_wikistats_1
625.4 G  1.8 T    /wmf/data/archive/backup/misc/thorium/public-datasets
876.5 M  2.6 G    /wmf/data/archive/backup/misc/thorium/wikistats

@Milimetric if you have time I'd ask you to quickly check that the backup is ok and then we could drop the data saved from thorium. Lemme know :)

@JAllemandou moving the request to you due to the ops week :)

Would you check with me that /wmf/data/archive/backup/misc/thorium contains all the data on thorium:/srv/backup? If so, I'll then drop the backup directory and free a ton of space on thorium :)

Ok, as Joseph mentioned, my backup is missing something:

elukey@thorium:/srv/backup$ sudo du -hs *
68G	backup_wikistats_1
626G	public-datasets
156G	wikistats
elukey@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -du -h /wmf/data/archive/backup/misc/thorium
67.3 G   201.8 G  /wmf/data/archive/backup/misc/thorium/backup_wikistats_1
625.4 G  1.8 T    /wmf/data/archive/backup/misc/thorium/public-datasets
876.5 M  2.6 G    /wmf/data/archive/backup/misc/thorium/wikistats

The first thing that I did, since thorium does not have HDFS access, was to copy the files via transfer.py (run from cumin1001, SRE only) to stat1004:

elukey@stat1004:/srv/thorium_backup$ sudo du -hs *
68G	backup_wikistats_1
626G	public-datasets
156G	wikistats
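For reference, the transfer invocation looked roughly like this (a sketch; transfer.py lives on the cumin hosts and the exact options may differ):

# Run from cumin1001 (SRE only): copy /srv/backup from thorium to stat1004.
sudo transfer.py thorium.eqiad.wmnet:/srv/backup stat1004.eqiad.wmnet:/srv/thorium_backup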

Then I did an hdfs rsync onto HDFS, but I have probably missed some stuff :(
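Filling the gaps can also be done with the plain HDFS CLI from stat1004 (a sketch; -f overwrites existing partial files):

# From stat1004, after kinit; re-upload the directory that came up short.
hdfs dfs -put -f /srv/thorium_backup/backup/wikistats \
  /wmf/data/archive/backup/misc/thorium/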

Definitely better now:

elukey@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -du -h /wmf/data/archive/backup/misc/thorium
67.3 G   /wmf/data/archive/backup/misc/thorium/backup_wikistats_1
625.4 G  /wmf/data/archive/backup/misc/thorium/public-datasets
155.8 G  /wmf/data/archive/backup/misc/thorium/wikistats
elukey@thorium:/srv/backup$ sudo find -type f | wc -l
240006

elukey@an-launcher1002:~$ sudo -u hdfs find /mnt/hdfs/wmf/data/archive/backup/misc/thorium -type f | wc -l
240006
# Note: The backup public-datasets dir and the one in analytics.wikimedia.org are hardlinked, we'll need to drop both, this is why public-datasets shows up in both places
elukey@thorium:~$ sudo find  /srv/analytics.wikimedia.org/published/datasets/archive/public-datasets  -type f | wc -l
6823

elukey@an-launcher1002:~$ sudo -u hdfs find /mnt/hdfs/wmf/data/archive/backup/misc/thorium/public-datasets -type f | wc -l
6823

@Milimetric can we sync next week to double check and finally drop?

Validation was done file by file, checking names and sizes.
The data present on thorium under /srv/backup is now in HDFS under hdfs://analytics-hadoop/wmf/data/archive/backup/misc/thorium.
This task can be closed.

@JAllemandou in /srv/backup there was a hard link to /srv/analytics.wikimedia.org/published/datasets/archive/public-datasets, so when I deleted the public-datasets dir it didn't do much. The last step is to drop /srv/analytics.wikimedia.org/published/datasets/archive/public-datasets, which should contain the same files as hdfs://analytics-hadoop/wmf/data/archive/backup/misc/thorium/public-datasets. Ok if I do it, or do you prefer to re-check?
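To illustrate why deleting one of the two names didn't free space: with hard links, the blocks are only released when the last link is removed (a minimal local demo):

# Two directory entries, one inode: removing one name frees nothing.
echo data > a
ln a b            # link count of the inode is now 2
rm a              # blocks still referenced by b
stat -c '%h' b    # prints 1
rm b              # now the blocks are actually freed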

I finally figured out where the archive data comes from, namely stat1006:

elukey@stat1006:/srv/published/datasets/archive/public-datasets$ du -hs *
279G	all
57G	analytics
16K	commonswiki
13M	dewiki
281G	enwiki
324K	eswiki
62M	frwiki
304K	hewiki
44K	huwiki
336K	itwiki
4.0K	jawiki
8.0K	mediawikiwiki
292K	nlwiki
300K	plwiki
232K	ptwiki
56K	ptwikibooks
284K	ruwiki
288K	svwiki
1.9G	wikidatawiki
4.0K	zhwiki

We should drop:

  • /srv/published/datasets/archive/public-datasets/* on stat1006
  • /srv/published-rsynced/stat1006/datasets/archive/public-datasets on thorium

And then we are done. @JAllemandou if you have a moment to review and make sure that I am not crazy I'd be glad :)
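The final cleanup would then look like this (a sketch; the third removal is the hard-linked copy discussed above):

# On stat1006:
sudo rm -rf /srv/published/datasets/archive/public-datasets/*
# On thorium:
sudo rm -rf /srv/published-rsynced/stat1006/datasets/archive/public-datasets
sudo rm -rf /srv/analytics.wikimedia.org/published/datasets/archive/public-datasets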

Sorry to be so late looking at this. The newest of these files are 2.5 years old and, besides us, absolutely zero people know they exist (those who did have all left the Foundation). Still, I double checked that they're all safely backed up on HDFS; I think it's ok to delete.

milimetric@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -du -s -h /wmf/data/archive/backup/misc/thorium/public-datasets/*
229.0 M /wmf/data/archive/backup/misc/thorium/public-datasets/2015_01_clickstream_final.tsv.gz
286.2 G /wmf/data/archive/backup/misc/thorium/public-datasets/all
56.6 G  /wmf/data/archive/backup/misc/thorium/public-datasets/analytics
116     /wmf/data/archive/backup/misc/thorium/public-datasets/commonswiki
12.1 M  /wmf/data/archive/backup/misc/thorium/public-datasets/dewiki
280.4 G /wmf/data/archive/backup/misc/thorium/public-datasets/enwiki
272.2 K /wmf/data/archive/backup/misc/thorium/public-datasets/eswiki
61.3 M  /wmf/data/archive/backup/misc/thorium/public-datasets/frwiki
245.6 K /wmf/data/archive/backup/misc/thorium/public-datasets/hewiki
21.6 K  /wmf/data/archive/backup/misc/thorium/public-datasets/huwiki
277.3 K /wmf/data/archive/backup/misc/thorium/public-datasets/itwiki
0       /wmf/data/archive/backup/misc/thorium/public-datasets/jawiki
0       /wmf/data/archive/backup/misc/thorium/public-datasets/mediawikiwiki
237.5 K /wmf/data/archive/backup/misc/thorium/public-datasets/nlwiki
246.5 K /wmf/data/archive/backup/misc/thorium/public-datasets/plwiki
183.2 K /wmf/data/archive/backup/misc/thorium/public-datasets/ptwiki
36.3 K  /wmf/data/archive/backup/misc/thorium/public-datasets/ptwikibooks
9.8 K   /wmf/data/archive/backup/misc/thorium/public-datasets/readership
252.5 K /wmf/data/archive/backup/misc/thorium/public-datasets/ruwiki
32.0 K  /wmf/data/archive/backup/misc/thorium/public-datasets/search
236.0 K /wmf/data/archive/backup/misc/thorium/public-datasets/svwiki
1.9 G   /wmf/data/archive/backup/misc/thorium/public-datasets/wikidatawiki
0       /wmf/data/archive/backup/misc/thorium/public-datasets/zhwiki

(I checked zhwiki and the 4K is just the empty directory)
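(The 4K on thorium is just the directory's own block allocation; quick demo below. HDFS reports 0 because directories there don't consume block space.)

# An empty directory still occupies one 4K filesystem block locally.
mkdir /tmp/empty_demo && du -hs /tmp/empty_demo   # prints 4.0K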

Ok let's be bold: let's remove it from thorium, and if someone misses it, we'll know about it. It is backed up on HDFS in any case.

Ok to drop from everywhere in my opinion :P but definitely from everywhere that's not HDFS, including stat1006.

+1, let's drop it from the machines (we have the HDFS backup).

elukey@thorium:/srv$ sudo du -hs *
177G	analytics.wikimedia.org
8.0K	deployment
4.0K	log
16K	lost+found
3.5G	org.wikimedia.community-analytics
18G	published-rsynced
113M	src
68G	stats.wikimedia.org

Much better now :)