As part of setting up labstore1004/5 as an HA setup for labstore, we need to sync over the data for tools-project from labstore1001. The data is available in the NFS share mounted on /srv/project/tools.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Bstorm | T126083 overhaul labstore setup [tracking]
Resolved | | madhuvishy | T144255 Sync data for tools-project from labstore1001 to labstore1004/5
Event Timeline
We have had some false starts, with rsync choking on huge numbers of files in certain directories and then also on a few seemingly corrupt files. We could run this with a flag to ignore read errors, but that seems fraught with peril.
I decided to wipe out tools data on the active node labstore1004 and restart in a consistent fashion working through anomalies.
Command in use:
rsync --rsh 'ssh -i /root/.ssh/id_labstore' \
    --quiet \
    --archive \
    --compress \
    --progress \
    --human-readable \
    --hard-links \
    --delete-during \
    --force \
    --max-size=10G \
    --bwlimit=250000 \
    --exclude-from=/root/rsync_tools_exclude.txt \
    /srv/backup-tools/* \
    root@labstore1004.eqiad.wmnet:/srv/tools/shared/tools/
where /root/rsync_tools_exclude.txt is:
project/ifttt/www/python/src/cache/*
Currently excluding files larger than 10G, which can be listed with:
find /srv/project/tools/ -type f -size +10G
https://phabricator.wikimedia.org/P4320
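The grep further down reads a list from /tmp/gt10g with one "SIZE PATH" entry per line. A list in that shape could be produced roughly as follows. This is only a sketch and not necessarily the command that was actually run: the function name `list_large_files` and its threshold argument are hypothetical, and it assumes GNU find, du and sort.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under DIR that are larger than
# THRESHOLD, as "SIZE PATH" lines sorted by size (smallest first).
list_large_files() {
    dir=$1
    threshold=${2:-10G}
    find "$dir" -type f -size "+$threshold" -exec du -h {} + | sort -h
}

# Example (the output path is an assumption, chosen to match the grep):
# list_large_files /srv/project/tools/ > /tmp/gt10g
```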
And of these some look disposable:
grep -e '\.log' -e '\.error' -e '\.err' -e 'debug\.txt' -e '\.out$' /tmp/gt10g
11G /srv/project/tools/project/merlbot/AuszeichnungsKategorieFehlt_weekly.out
11G /srv/project/tools/project/osm4wiki/error.log
12G /srv/project/tools/project/wiwosm/access.log
14G /srv/project/tools/project/kenrick95bot/kenrick95bot-welcome.err
15G /srv/project/tools/project/rubinbot2/refs.err
15G /srv/project/tools/project/shuaib-bot/zumranew.err
19G /srv/project/tools/project/geohack/access.log
19G /srv/project/tools/project/ifttt/www/python/src/ifttt.log
25G /srv/project/tools/project/whymbot/enwikt.err
35G /srv/project/tools/project/wikivoyage/access.log
38G /srv/project/tools/project/persondata/templatedata/debug.txt
42G /srv/project/tools/project/ifttt/uwsgi.log
42G /srv/project/tools/project/osm/access.log
46G /srv/project/tools/project/wiwosm/error.log
@madhuvishy thoughts on truncating the disposable >10G files and kicking off an rsync update over the weekend, with the three largest files excluded for now:
48G /srv/project/tools/project/splinetools/dumps/enwiki-20141106-pages-articles.xml
52G /srv/project/tools/project/toolserver-home-archive/archive-2014-06-05.tar.xz
75G /srv/project/tools/project/oar/repository_text_2014-06-13.tar.gz
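The truncation proposed above could be scripted along these lines. This is a minimal sketch under assumptions, not a record of what was run: `truncate_listed` is a hypothetical helper that expects "SIZE PATH" lines on stdin, in the same shape as the listing of disposable files. It empties each file in place rather than deleting it, so ownership and permissions are kept and any process holding the file open keeps writing to the same inode.

```shell
#!/bin/sh
# Hypothetical helper: read "SIZE PATH" lines on stdin and empty each listed
# file in place, keeping the file itself (owner, mode, inode) intact.
truncate_listed() {
    while read -r _size path; do
        [ -f "$path" ] || continue   # skip blank lines and missing files
        : > "$path"                  # truncate to zero bytes
    done
}

# Example, feeding it the filtered list (list path as used in this task):
# grep -e '\.log' -e '\.err' -e 'debug\.txt' -e '\.out$' /tmp/gt10g | truncate_listed
```

Reviewing the filtered list by eye before piping it into the helper is advisable, since the patterns are broad and truncation cannot be undone.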
Started another sync now after truncating the >10G error/access log files from the above comment. New command (no >10G exclusion):
rsync --rsh 'ssh -i /root/.ssh/id_labstore' \
    --quiet \
    --archive \
    --compress \
    --progress \
    --human-readable \
    --hard-links \
    --delete-during \
    --force \
    --bwlimit=250000 \
    --exclude-from=/root/rsync_tools_exclude.txt \
    /srv/backup-tools/* \
    root@labstore1004.eqiad.wmnet:/srv/tools/shared/tools/
rsync_tools_exclude.txt is now:
project/ifttt/www/python/src/cache/*
project/splinetools/dumps/enwiki-20141106-pages-articles.xml
project/toolserver-home-archive/archive-2014-06-05.tar.xz
project/oar/repository_text_2014-06-13.tar.gz
This was done on Sunday so that a sync had completed within 24 hours of the main maintenance window for Tools. The sync during the actual outage period took around 5h for the last batch of data.
Final sync options:
rsync --rsh 'ssh -C -i /root/.ssh/id_labstore' \
    --archive \
    --progress \
    --quiet \
    --human-readable \
    --hard-links \
    --delete-during \
    --inplace \
    --safe-links \
    --executability \
    --timeout=30 \
    --force \
    --bwlimit=250000 \
    --exclude-from=/root/rsync_tools_exclude.txt \
    /srv/backup-tools/* \
    root@labstore1004.eqiad.wmnet:/srv/tools/shared/tools/