Context: there are periodic rsync jobs on labstore1006/7 that pull analytics public data from stat1007 so that it can be published. The workflow is the following:
* labstore100X starts an rsync pull from stat1007 via cron
* the rsync module on stat1007 reads data from /mnt/hdfs, a fuse mountpoint for HDFS
* data is read from HDFS and returned to the fuse reader
* rsync copies the data from stat1007 to labstore100X
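The pull side of the workflow above can be sketched as follows. The module name, destination path, and rsync flags are illustrative assumptions, not the actual puppet config:

```python
import subprocess

def build_rsync_pull(module, dest, host="stat1007.eqiad.wmnet"):
    """Build the rsync command a labstore100X cron job would run to pull
    a module (backed by /mnt/hdfs via fuse) from stat1007.
    Module name and destination are illustrative, not the real ones."""
    return ["rsync", "-a", "--delete", f"{host}::{module}/", dest]

# Hypothetical invocation; the cron entry would run this periodically.
cmd = build_rsync_pull("hdfs-archive", "/srv/dumps/analytics/")
print(" ".join(cmd))
```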
There are some bottlenecks:
1) reading a big dataset through the HDFS fuse mountpoint may cause performance issues like T234160
2) the fuse mountpoint is brittle and not very performant
3) data makes two hops (HDFS to stat1007, then stat1007 to labstore100X), so transfers take ages
There is also a future issue to solve: once Kerberos is enabled, rsync on stat1007 will need to authenticate before reading data from the fuse mountpoint. Nothing really complicated, but it requires extra config and testing.
A possible solution could be to have the labstore nodes pull data directly from HDFS. There is no rsync-like command available, but the analytics team wrote [[ https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/hdfs.py#L218-L219 | one ]] that could be taken as a prototype/inspiration.
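A minimal sketch of what such a direct pull could look like on a labstore node, driving the `hdfs dfs` CLI. The HDFS and local paths are made up, and a real implementation would compare sizes/mtimes before copying, as the refinery helper linked above does:

```python
import subprocess

def build_hdfs_get(hdfs_src, local_dest):
    """Build an `hdfs dfs -get` command copying a path out of HDFS.
    -f overwrites an existing local copy, roughly mimicking rsync for
    changed files (no delta transfer though: it's a full copy each time)."""
    return ["hdfs", "dfs", "-get", "-f", hdfs_src, local_dest]

def pull_tree(hdfs_dir, local_dir):
    """Naive pull: list an HDFS dir (-C prints paths only) and copy
    each entry. A real rsync-like tool would skip unchanged files."""
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", "-C", hdfs_dir],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    for path in listing:
        subprocess.run(build_hdfs_get(path, local_dir), check=True)

# Hypothetical source/destination, for illustration only.
print(" ".join(build_hdfs_get("/wmf/data/archive/example", "/srv/published/")))
```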
Security-wise, to make this happen the labstore nodes will need to be able to pull data from Hadoop (so: install the Hadoop client packages, be whitelisted in ferm since they are outside the analytics network, etc..) and eventually they'll need to be kerberized (not a big deal, but mentioning it anyway).
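Once kerberized, the pull job would just need a non-interactive `kinit` from a keytab before touching HDFS. A sketch, where the principal and keytab path are made-up placeholders:

```python
import subprocess

def build_kinit(principal, keytab):
    """Build a non-interactive kinit from a keytab, as a kerberized cron
    job would run before any hdfs command. Principal and keytab path
    here are illustrative assumptions, not the real ones."""
    return ["kinit", "-k", "-t", keytab, principal]

cmd = build_kinit("analytics/labstore1006.wikimedia.org",
                  "/etc/security/keytabs/analytics.keytab")
print(" ".join(cmd))
```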