In this task, we want to implement a new Airflow operator for moving data from point A to B.
The first use case will be to move data from HDFS to the clouddumps servers.
Something like:
publish = PublishData(
    source="hdfs://wmf/data/archives/file_export",
    targets=["sftp://clouddumps1001", "sftp://clouddumps1002"],
)
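For illustration, here is a minimal sketch of what such an operator could look like, assuming it delegates the actual transfer to an external tool (rclone is used here purely as an example; see the discussion of options in the context below). The rclone_binary parameter and the mapping of hdfs:// and sftp:// URIs onto the tool's own remote syntax are assumptions, not part of the proposal:

import shlex
import subprocess

from airflow.models.baseoperator import BaseOperator


class PublishData(BaseOperator):
    """Publish a source path to one or more target locations.

    Sketch only: the transfer is delegated to `rclone sync` here, but the
    same shape would work for hdfs-rsync or parallel-rsync.
    """

    template_fields = ("source", "targets")

    def __init__(self, *, source: str, targets: list[str], rclone_binary: str = "rclone", **kwargs):
        super().__init__(**kwargs)
        self.source = source
        self.targets = targets
        self.rclone_binary = rclone_binary

    def execute(self, context):
        # Translating the hdfs:// and sftp:// URIs into the chosen tool's own
        # remote syntax (e.g. rclone's remote:path) is left open here.
        for target in self.targets:
            cmd = [self.rclone_binary, "sync", self.source, target]
            self.log.info("Publishing: %s", shlex.join(cmd))
            # Fail the whole task if any single sync fails.
            subprocess.run(cmd, check=True)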
Context:
In T366248#11152410, @BTullis wrote:

In T366248#11145000, @xcollazo wrote:
We need to add the step that syncs from HDFS to the public servers. Not sure how that should be done.
In HDFS, we have the folder /wmf/data/archive where you can move your files. Let's say you move them to /wmf/data/archive/cirrus-search-index/{date}/blah.
Then you can set up an hdfs_tools::hdfs_rsync_job in Puppet to rsync from that HDFS path to the clouddumps* nodes that serve the dumps (examples here).
I'd say that there are some other options to be considered, too. That Puppet-based mechanism that calls hdfs-rsync will work, but it's maybe a bit of a legacy way to do it.
When we migrated the dumps v1 to Airflow recently, we needed to find a way to publish from the CephFS mount point /mnt/dumpsdata to the clouddumps hosts. We created a sync-utils container image, and we then add specific tasks to our DAGs that are responsible for publishing the files created.
In the case of the dumps, we found that the best option was to use parallel-rsync and specify both clouddumps1001 and clouddumps1002 as the targets.
This allows us to have one task that either successfully publishes to both target locations, or it fails.
So, for example, if you look at the current cirrussearch dumps, you will see that they have a sync_cirrussearch_dumps task, which calls parallel-rsync with these custom arguments.
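For illustration only, such a task might look roughly like the one below; the hosts, paths, and flags are placeholders, not the real arguments of the production sync_cirrussearch_dumps task:

from airflow.operators.bash import BashOperator

sync_example_dump = BashOperator(
    task_id="sync_example_dump",
    bash_command=(
        "parallel-rsync "
        "-H 'clouddumps1001 clouddumps1002' "  # publish to both targets in a single task
        "-a "                                  # pass rsync archive mode through
        "/mnt/dumpsdata/example_dump/ "        # hypothetical source on the CephFS mount
        "/srv/dumps/example_dump/"             # hypothetical destination on the clouddumps hosts
    ),
)

In practice this presumably runs with the sync-utils image mentioned above rather than directly on an Airflow worker, but the shape of the command is the same.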
There are many options around how you would schedule and trigger these publishing tasks, so they don't all need to be sequential like the example shown here.
However, for this requirement it would be a little different, because the source files are presumably going to be created on HDFS, rather than on CephFS.
This means that we won't be able to have the source directory mounted as a locally available file system. I think that we have at least a few ways that we could tackle this, though.
One that occurs to me immediately is that we could use rclone instead of parallel-rsync. rclone already has an hdfs remote capability built in, so we could use this for one side of the connection and an sftp remote on the other side.
We would be able to give it access to the Kerberos credential cache and Hadoop configuration files for the HDFS connection, and the SSH private key for the SFTP connection. Then we could just execute an rclone sync command and supply the source and destination paths.
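Sketched as an Airflow task, that could look something like the following; the remote names, destination path, and the rclone.conf keys shown in the comment are assumptions rather than a tested configuration:

from airflow.operators.bash import BashOperator

# Assumes two remotes defined in rclone.conf, roughly:
#
#   [hdfs]
#   type = hdfs
#   namenode = <namenode host>:8020
#   service_principal_name = hdfs/<namenode host>   # Kerberos; the ticket comes from the credential cache
#
#   [clouddumps1001]
#   type = sftp
#   host = clouddumps1001
#   user = <publish user>
#   key_file = <path to the SSH private key>
#
publish_to_clouddumps1001 = BashOperator(
    task_id="publish_to_clouddumps1001",
    bash_command=(
        "rclone sync "
        "hdfs:/wmf/data/archive/cirrus-search-index/{{ ds }}/ "    # source on HDFS, templated by date
        "clouddumps1001:/srv/dumps/cirrus-search-index/{{ ds }}/"  # hypothetical destination path
    ),
)

Unlike the parallel-rsync approach, each sftp remote points at a single host, so publishing to both clouddumps hosts would mean either two such tasks or a small loop over the targets.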
@xcollazo - I can see this being a good option for T384381: Airflow jobs to do monthly XML dumps as well.
