
implement script to move data from P&T data lake to FR Tech data lake
Open, Medium, Public

Description

Now that the FR Tech Airflow instance is set up, we need a way to move data between data lakes.

Related to T417213: Create FR Tech Airflow instance and T405360: Implement an Airflow operator for moving data from point A to B.

Details

Other Assignee
BTullis

Event Timeline

xcollazo added subscribers: BTullis, xcollazo.

Here is an implementation idea that @AStein-WMF, @amastilovic and I discussed that can potentially solve this use case:

Use case details:

  • FR Tech needs to transform the data on the DPE side first.
    • Proposed solution: FR Tech has an existing Airflow instance with an existing service user. They can develop and run Spark jobs using Spark SQL, Spark Scala, or PySpark. After the transform, data can land on our existing HDFS instance, likely in a temp folder owned by the FR Tech service user.
  • FR Tech will be moving small data (low GBs), not big data (>= 50GB). FR Tech wants to land the data on their S3-compatible object storage. Data will be arbitrary, but will typically be Parquet files.
    • Proposed solution: Elsewhere, we have used rclone to move data (T405360#11217710), and Ben has detailed on another thread how we may use rclone to connect to HDFS and sync to S3-compatible endpoints (T366248#11152410, T405360#11209549). Thus we propose the following:
      • Develop a simple extension of an Airflow KubernetesPodOperator, let's call it for now the HdfsToS3 operator. This operator will authenticate against HDFS with the same user as the Airflow instance. We will pass to this operator a) the HDFS path to copy, b) the S3-compatible bucket, and c) the coordinates to fetch the target S3-compatible credentials. The operator will then invoke rclone, passing these details down.
      • Options for the credentials to the S3-compatible target
        • Could be kept on HDFS, owned by the service user and locked down to 400 permissions. We use this pattern to access MariaDB replicas, for example.
        • Alternatively, we could use some more modern mechanism suggested by DPE SRE?
      • We will need to modify our k8s config to allow egress to the S3-compatible URL. It would be best if we did this in a way that makes it easy to add more S3-compatible targets later on. I am not sure of the details in this area.
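
To make the proposal above a bit more concrete, here is a minimal sketch of the part of a possible HdfsToS3 operator that turns the three inputs (HDFS path, bucket, credentials) into an rclone invocation. Nothing here is an agreed design: `S3Target`, `build_rclone_env`, `build_rclone_args`, and the remote names `src` and `dst` are hypothetical. It assumes rclone remotes are defined through `RCLONE_CONFIG_<REMOTE>_<OPTION>` environment variables and that the target can be addressed with rclone's generic `Other` S3 provider. Kerberos settings for the HDFS remote are left out on purpose, since those details are not settled in this task.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class S3Target:
    """Hypothetical description of an S3-compatible destination."""
    endpoint: str
    bucket: str
    access_key_id: str
    secret_access_key: str


def build_rclone_env(namenode: str, target: S3Target) -> Dict[str, str]:
    """Define two rclone remotes, "src" (HDFS) and "dst" (S3), via env vars.

    rclone reads RCLONE_CONFIG_<REMOTE>_<OPTION> variables, so no config
    file needs to be written inside the pod. Kerberos options for the HDFS
    remote are omitted here (assumption: they would be added as needed).
    """
    return {
        "RCLONE_CONFIG_SRC_TYPE": "hdfs",
        "RCLONE_CONFIG_SRC_NAMENODE": namenode,
        "RCLONE_CONFIG_DST_TYPE": "s3",
        "RCLONE_CONFIG_DST_PROVIDER": "Other",
        "RCLONE_CONFIG_DST_ENDPOINT": target.endpoint,
        "RCLONE_CONFIG_DST_ACCESS_KEY_ID": target.access_key_id,
        "RCLONE_CONFIG_DST_SECRET_ACCESS_KEY": target.secret_access_key,
    }


def build_rclone_args(hdfs_path: str, target: S3Target, prefix: str = "") -> List[str]:
    """Build the arguments for `rclone copy <hdfs path> <bucket/prefix>`."""
    src = "src:" + hdfs_path
    # Join bucket and optional key prefix, skipping the prefix when empty.
    dst = "dst:" + "/".join(p for p in (target.bucket, prefix.strip("/")) if p)
    return ["copy", src, dst]
```

A KubernetesPodOperator subclass could then pass the result of `build_rclone_args` as `arguments` (with `cmds=["rclone"]`) and the result of `build_rclone_env` as `env_vars`. Where the access key and secret come from (a 400-permission file on HDFS or another mechanism) stays outside this sketch.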

Note that this mechanism will only cover the use case of moving data from HDFS to S3-compatible targets, so it is much more constrained than T405360.


@AStein-WMF, @amastilovic did I miss something?

DPE SRE: Does the above idea sound reasonable? CC @BTullis.

This looks good to me! Thanks for writing it up! I assume the next step is for @BTullis to review and give his thoughts, but lmk if there's anything I can do to help!

Ahoelzl triaged this task as Medium priority.