Support importing Pandas datasets into the Data Lake using Wmfdata
Closed, Resolved (Public)

Description

Currently, Wmfdata has a hive.load_csv function, which is most notably used to upload the canonical datasets into the Data Lake.

However, it has some limitations (T327983, T355847), and in any case Hive is deprecated and should be removed from Wmfdata in the future (T384541).

We can build a better replacement using Spark's saveAsTable function. Rather than focusing on delimited files, we should simply build a function that uploads a Pandas data frame. Users can use Pandas directly to read in local files in any of the many formats it supports.
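A minimal sketch of what such a function could look like, assuming an existing Spark session is passed in. The name `upload_pandas_df` and its parameters are hypothetical and are not Wmfdata's actual API; `createDataFrame`, `write.mode`, and `saveAsTable` are standard PySpark calls.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported only for type hints, so the sketch loads without Spark installed
    import pandas as pd
    from pyspark.sql import SparkSession


def upload_pandas_df(
    spark: "SparkSession",
    df: "pd.DataFrame",
    table: str,
    mode: str = "errorifexists",
) -> None:
    """Hypothetical helper: write a Pandas data frame to a Data Lake table."""
    # Convert the local Pandas data frame into a distributed Spark data frame.
    spark_df = spark.createDataFrame(df)
    # saveAsTable creates the table; `mode` controls what happens if it already
    # exists ("errorifexists", "overwrite", "append", or "ignore").
    spark_df.write.mode(mode).saveAsTable(table)
```

A caller would first read a local file with Pandas, for example `upload_pandas_df(spark, pd.read_csv("data.tsv", sep="\t"), "my_db.my_table")`, where the file and table names are placeholders.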

Details

Related Changes in GitLab:
Title: Support uploading a Pandas data frame to the Data Lake
Reference: repos/data-engineering/wmfdata-python!67
Author: nshahquinn-wmf
Source Branch: work/spark_upload_squashed
Dest Branch: main

Event Timeline

nshahquinn-wmf triaged this task as Medium priority.
nshahquinn-wmf moved this task from Triage to Backlog on the Product-Analytics board.
nshahquinn-wmf renamed this task from Support importing a Pandas dataframe directly into HDFS to Support importing a Parquet file into HDFS using wmfdata-python. Feb 23 2021, 5:55 PM
nshahquinn-wmf updated the task description.

I've put up a draft pull request on GitHub. I still need to make some tweaks, so I haven't requested review yet.

I plan to finish the pull request in July, after I return from sabbatical. But if someone else wants to take over while I'm gone, that's fine with me!

nshahquinn-wmf lowered the priority of this task from Medium to Low. Sep 6 2022, 5:18 PM

The draft pull request is still there, but it seems unlikely that I'll be able to pick it back up in the near future.

nshahquinn-wmf edited projects, added Movement-Insights; removed Product-Analytics.
nshahquinn-wmf added a subscriber: fkaelin.

@fkaelin has put up MR 66 for this. I need to review it.

nshahquinn-wmf renamed this task from Support importing a Parquet file into HDFS using wmfdata-python to Support importing Parquet, delimited, and Pandas datasets into the Data Lake using Wmfdata. Aug 13 2025, 5:47 PM
nshahquinn-wmf updated the task description.
nshahquinn-wmf changed the task status from Open to In Progress. Aug 13 2025, 5:49 PM
nshahquinn-wmf raised the priority of this task from Low to Medium.

Increasing the priority since this blocks deprecating the Hive module (T384541).

nshahquinn-wmf renamed this task from Support importing Parquet, delimited, and Pandas datasets into the Data Lake using Wmfdata to Support importing Pandas datasets into the Data Lake using Wmfdata. Aug 15 2025, 6:59 PM
nshahquinn-wmf updated the task description.

I'm cutting out the part about loading Parquet and delimited files, because those functions ended up being nothing more than two-line wrappers around the appropriate Pandas read function and our Pandas-upload function. There's no need to increase the API surface area for that.