
Support importing a Parquet file into HDFS using wmfdata-python
Open, Low, Public

Description

Currently, wmfdata-python has a hive.load_csv function. It would be nice to extend this to importing Parquet files into a Hive-indexed HDFS table. This would also save the user from having to type the fieldspec manually, since, unlike a CSV, a Pandas dataframe is aware of its own field names and data types.
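As a rough illustration of the fieldspec point, here is a minimal sketch of deriving a field spec from column names and dtype names. The helper `field_spec_from_dtypes` and the `DTYPE_TO_HIVE` mapping are hypothetical and not part of wmfdata-python. The sketch assumes the field spec is a comma-separated list of `name type` pairs, as in a Hive CREATE TABLE statement, and that unknown dtypes can fall back to `string`.

```python
# Hypothetical mapping from Pandas dtype names to Hive column types.
# This is an illustrative assumption, not wmfdata-python's actual behavior.
DTYPE_TO_HIVE = {
    "int32": "int",
    "int64": "bigint",
    "float32": "float",
    "float64": "double",
    "bool": "boolean",
    "object": "string",
    "datetime64[ns]": "timestamp",
}


def field_spec_from_dtypes(dtypes):
    """Build a Hive-style field spec from a {column name: dtype name} mapping.

    Dtype names not found in DTYPE_TO_HIVE fall back to "string".
    Column order follows the mapping's insertion order.
    """
    fields = []
    for name, dtype in dtypes.items():
        hive_type = DTYPE_TO_HIVE.get(str(dtype), "string")
        fields.append(f"{name} {hive_type}")
    return ", ".join(fields)


# Example input, written by hand here so the sketch has no dependencies.
example_dtypes = {
    "page_id": "int64",
    "title": "object",
    "score": "float64",
    "is_bot": "bool",
}
example_spec = field_spec_from_dtypes(example_dtypes)
```

With a real dataframe `df`, the input mapping could be built as `{col: str(dtype) for col, dtype in df.dtypes.items()}`, so the user would not have to write the spec by hand.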

Event Timeline

nshahquinn-wmf created this task.
nshahquinn-wmf moved this task from Triage to Backlog on the Product-Analytics board.
nshahquinn-wmf renamed this task from Support importing a Pandas dataframe directly into HDFS to Support importing a Parquet file into HDFS using wmfdata-python. Feb 23 2021, 5:55 PM
nshahquinn-wmf updated the task description.

I've put up a draft pull request on GitHub. I still need to make some tweaks, so I haven't requested review yet.

I plan to finish the pull request in July, after I return from sabbatical. But if someone else wants to take over while I'm gone, that's fine with me!

nshahquinn-wmf lowered the priority of this task from Medium to Low. Sep 6 2022, 5:18 PM

The draft pull request is still there, but it seems unlikely that I'll be able to pick it back up in the near future.