Support importing a Parquet file into HDFS using wmfdata-python
Open, Low, Public

Assigned To

None

Authored By

	nshahquinn-wmf
	Jan 28 2021, 4:10 PM

Description

Currently, wmfdata-python has a hive.load_csv function. It would be nice to extend this to importing Parquet files into a Hive-indexed HDFS table. This would also save the user from having to type the field spec manually, since, unlike a CSV, a Pandas dataframe is aware of its own field names and data types.
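To illustrate the point about field names and types, here is a minimal sketch of how a Hive field spec could be derived automatically instead of typed by hand. The helper name hive_field_spec and the type mapping are hypothetical and not part of wmfdata-python; the input is assumed to be a mapping of column names to Pandas dtype names, which could be obtained from a dataframe with something like df.dtypes.astype(str).to_dict().

```python
# Hypothetical helper, not part of wmfdata-python.
# Assumed mapping from common Pandas dtype names to Hive column types.
PANDAS_TO_HIVE = {
    "int32": "int",
    "int64": "bigint",
    "float32": "float",
    "float64": "double",
    "bool": "boolean",
    "object": "string",
    "datetime64[ns]": "timestamp",
}


def hive_field_spec(dtypes):
    """Build a 'name type, name type' string from {column name: dtype name}."""
    fields = []
    for name, dtype in dtypes.items():
        # Fall back to string for any dtype the mapping does not cover.
        hive_type = PANDAS_TO_HIVE.get(str(dtype), "string")
        fields.append(f"{name} {hive_type}")
    return ", ".join(fields)


# Example input, as df.dtypes.astype(str).to_dict() might produce it.
example_dtypes = {"wiki": "object", "views": "int64", "ratio": "float64"}
print(hive_field_spec(example_dtypes))
```

Falling back to string for unknown dtypes is only one possible choice; a real implementation might prefer to raise an error so that unexpected types are not silently coerced.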

Related Objects

Mentioned In: T355847: Null fields in canonical data are uploaded as empty strings

Event Timeline

nshahquinn-wmf triaged this task as Medium priority. Jan 28 2021, 4:10 PM

nshahquinn-wmf created this task.

nshahquinn-wmf moved this task from Triage to Backlog on the Product-Analytics board.

nshahquinn-wmf renamed this task from "Support importing a Pandas dataframe directly into HDFS" to "Support importing a Parquet file into HDFS using wmfdata-python". Feb 23 2021, 5:55 PM

nshahquinn-wmf updated the task description.

nshahquinn-wmf moved this task from Backlog to Upcoming Quarter on the Product-Analytics board. Dec 7 2021, 10:43 PM

nshahquinn-wmf edited projects, added Product-Analytics (Kanban); removed Product-Analytics. Dec 9 2021, 11:36 PM

nshahquinn-wmf claimed this task. Dec 10 2021, 8:41 PM

nshahquinn-wmf moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

I've put up a draft pull request on GitHub. I still need to make some tweaks, so I haven't requested review yet.

nshahquinn-wmf moved this task from Doing to Next 2 weeks on the Product-Analytics (Kanban) board. Jan 11 2022, 9:14 PM

nshahquinn-wmf moved this task from Next 2 weeks to Needs Review on the Product-Analytics (Kanban) board. Jan 29 2022, 7:06 PM

Sorry, wrong task.

Adding Data-Engineering to all Wmfdata-Python tasks, as requested by Dan and Andrew.

EChetty moved this task from Incoming (new tickets) to Event Platform Backlog on the Data-Engineering board. Mar 24 2022, 3:05 PM

nshahquinn-wmf edited projects, added Product-Analytics; removed Product-Analytics (Kanban). Mar 25 2022, 11:35 PM

I plan to finish the pull request in July, after I return from sabbatical. But if someone else wants to take over while I'm gone, that's fine with me!

EChetty moved this task from Event Platform Backlog to WMF-Data on the Data-Engineering board. Mar 28 2022, 10:53 AM

kzimmerman moved this task from Upcoming Quarter to Current Quarter on the Product-Analytics board. Jul 5 2022, 8:24 PM

nshahquinn-wmf lowered the priority of this task from Medium to Low. Sep 6 2022, 5:18 PM

nshahquinn-wmf moved this task from Current Quarter to Upcoming Quarter on the Product-Analytics board.

nshahquinn-wmf moved this task from Upcoming Quarter to Backlog on the Product-Analytics board. Apr 12 2023, 12:14 AM

JArguello-WMF moved this task from WMF-Data to Data Products & Metrics on the Data-Engineering board. Jun 29 2023, 10:42 PM

The draft pull request is still there, but it seems unlikely that I'll be able to pick it back up in the near future.

lbowmaker moved this task from Data Products & Metrics to Icebox (not considered in current quarter) on the Data-Engineering board. Nov 10 2023, 2:42 PM

nshahquinn-wmf mentioned this in T355847: Null fields in canonical data are uploaded as empty strings. Jan 25 2024, 2:25 AM
