The Airflow job should:
- run weekly on Mondays.
- wait for the source data to be available:
  - the source folder has the form hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/mediainfo-json/YYYYMMDD
  - the source folder contains a file named _IMPORTED once the source data has been successfully imported into it
- run a Spark job that reads the source data and writes it to Hive
- the Spark job is packaged in the refinery-job.jar archive, which the Airflow job needs as a dependency
- the Spark job class is org.wikimedia.analytics.refinery.job.structureddata.jsonparse.JsonDumpConverter
- the main parameters of the job are the input folder, the output Hive table, and the snapshot (time partition) being created. The output Hive table will be structured_data.commons_entity and the snapshot will have the form YYYY-MM-DD. See the class for the detailed list of parameters :) A sketch of a possible DAG is given after this list.
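
As a rough illustration of how these requirements could map onto an Airflow 2 DAG, here is a minimal sketch using the stock WebHdfsSensor and SparkSubmitOperator providers. The jar location, connection ids, sensor timings, and the JsonDumpConverter argument names (--input_path, --output_table, --snapshot) are assumptions for illustration only; the real parameter names should be taken from the class itself, and the real DAG should follow the conventions of the surrounding DAG repository.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hdfs.sensors.web_hdfs import WebHdfsSensor
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Placeholder location of the refinery-job.jar dependency (assumption).
REFINERY_JOB_JAR = "hdfs://analytics-hadoop/path/to/refinery-job.jar"

with DAG(
    dag_id="commons_mediainfo_dumps_to_hive",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 * * 1",  # weekly, on Mondays
    catchup=False,
) as dag:

    # Wait for the _IMPORTED flag file in the weekly source folder.
    # {{ ds_nodash }} renders the run's logical date as YYYYMMDD.
    wait_for_dump = WebHdfsSensor(
        task_id="wait_for_mediainfo_dump",
        filepath=(
            "/wmf/data/raw/commons/dumps/mediainfo-json/"
            "{{ ds_nodash }}/_IMPORTED"
        ),
        webhdfs_conn_id="webhdfs_default",  # assumed connection id
        poke_interval=60 * 60,              # check hourly (arbitrary)
        timeout=60 * 60 * 24 * 2,           # give up after two days (arbitrary)
    )

    # Run the JsonDumpConverter Spark job from refinery-job.jar.
    # Argument names below are placeholders; see the class for the real ones.
    convert_dump = SparkSubmitOperator(
        task_id="convert_mediainfo_dump_to_hive",
        application=REFINERY_JOB_JAR,
        java_class=(
            "org.wikimedia.analytics.refinery.job."
            "structureddata.jsonparse.JsonDumpConverter"
        ),
        application_args=[
            "--input_path",
            "hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/"
            "mediainfo-json/{{ ds_nodash }}",
            "--output_table", "structured_data.commons_entity",
            "--snapshot", "{{ ds }}",  # snapshot as YYYY-MM-DD
        ],
        conn_id="spark_default",  # assumed connection id
    )

    wait_for_dump >> convert_dump
```

The sketch only shows the dependency chain "wait for _IMPORTED, then launch the converter"; scheduling the run on Mondays is handled by the cron expression, and the sensor/submit settings would be tuned to whatever the production environment expects.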