
Write an Airflow job converting commons structured data dump to Hive
Closed, Resolved · Public

Description

The Airflow job should:

  • run weekly on Mondays.
  • wait for the source data to be available:
    • the source folder is of the form hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/mediainfo-json/YYYYMMDD
    • the source folder contains a file named _IMPORTED once the source data has been successfully imported into it
  • run a Spark job that reads the source data and writes it to Hive:
    • the Spark job lives in the refinery-job.jar archive, which must be declared as a dependency of the Airflow job
    • the Spark job class is org.wikimedia.analytics.refinery.job.structureddata.jsonparse.JsonDumpConverter
    • the main parameters of the job are the input folder, the output Hive table, and the snapshot (time partition) being created. The output Hive table will be structured_data.commons_entity and the snapshot will be in the form YYYY-MM-DD. See the class for the detailed list of parameters :) A sketch of such a DAG is included after this list.
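Below is a minimal sketch of what such a DAG could look like, written with stock Airflow providers (WebHdfsSensor and SparkSubmitOperator). It is not the WMF airflow-dags implementation: the jar location, connection id, sensor timeouts, and the Spark argument flag names (--json_dump_path, --output_table, --snapshot) are assumptions for illustration only; the authoritative parameter list is in the JsonDumpConverter class.

```python
"""Sketch of a weekly DAG loading the Commons mediainfo JSON dump into Hive.

Assumptions (not confirmed by the task): jar path, connection id, and the
argument flag names passed to JsonDumpConverter.
"""
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.hdfs.sensors.web_hdfs import WebHdfsSensor
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Assumed location of the refinery-job jar; the deployed path may differ.
REFINERY_JOB_JAR = "hdfs:///wmf/refinery/current/artifacts/refinery-job.jar"

with DAG(
    dag_id="commons_entity_weekly",
    start_date=datetime(2022, 1, 3),
    schedule_interval="0 0 * * 1",  # weekly, on Mondays
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=30)},
) as dag:

    # Wait for the _IMPORTED flag file that marks a completed dump import.
    # {{ ds_nodash }} renders the logical date as YYYYMMDD.
    wait_for_dump = WebHdfsSensor(
        task_id="wait_for_mediainfo_dump",
        filepath=(
            "/wmf/data/raw/commons/dumps/mediainfo-json/{{ ds_nodash }}/_IMPORTED"
        ),
        webhdfs_conn_id="webhdfs_default",  # assumed connection id
        poke_interval=60 * 60,              # check hourly
        timeout=60 * 60 * 24 * 2,           # give up after two days
    )

    # Run the parser; the flag names below are illustrative — check
    # JsonDumpConverter for the real parameter names.
    convert_to_hive = SparkSubmitOperator(
        task_id="convert_mediainfo_json_to_hive",
        application=REFINERY_JOB_JAR,
        java_class=(
            "org.wikimedia.analytics.refinery.job.structureddata.jsonparse."
            "JsonDumpConverter"
        ),
        application_args=[
            "--json_dump_path",
            "hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/"
            "mediainfo-json/{{ ds_nodash }}",
            "--output_table", "structured_data.commons_entity",
            "--snapshot", "{{ ds }}",  # YYYY-MM-DD partition value
        ],
    )

    wait_for_dump >> convert_to_hive
```

The sensor-then-spark-submit shape mirrors the two requirements above: the DAG only spends cluster resources once the _IMPORTED marker confirms the dump is complete, and the snapshot partition is derived from the same logical date as the source folder.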

Event Timeline

Snwachukwu moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.
Snwachukwu moved this task from In Progress to Next Up on the Data-Engineering-Kanban board.
Snwachukwu moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.