
[Airflow] Migrate Oozie's mediawiki_history_load jobs to Airflow
Closed, Resolved · Public

Description

Migrate all mediawiki_history_load jobs to Airflow.

  • All jobs live under refinery/oozie/mediawiki/history/load.
  • They are responsible for repairing each of the mediawiki tables (adding the missing partition metadata) after they have been imported from MariaDB by Sqoop. Basically, they execute MSCK REPAIR TABLE <tablename> for each table.
  • The jobs also create a success file named _PARTITIONED inside each table's partition directory, so that subsequent jobs know when the Hive partition metadata is complete.
  • It is probably possible to create just one Airflow DAG that iterates over a list of datasets and creates the corresponding tasks. Likely, for each dataset: 1) a sensor, 2) the repair-table step (SparkSQLOperator?), and 3) a URLTouchOperator that generates the _PARTITIONED flag. See the sketch after this list.
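
A minimal sketch of the single-DAG idea, using only stock Airflow operators; in the actual migration the WMF-specific SparkSQLOperator and URLTouchOperator mentioned above would presumably replace the BashOperators. The table list, HDFS paths, and Hive database name below are illustrative assumptions, not the real configuration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.hdfs.sensors.web_hdfs import WebHdfsSensor

# Illustrative subset; the real job covers all sqooped mediawiki tables.
TABLES = ["archive", "change_tag", "logging", "page", "revision", "user"]

with DAG(
    dag_id="mediawiki_history_load",
    start_date=datetime(2022, 7, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    for table in TABLES:
        # Hypothetical layout: sqoop writes each snapshot under this path.
        path = f"/wmf/data/raw/mediawiki/tables/{table}/snapshot={{{{ ds }}}}"

        # 1) Sensor: wait for sqoop's import to land (its _SUCCESS flag).
        sense = WebHdfsSensor(
            task_id=f"wait_for_{table}",
            filepath=f"{path}/_SUCCESS",
        )

        # 2) Repair: register the new partition in the Hive metastore.
        repair = BashOperator(
            task_id=f"repair_{table}",
            bash_command=(
                f'spark2-sql -e "MSCK REPAIR TABLE wmf_raw.mediawiki_{table}"'
            ),
        )

        # 3) Flag: write _PARTITIONED so downstream jobs can proceed.
        touch = BashOperator(
            task_id=f"flag_{table}",
            bash_command=f"hdfs dfs -touchz {path}/_PARTITIONED",
        )

        sense >> repair >> touch
```

Keeping the per-table tasks independent means one late or failed Sqoop import only blocks that table's repair-and-flag chain, not the whole snapshot.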

Event Timeline

Change 808888 had a related patch set uploaded (by NOkafor; author: NOkafor):

[analytics/refinery@master] Added repair_partitions.hql file to hql path

https://gerrit.wikimedia.org/r/808888

Change 808888 merged by Mforns:

[analytics/refinery@master] Removed hive.mapred.mode = nonstrict; updated the usage from hive to spark2-sql; added repair_partitions.hql file to hql path

https://gerrit.wikimedia.org/r/808888