The first task of my Airflow job was supposed to launch a simple Skein app using the hdfs.keytab (available on an-launcher).
The resulting Skein app on YARN, using the hdfs.keytab, was:
- fetching the fsimage from a namenode with hdfs dfsadmin -fetchImage fsimage (result is ~10GB; super-user needed)
- converting the fsimage to XML with hdfs oiv -i fsimage -o fsimage.xml -p XML (result is ~40GB; a regular user is OK)
- sending the fsimage.xml to hdfs:///wmf/data/raw/hdfs_xml_fsimage
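For reference, here's a minimal sketch of how such a Skein app could be submitted from Python. Everything here is illustrative: the app name, queue, and resource sizes are assumptions, and only the three shell commands and paths come from the steps above.

```
import skein

# Illustrative spec: a single service running the three steps above.
spec = skein.ApplicationSpec(
    name="hdfs-fsimage-to-xml",  # hypothetical name
    queue="production",          # hypothetical queue
    services={
        "fsimage": skein.Service(
            resources=skein.Resources(memory="4 GiB", vcores=2),
            script=(
                "hdfs dfsadmin -fetchImage fsimage\n"
                "hdfs oiv -i fsimage -o fsimage.xml -p XML\n"
                "hdfs dfs -put fsimage.xml /wmf/data/raw/hdfs_xml_fsimage\n"
            ),
        )
    },
)

# Submitting as hdfs is exactly what requires the hdfs.keytab:
# dfsadmin -fetchImage only works for a super user.
client = skein.Client(principal="hdfs", keytab="hdfs.keytab")
app_id = client.submit(spec)
```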
The fetch and the XML conversion barely consume any CPU, but they do consume a lot of disk space. Also, having a pipeline as independent as possible would have been nice.
Alas, the analytics Unix user on an-launcher can't sudo as the hdfs user and can't read the hdfs.keytab.
So what I propose now is:
- tweaking the existing backup job defined in modules/profile/manifests/hadoop/backup/namenode.pp so that it also sends the backup fsimage to HDFS (Mondays only)
- adding a sensor in the Airflow DAG that looks for this file
- modifying the current Skein task in the Airflow DAG (see the sketch after this list):
  - to use the regular analytics.keytab
  - to fetch the raw fsimage from HDFS instead of from the namenode directly
  - to proceed with the conversion
  - to remove the raw fsimage file on HDFS (moved to: https://phabricator.wikimedia.org/T325103)
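Below is a minimal sketch of what the reworked DAG could look like, assuming a stock PythonSensor and PythonOperator (the real DAG would likely use our own sensor/operator helpers instead). The raw fsimage drop path, the DAG id, schedule, queue, and resource sizes are all hypothetical; only the analytics.keytab, the oiv conversion, and the target path come from the proposal above.

```
import subprocess

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor

# Hypothetical path where the tweaked backup job would drop the raw fsimage.
RAW_FSIMAGE = "/wmf/data/raw/fsimage/fsimage"


def fsimage_present() -> bool:
    # `hdfs dfs -test -e` exits 0 when the path exists.
    return subprocess.run(["hdfs", "dfs", "-test", "-e", RAW_FSIMAGE]).returncode == 0


def convert_fsimage() -> None:
    import skein

    spec = skein.ApplicationSpec(
        name="hdfs-fsimage-to-xml",
        queue="production",  # hypothetical queue
        services={
            "convert": skein.Service(
                resources=skein.Resources(memory="4 GiB", vcores=2),
                script=(
                    # Fetch the raw fsimage from HDFS instead of calling
                    # dfsadmin against the namenode, then convert and publish.
                    # Removing the raw file afterwards was split out to T325103.
                    f"hdfs dfs -get {RAW_FSIMAGE} fsimage\n"
                    "hdfs oiv -i fsimage -o fsimage.xml -p XML\n"
                    "hdfs dfs -put fsimage.xml /wmf/data/raw/hdfs_xml_fsimage\n"
                ),
            )
        },
    )
    # Regular analytics credentials are now enough: no dfsadmin call left.
    skein.Client(principal="analytics", keytab="analytics.keytab").submit(spec)


with DAG(
    dag_id="hdfs_fsimage_xml",
    start_date=pendulum.datetime(2022, 12, 1, tz="UTC"),
    schedule_interval="@weekly",  # the backup fsimage lands on Mondays
    catchup=False,
) as dag:
    wait = PythonSensor(
        task_id="wait_for_fsimage",
        python_callable=fsimage_present,
        poke_interval=600,  # poll every 10 minutes
    )
    convert = PythonOperator(
        task_id="convert_fsimage",
        python_callable=convert_fsimage,
    )
    wait >> convert
```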