HDFS data usage pipeline change following deployment (Airflow has no access to hdfs.keytab)
Closed, Resolved (Public)

Description

The first task of my Airflow job was supposed to launch a simple Skein app using the hdfs.keytab (available on an-launcher).
Running on YARN with the hdfs.keytab, the Skein app would have been:

  • fetching the fsimage from a namenode with hdfs dfsadmin -fetchImage fsimage - result is ~10GB - super user needed
  • converting the fsimage to XML with hdfs oiv -i fsimage -o fsimage.xml -p XML - result is ~40GB - regular user OK
  • sending the fsimage.xml to hdfs:///wmf/data/raw/hdfs_xml_fsimage

The fetch and the XML conversion consume little CPU but a lot of disk space. Also, keeping the pipeline as independent as possible would have been nice.

Alas, the analytics Unix user on an-launcher can't sudo to the hdfs user and can't read the hdfs.keytab.

So what I propose now is:

  • tweaking the existing backup job defined in modules/profile/manifests/hadoop/backup/namenode.pp so that it sends the backup fsimage to HDFS (Monday only)
  • adding a sensor in the Airflow DAG to wait for this file
  • modifying the current Skein job Airflow task
    • to use regular analytics.keytab
    • and fetch the raw fsimage from hdfs, not the namenode directly
    • and proceed with the conversion
    • and remove the raw fsimage file on hdfs (Moved to: https://phabricator.wikimedia.org/T325103)
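
The proposed task can be sketched in shell as follows. This is only an illustration, not the merged refinery script: the function names, arguments, and retry values are hypothetical, and it assumes the caller already holds Kerberos credentials for the analytics user (for example by running under kerberos-run-command). Removal of the raw fsimage is left out because that step moved to T325103.

```shell
#!/bin/sh
# Sketch only: function names, arguments and retry values are hypothetical.

# Poll HDFS until the given path exists (a shell stand-in for the Airflow sensor).
# Usage: wait_for_hdfs_file <hdfs_path> <max_attempts> <sleep_seconds>
wait_for_hdfs_file() {
    path=$1
    attempts=$2
    pause=$3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # `hdfs dfs -test -e` exits 0 when the path exists.
        if hdfs dfs -test -e "$path"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    return 1
}

# Fetch the raw fsimage from HDFS, convert it to XML, and upload the XML.
# Each step runs only if the previous one succeeded.
# Usage: convert_raw_fsimage <raw_hdfs_path> <xml_hdfs_dir> <local_workdir>
convert_raw_fsimage() {
    raw=$1
    xml_dir=$2
    workdir=$3
    hdfs dfs -get "$raw" "$workdir/fsimage" &&
        hdfs oiv -i "$workdir/fsimage" -o "$workdir/fsimage.xml" -p XML &&
        hdfs dfs -put -f "$workdir/fsimage.xml" "$xml_dir/"
}
```

A caller would first run wait_for_hdfs_file on the raw fsimage path and then convert_raw_fsimage only if it returned 0, which mirrors the sensor-then-task ordering in the DAG.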

Event Timeline

Antoine_Quhen removed Antoine_Quhen as the assignee of this task.
Antoine_Quhen claimed this task.
Antoine_Quhen moved this task from Ready to In Progress on the Data Pipelines (Sprint 05-06) board.
Antoine_Quhen renamed this task from HDFS data usage pipeline change before production to HDFS data usage pipeline change following deployment (Airflow has no access to hdfs.keytab). Dec 9 2022, 2:42 PM

One question: can the analytics user change group ownership to analytics-admins? Other than this, all good! Thanks @Antoine_Quhen

The analytics Unix user can write a file in a folder owned by analytics:analytics-admins, and the newly written file keeps that exact ownership:

aqu@an-launcher1002:~$ hdfs dfs -ls /wmf/data/raw
drwxr-x---   - analytics analytics-admins                     0 2022-12-08 13:32 /wmf/data/raw/hdfs_xml_fsimage

aqu@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics hdfs dfs -touchz /wmf/data/raw/hdfs_xml_fsimage/test.xml

aqu@an-launcher1002:~$ hdfs dfs -ls /wmf/data/raw/hdfs_xml_fsimage
-rw-r-----   3 analytics analytics-admins           0 2022-12-09 18:16 /wmf/data/raw/hdfs_xml_fsimage/test.xml

OK

But it can't explicitly change the owner group of a file to analytics-admins:

aqu@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics hdfs dfs -touchz /tmp/aqu-test.xml

aqu@an-launcher1002:~$ hdfs dfs -ls /tmp/aqu-test.xml
-rw-r-----   3 analytics hdfs          0 2022-12-09 18:30 /tmp/aqu-test.xml

aqu@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics hdfs dfs -chown analytics:analytics-admins /tmp/aqu-test.xml
chown: changing ownership of '/tmp/aqu-test.xml': User analytics does not belong to analytics-admins

KO, but I don't need it.

Yet it can remove a file owned by analytics:analytics-admins:

aqu@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown analytics:analytics-admins /tmp/aqu-test.xml

aqu@an-launcher1002:~$ hdfs dfs -ls /tmp/aqu-test.xml
-rw-r-----   3 analytics analytics-admins          0 2022-12-09 18:32 /tmp/aqu-test.xml

aqu@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics hdfs dfs -rm /tmp/aqu-test.xml
22/12/09 18:33:55 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/tmp/aqu-test.xml' to trash at: hdfs://analytics-hadoop/user/analytics/.Trash/Current/tmp/aqu-test.xml1670610835692

OK and I need it.

So for my use case, it looks fine.

Change 866650 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] HDFS FSImage is backed up to HDFS on monday

https://gerrit.wikimedia.org/r/866650

Change 867185 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery@master] Extract to a namenode the creation of the raw FSImage

https://gerrit.wikimedia.org/r/867185

At review, we decided:

  • to compress all backups locally before sending them to HDFS. It saves some disk space.
  • to rename the FSImage files to include a timestamp, e.g. fsimage_2022-12-01.gz
  • to send the backups to hdfs:///wmf/data/raw/hdfs/fsimage
  • to put the fsimage XMLs into their own folder hdfs:///wmf/data/raw/hdfs/xml_fsimage
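
A minimal shell sketch of the compress-and-rename decisions above, assuming the backup job already has a local fsimage file and HDFS credentials. The function names and arguments are hypothetical; the actual Puppet-managed script may differ.

```shell
#!/bin/sh
# Sketch only: function names and arguments are hypothetical.

# Build the timestamped backup name from a YYYY-MM-DD date,
# e.g. fsimage_2022-12-01.gz
fsimage_backup_name() {
    printf 'fsimage_%s.gz\n' "$1"
}

# Compress the fsimage next to the source file, then send it to HDFS.
# Usage: backup_fsimage_to_hdfs <local_fsimage> <YYYY-MM-DD> <hdfs_dest_dir>
backup_fsimage_to_hdfs() {
    src=$1
    day=$2
    dest_dir=$3
    name=$(fsimage_backup_name "$day")
    tmp_dir=$(dirname "$src")
    # Compress locally first so less data is sent to and stored on HDFS.
    gzip -c "$src" > "$tmp_dir/$name" &&
        hdfs dfs -put -f "$tmp_dir/$name" "$dest_dir/$name"
}
```

It could be called as backup_fsimage_to_hdfs /path/to/fsimage "$(date +%Y-%m-%d)" hdfs:///wmf/data/raw/hdfs/fsimage, where /path/to/fsimage is a placeholder for the local backup file.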

Change 867185 merged by Aqu:

[analytics/refinery@master] Extract to a NameNode the creation of the raw FSImage

https://gerrit.wikimedia.org/r/867185

Change 866650 merged by Ottomata:

[operations/puppet@production] Backing up HDFS FSImage to HDFS on Monday morning

https://gerrit.wikimedia.org/r/866650

@Antoine_Quhen and @Ottomata - It looks like there is something not quite right about the systemd timer spec for this job.
Since merging the patch we have a failure of the timer on an-master1002.

btullis@an-master1002:~$ journalctl -u hadoop-namenode-backup-fetchimage.service |tail -n 1
Dec 15 09:29:41 an-master1002 systemd[1]: /lib/systemd/system/hadoop-namenode-backup-fetchimage.service:7: Failed to resolve unit specifiers in +%Y-%m-%d): Unknown error -57

Change 868397 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Fix systemd syntax in hadoop-namenode-backup-fetchimage

https://gerrit.wikimedia.org/r/868397

I have a fix. I didn't take systemd's constraints into account:

  • a literal % must be escaped by doubling it (%%)
  • commands must be given with their full path
  • systemd does not run the command in a bash shell
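
To illustrate the first point: systemd reads % as the start of a unit specifier, which is why the date format +%Y-%m-%d failed to resolve in the journal message above. Below is a small sketch of the escaping, using a hypothetical helper name; the actual unit file content is not reproduced here.

```shell
#!/bin/sh
# Sketch only: the helper name is hypothetical.

# Double every '%' so that systemd reads it as a literal percent sign
# instead of the start of a unit specifier.
escape_systemd_percent() {
    printf '%s\n' "$1" | sed 's/%/%%/g'
}
```

For example, escape_systemd_percent '/bin/date +%Y-%m-%d' prints /bin/date +%%Y-%%m-%%d, which is the form a unit file needs for a literal date format.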

Change 868397 merged by Ottomata:

[operations/puppet@production] Fix systemd syntax in hadoop-namenode-backup-fetchimage

https://gerrit.wikimedia.org/r/868397

Change 868753 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery@master] Fix typo in bash script used by HDFS usage pipeline

https://gerrit.wikimedia.org/r/868753

Change 869166 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Test to debug missing scripts on standby namenode

https://gerrit.wikimedia.org/r/869166

Change 868753 merged by Aqu:

[analytics/refinery@master] Fix typo in bash script used by HDFS usage pipeline

https://gerrit.wikimedia.org/r/868753

Change 869166 merged by Btullis:

[operations/puppet@production] Fix missing script in HDFS usage dataset pipeline

https://gerrit.wikimedia.org/r/869166