The first task of my Airflow job was supposed to launch a simple Skein app using the hdfs.keytab (available on an-launcher).
The resulting Skein app on YARN, using the hdfs.keytab, was:
- fetching the fsimage from a namenode with hdfs dfsadmin -fetchImage fsimage (result is ~10GB; super-user needed)
- converting the fsimage to XML with hdfs oiv -i fsimage -o fsimage.xml -p XML (result is ~40GB; a regular user is OK)
- sending the fsimage.xml to hdfs:///wmf/data/raw/hdfs_xml_fsimage
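For reference, here's a minimal sketch of how such a Skein app could be submitted from Python. Everything here is illustrative: the app name, queue, and resource sizes are assumptions, and only the three shell commands and paths come from the steps above.

```
import skein

# Illustrative spec: a single service running the three steps above.
spec = skein.ApplicationSpec(
    name="hdfs-fsimage-to-xml",  # hypothetical name
    queue="production",          # hypothetical queue
    services={
        "fsimage": skein.Service(
            resources=skein.Resources(memory="4 GiB", vcores=2),
            script=(
                "hdfs dfsadmin -fetchImage fsimage\n"
                "hdfs oiv -i fsimage -o fsimage.xml -p XML\n"
                "hdfs dfs -put fsimage.xml /wmf/data/raw/hdfs_xml_fsimage\n"
            ),
        )
    },
)

# Submitting as hdfs is exactly what requires the hdfs.keytab:
# dfsadmin -fetchImage only works for a super user.
client = skein.Client(principal="hdfs", keytab="hdfs.keytab")
app_id = client.submit(spec)
```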
The fetch and the XML conversion barely consume any CPU, but they do consume a lot of disk space. Also, having a pipeline as independent as possible would have been nice.
Alas, the analytics Unix user on an-launcher can't sudo as the hdfs user and can't read the hdfs.keytab.
So what I propose now is:
- tweaking the existing backup job defined in modules/profile/manifests/hadoop/backup/namenode.pp so that it also sends the backup fsimage to HDFS (Mondays only)
- adding a sensor in the Airflow DAG that looks for this file
- modifying the current Skein task in the Airflow DAG (see the sketch after this list):
  - to use the regular analytics.keytab
  - to fetch the raw fsimage from HDFS instead of from the namenode directly
  - to proceed with the conversion
  - to remove the raw fsimage file on HDFS (moved to: https://phabricator.wikimedia.org/T325103)
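Below is a minimal sketch of what the reworked DAG could look like, assuming a stock PythonSensor and PythonOperator (the real DAG would likely use our own sensor/operator helpers instead). The raw fsimage drop path, the DAG id, schedule, queue, and resource sizes are all hypothetical; only the analytics.keytab, the oiv conversion, and the target path come from the proposal above.

```
import subprocess

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor

# Hypothetical path where the tweaked backup job would drop the raw fsimage.
RAW_FSIMAGE = "/wmf/data/raw/fsimage/fsimage"


def fsimage_present() -> bool:
    # `hdfs dfs -test -e` exits 0 when the path exists.
    return subprocess.run(["hdfs", "dfs", "-test", "-e", RAW_FSIMAGE]).returncode == 0


def convert_fsimage() -> None:
    import skein

    spec = skein.ApplicationSpec(
        name="hdfs-fsimage-to-xml",
        queue="production",  # hypothetical queue
        services={
            "convert": skein.Service(
                resources=skein.Resources(memory="4 GiB", vcores=2),
                script=(
                    # Fetch the raw fsimage from HDFS instead of calling
                    # dfsadmin against the namenode, then convert and publish.
                    # Removing the raw file afterwards was split out to T325103.
                    f"hdfs dfs -get {RAW_FSIMAGE} fsimage\n"
                    "hdfs oiv -i fsimage -o fsimage.xml -p XML\n"
                    "hdfs dfs -put fsimage.xml /wmf/data/raw/hdfs_xml_fsimage\n"
                ),
            )
        },
    )
    # Regular analytics credentials are now enough: no dfsadmin call left.
    skein.Client(principal="analytics", keytab="analytics.keytab").submit(spec)


with DAG(
    dag_id="hdfs_fsimage_xml",
    start_date=pendulum.datetime(2022, 12, 1, tz="UTC"),
    schedule_interval="@weekly",  # the backup fsimage lands on Mondays
    catchup=False,
) as dag:
    wait = PythonSensor(
        task_id="wait_for_fsimage",
        python_callable=fsimage_present,
        poke_interval=600,  # poll every 10 minutes
    )
    convert = PythonOperator(
        task_id="convert_fsimage",
        python_callable=convert_fsimage,
    )
    wait >> convert
```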