
Investigate Hadoop 3 container support with reference to Airflow deployment pipelines
Closed, Resolved, Public

Description

Update August 2025

I have written a design document relating to this: Docker Support for YARN.

Original description follows:

Hadoop 3 supports running Docker containers.

  • Can Airflow use it to launch jobs in Hadoop?

Event Timeline

Docker container support has been backported to the Hadoop version that we are using (2.10).

From scanning quickly through the docs, it seems this would require Docker to be installed on the NodeManagers.

Also

Docker Container Executor runs in non-secure mode of HDFS and YARN. It will not run in secure mode, and will exit if it detects secure mode.

https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html

Secure mode == kerberos.

So,

Can we do it?

I think not :/

OK, thanks for looking into this!

It won't work for our Hadoop YARN setup though, we'll still encounter the Kerberos barrier. Rootless docker MAY make it possible to run docker outside of Hadoop YARN though.

Perhaps Hadoop 3 can do this with Kerberos?

https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/data-operating-system/content/configure_yarn_for_running_docker_containers.html

Kerberos configurations are recommended for production
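
For reference, the Hadoop 3 mechanism in the linked guide runs Docker via the LinuxContainerExecutor (not the old, removed DockerContainerExecutor), which is the executor already used on secure/Kerberized clusters. A minimal yarn-site.xml sketch, with property names taken from the upstream docs but values only illustrative and untested against our setup:

```xml
<!-- yarn-site.xml: enable the Docker runtime on NodeManagers (sketch) -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.docker.privileged-containers.allowed</name>
  <value>false</value>
</property>
```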

I can't find any docs on how this would work. But it may be worth investigating given https://phabricator.wikimedia.org/T296543#7538145

Reopening for now to discuss more.

@Ottomata: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!

I'm going to claim this task because I'm investigating container support as part of a Hadoop 3 upgrade plan.

Since this ticket was originally created, the guidance around using the docker daemon in production has been clarified.

Specific advice has been given on which features to avoid if a use case for Docker is compelling enough to proceed. Specifically, these are: https://wikitech.wikimedia.org/wiki/Docker#However...

  • For networking, the general answer is always run with --net=host
  • Don't use docker volumes, but rather just bind mount from the host the directories you want.
  • Be extremely explicit in which image tag you use in your configuration management
  • Hook up docker to configuration management via the proper systemd unit files

Stripping down docker like this is...

...turning it essentially into an application execution engine for binaries that are bundled in OCI images.

...which is precisely what we need. We wouldn't want any of the messy networking or volume management.

I think that there is also a strong case to be made for using Docker to run the Hadoop binaries themselves, rather than installing them from Debian packages.
We struggle to keep our Bigtop packages in step with the operating system versions that are supported, and containers would make this considerably easier.

In terms of launching user workloads on YARN through containers, there are plenty of ways in which we would want to lock down the configuration.

I'll write this up.
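
For illustration, much of that lockdown lives in the NodeManager's container-executor.cfg [docker] section. The keys below come from the upstream docs; the values are placeholders to show the shape, not a proposal:

```
[docker]
  module.enabled=true
  docker.binary=/usr/bin/docker
  docker.privileged-containers.enabled=false
  docker.allowed.networks=host
  docker.trusted.registries=docker-registry.example.org
```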

BTullis renamed this task from SPIKE - Will Hadoop 3 container support help us for Airflow deployment pipelines? to Investigate Hadoop 3 container support with reference to Airflow deployment pipelines.Nov 11 2024, 10:55 AM
BTullis triaged this task as Medium priority.Nov 13 2024, 12:49 PM

I have written a design document for this: Docker Support for YARN.

It depends on the Hadoop 3.3.6 upgrade being completed first.

I will resolve this ticket, based on the fact that the research is done.
A design document has been created and is now shared for review and comments.