
Upgrade Hadoop to version 3.3.6 and Hive to version 4.0.1
Open, High, Public

Description

We are still using Hadoop version 2.10.2 and Hive version 2.3.6, both of which are now considered EOL.

There are currently two active release lines of Hadoop (https://hadoop.apache.org/releases.html): 3.3 and 3.4.

Our Debian packages are currently built with Apache Bigtop (https://bigtop.apache.org/), but we have had to maintain a custom fork of the project's 1.5 branch in order to keep a release that builds for Debian bullseye.

The most recent release of Apache Bigtop (https://bigtop.apache.org/download.html#releases) is 3.3.0.

The list of packages included in Bigtop 3.3 is here.

This includes:

  • Hadoop 3.3.6
  • Hive 3.1.3

We need to plan how and when to upgrade our two Hadoop clusters.

  • Will we continue to use bigtop packages, or should we look into some form of containerised approach?
  • How will we ensure that the data we have on the Hadoop datanodes is safe during the upgrade? Will we need to back some of it up beforehand?

Event Timeline

Will we continue to use bigtop packages, or should we look into some form of containerised approach?

The commercial world of Hadoop distributions, namely Hortonworks (which merged with Cloudera) and IBM BigInsights (which was sold to Hortonworks), tried to containerize Hadoop many years ago. It did not go well, especially for HDFS. I wish they had made these learnings public but AFAICT, they didn't. But I was there, and it was not pretty. A lot of wasted cycles.

On the other hand, Uber did it:
https://www.uber.com/blog/hadoop-namenode-container/
https://www.uber.com/blog/hadoop-container-blog

My suggestion though would be to reuse as much as possible. Bigtop works fine, doesn't it?

Will we continue to use bigtop packages, or should we look into some form of containerised approach?

The commercial world of Hadoop distributions, namely Hortonworks (which merged with Cloudera) and IBM BigInsights (which was sold to Hortonworks), tried to containerize Hadoop many years ago. It did not go well, especially for HDFS. I wish they had made these learnings public but AFAICT, they didn't. But I was there, and it was not pretty. A lot of wasted cycles.

Thanks @xcollazo - Good info. I was under the impression that Bigtop allowed us to create both packages and containers for Hadoop (cf. https://github.com/apache/bigtop/blob/master/provisioner/docker/README.md) but looking more closely at it, I don't think it's intended for anything more serious than smoke tests. They build the packages, install them into containers, and manage the config with Puppet, simply for running tests of various kinds.
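For reference, the provisioner workflow from that README looks roughly like this (the script name and flags are quoted from memory, so treat them as illustrative rather than exact):

cd provisioner/docker
# Spin up a 3-node pseudo-cluster from one of the bundled configs,
# run the smoke tests, then tear it down again:
./docker-hadoop.sh -C config_debian-11.yaml --create 3
./docker-hadoop.sh --smoke-tests
./docker-hadoop.sh --destroy

So it is useful for validating the packages we build, but it isn't a path to running the Hadoop daemons in containers in production.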

On the other hand, Uber did it:

I will check this out, thanks.

Bigtop works fine, doesn't it?

Yes and no. The main problem with Bigtop for us is that their release cycle and their list of supported Debian versions don't necessarily match up well with what we would like to use.

The bigtop 1.5 release, which we currently use, doesn't support bullseye, so we have to fork it to add support and build the packages ourselves.

Now that bullseye is Debian oldstable and we want to upgrade hosts to bookworm (i.e. stable), we are having trouble building bigtop 1.5 for bookworm. T378954: Build bigtop 1.5 packages for bookworm

The current Bigtop release (3.3.0), and also the master branch, only supports bullseye, so we are going to run into the same trouble again very quickly.

The Bigtop 3.4 BOM has yet to be finalised, but they're talking about dropping bullseye and replacing it with bookworm: https://issues.apache.org/jira/browse/BIGTOP-4218 - which is also potentially inconvenient.

So I'm just doing some initial thinking about whether we could use containers to de-couple the Hadoop runtime dependencies a little from the operating system version.

I have added a comment to the Bigtop Jira tracker for the 3.4.0 release, asking them to consider keeping two Debian versions in the support matrix.

I'd like to make a case that it would be really helpful to bigtop users if we could keep two Debian versions on the support matrix at once, rather than one.
Upgrading O/S major versions in production is difficult enough, without having to co-ordinate a bigtop version upgrade at the same time. Thanks for your consideration.

I may follow it up with a patch.

I have done some initial tests to build the Hadoop and Hive packages against version 3.3.0 of Bigtop.

btullis@marlin:~/wmf/bigtop$ docker run --name bigtop-3.3 -it -v `pwd`:/ws --workdir /ws bigtop/slaves:3.3.0-debian-11

root@ec4e4c9fcaed:/ws# source /etc/profile.d/bigtop.sh

root@ec4e4c9fcaed:/ws# ./gradlew allclean hadoop-pkg hive-pkg

root@ec4e4c9fcaed:/ws# find ./output -name "*.deb"
./output/hadoop/hadoop-httpfs_3.3.6-1_amd64.deb
./output/hadoop/hadoop-hdfs-datanode_3.3.6-1_amd64.deb
./output/hadoop/hadoop-client_3.3.6-1_amd64.deb
./output/hadoop/libhdfs0-dev_3.3.6-1_amd64.deb
./output/hadoop/hadoop-yarn-nodemanager_3.3.6-1_amd64.deb
./output/hadoop/hadoop-hdfs-dfsrouter_3.3.6-1_amd64.deb
./output/hadoop/hadoop-mapreduce_3.3.6-1_amd64.deb
./output/hadoop/hadoop-yarn-timelineserver_3.3.6-1_amd64.deb
./output/hadoop/hadoop-hdfs_3.3.6-1_amd64.deb
./output/hadoop/hadoop-hdfs-zkfc_3.3.6-1_amd64.deb
./output/hadoop/hadoop-hdfs-journalnode_3.3.6-1_amd64.deb
./output/hadoop/hadoop-yarn-router_3.3.6-1_amd64.deb
./output/hadoop/hadoop-yarn-resourcemanager_3.3.6-1_amd64.deb
./output/hadoop/hadoop-hdfs-namenode_3.3.6-1_amd64.deb
./output/hadoop/libhdfspp_3.3.6-1_amd64.deb
./output/hadoop/libhdfs0_3.3.6-1_amd64.deb
./output/hadoop/hadoop_3.3.6-1_amd64.deb
./output/hadoop/hadoop-conf-pseudo_3.3.6-1_amd64.deb
./output/hadoop/hadoop-yarn_3.3.6-1_amd64.deb
./output/hadoop/hadoop-mapreduce-historyserver_3.3.6-1_amd64.deb
./output/hadoop/hadoop-yarn-proxyserver_3.3.6-1_amd64.deb
./output/hadoop/hadoop-doc_3.3.6-1_all.deb
./output/hadoop/hadoop-hdfs-fuse_3.3.6-1_amd64.deb
./output/hadoop/libhdfspp-dev_3.3.6-1_amd64.deb
./output/hadoop/hadoop-hdfs-secondarynamenode_3.3.6-1_amd64.deb
./output/hadoop/hadoop-kms_3.3.6-1_amd64.deb
./output/hive/hive-jdbc_3.1.3-1_all.deb
./output/hive/hive-webhcat_3.1.3-1_all.deb
./output/hive/hive-server2_3.1.3-1_all.deb
./output/hive/hive-hbase_3.1.3-1_all.deb
./output/hive/hive-metastore_3.1.3-1_all.deb
./output/hive/hive_3.1.3-1_all.deb
./output/hive/hive-hcatalog-server_3.1.3-1_all.deb
./output/hive/hive-webhcat-server_3.1.3-1_all.deb
./output/hive/hive-hcatalog_3.1.3-1_all.deb

So this part is fine.
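As a further sanity check, the metadata of the built packages can be inspected without installing them, e.g.:

root@ec4e4c9fcaed:/ws# dpkg-deb --info ./output/hive/hive-metastore_3.1.3-1_all.deb | grep -E 'Package|Version|Depends'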

I notice that Sqoop has been removed from Bigtop as well: https://issues.apache.org/jira/browse/BIGTOP-3770

I have read that some people have managed to get Sqoop 1.4.7 to work with Hadoop 3, though.
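For the record, the workaround usually described (not yet verified against our build) is to supply the commons-lang 2.x jar that Hadoop 3 no longer ships, since Sqoop 1.4.7 still uses the old org.apache.commons.lang API:

# Illustrative only: add the old commons-lang jar to Sqoop's classpath
wget https://repo1.maven.org/maven2/commons-lang/commons-lang/2.6/commons-lang-2.6.jar \
  -P /usr/lib/sqoop/lib/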

This is a techblog post about how we switched from CDH to Bigtop in 2021.
https://techblog.wikimedia.org/2021/05/07/upgrading-hadoop-in-just-one-day/

At that time, we backed up 400 TB of unrecoverable data to a backup HDFS cluster using distcp.
This time, we don't have a backup HDFS cluster, but we do have 982 TB of (non-replicated) space free on the Ceph cluster.

btullis@cephosd1001:~$ sudo ceph df
--- RAW STORAGE ---
CLASS      SIZE    AVAIL    USED  RAW USED  %RAW USED
hdd    1010 TiB  982 TiB  28 TiB    28 TiB       2.77
ssd     140 TiB  140 TiB  75 GiB    75 GiB       0.05
TOTAL   1.1 PiB  1.1 PiB  28 TiB    28 TiB       2.44

If we use an erasure coded pool, we could safely use about 80% of that - so in the region of 750 TB for temporary backups.
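A rough sketch of how such a temporary backup might work, assuming we expose that Ceph capacity through an S3-compatible radosgw endpoint (the endpoint, bucket, and source path below are made up for illustration):

hadoop distcp \
  -Dfs.s3a.endpoint=https://rgw.eqiad.wmnet \
  -Dfs.s3a.path.style.access=true \
  -update -numListstatusThreads 40 \
  hdfs:///wmf/data/unrecoverable \
  s3a://hadoop-upgrade-backup/wmf/data/unrecoverable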

BTullis triaged this task as Medium priority.Nov 12 2024, 1:05 PM
BTullis added subscribers: Ottomata, JAllemandou.

Those Uber blogs are pretty fascinating! It does seem like they did a pretty good job of it, but they were able to take on the containerisation piece as a discrete project and tackle parts of it at different times.
I think that this highlights that it would be difficult for us to containerize the Hadoop daemons while also trying to coordinate a major Hadoop version upgrade.

BTullis renamed this task from Plan for a Hadoop and Hive upgrade for the Data Platform to Upgrade Hadoop to version 3.3.x and Hive to version 3.1.x.Nov 13 2024, 12:33 PM

Something not to forget about is that this upgrade is also about upgrading to Java 11!

@JAllemandou do we have to upgrade to JDK11 at the same time, or can we phase in the client JDK upgrades after we upgrade Hadoop itself?

Something not to forget about is that this upgrade is also about upgrading to Java 11!

OK, thanks @JAllemandou. As I understand it, we still need JDK 8 for compiling Hadoop 3.3.6, but for runtime JDK 11 is supported.

https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions
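In practice that would presumably mean continuing to build with JDK 8 in the Bigtop containers, while pointing the daemons at JDK 11 on the cluster hosts, along the lines of:

# In /etc/hadoop/conf/hadoop-env.sh (Debian's JDK 11 path):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64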

BTullis raised the priority of this task from Medium to High.Nov 26 2024, 11:11 AM

A point of note is that Hive 3.x has now been classified as EOL as of 2024-10-08 - https://hive.apache.org/general/downloads/


The 4.0.1 release is now considered the stable release.

Version 4.0.1 is slated for release in Bigtop 3.4.0 (BIGTOP-4218) but we don't yet know when that will be.

I wonder whether it would be possible for us to stop using Hive from Bigtop and use containers for this instead.
As I understand it, we no longer wish to support the Hive/Mapreduce query engine in production, so our primary use case for Hive is the metastore service.

If we could build Hive 4.0.1 containers, using Hadoop 3.3.6, then we could migrate the Hive metastore to the dse-k8s cluster, instead of running it on dedicated Hadoop co-ordinator servers (currently an-coord100[3-4]).
@JAllemandou - @Ottomata - Does this sound feasible to you, or would we definitely need to keep installing hive and hive-hcatalog packages to every Hadoop worker node?

As I understand it, we no longer wish to support the Hive/Mapreduce query engine in production, so our primary use case for Hive is the metastore service.

+1

@JAllemandou - @Ottomata - Does this sound feasible to you, or would we definitely need to keep installing hive and hive-hcatalog packages to every Hadoop worker node?

Sounds feasible to me! I wonder if we would even migrate to another "metastore" engine (OpenHouse, Project Nessie, Databricks Unity Catalog, or something else).

Hive version 4.0.1 has been added to bigtop in this patch: https://github.com/apache/bigtop/commit/d8459aabfcc8d64bc3121bedc1b2c55ca3788270
We can probably cherry-pick this into our build.
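That would be something like (remote name arbitrary):

git remote add upstream https://github.com/apache/bigtop.git
git fetch upstream
git cherry-pick d8459aabfcc8d64bc3121bedc1b2c55ca3788270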

BTullis renamed this task from Upgrade Hadoop to version 3.3.x and Hive to version 3.1.x to Upgrade Hadoop to version 3.3.6 and Hive to version 4.0.1.Nov 28 2024, 11:04 AM

we could migrate the Hive metastore to the dse-k8s cluster, instead of running it on dedicated Hadoop co-ordinator servers (currently an-coord100[3-4]).

+1 to removing support for Hive MR compute. We only need hive-metastore (or something compatible?) as a standalone service and doing that in k8s makes sense.

would we definitely need to keep installing hive and hive-hcatalog packages to every Hadoop worker node?

I really don't know what dependencies need to be installed on every worker node. As long as Spark etc. jobs can use the Hive metastore, we are good!
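For what it's worth, a Spark client only needs the metastore's Thrift URI rather than any local Hive packages, so something like this should be all that changes on the client side (the service hostname is hypothetical):

spark-sql --conf spark.hadoop.hive.metastore.uris=thrift://hive-metastore.dse-k8s.discovery.wmnet:9083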

Just bumped into this doc so wanted to share: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75978150#AdminManualMetastore3.0Administration-RunningtheMetastoreWithoutHive

Beginning in Hive 3.0, the Metastore is released as a separate package and can be run without the rest of Hive. This is referred to as standalone mode.

So it looks like we don't need to do anything fancy: just deploy the new Metastore-only artifacts, plus some configuration changes.
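Based on that doc, running it standalone would be roughly as follows (the artifact name and database details are illustrative, not verified against 4.0.1):

tar xzf hive-standalone-metastore-4.0.1-bin.tar.gz
cd apache-hive-metastore-4.0.1-bin
# metastore-site.xml points at the existing backing database, then:
bin/schematool -upgradeSchema -dbType mysql  # bring the existing schema up to 4.x
bin/start-metastore -p 9083                  # serve the Thrift endpoint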

This is great @xcollazo :)
I know some people still use Hive. I guess when we remove it, people will be forced to move to Spark.