Update to CDH 6 or other up-to-date Hadoop distribution
Open, Medium, Public

Description

We are currently running a Cloudera Hadoop distribution for the Analytics cluster, specifically CDH 5.10. This distribution has served us well, but it has shown some shortcomings:

  • Limited community support for reporting bugs when needed (and getting issues fixed upstream).
  • Absence of Debian source packages (limiting our ability to apply patches promptly, mostly for security CVEs).

Cloudera released CDH 6 a few days ago: a Hadoop 3.0 based distribution containing a lot of software upgrades (among others, Hive 2.1). Given that we are running Hadoop 2.6.0 now, the jump to a new major version would require a lot of work and testing, likely spanning multiple quarters.

This could be a good time to consider whether we want to keep going with CDH or change distribution, for example to:

  • Hortonworks
  • Apache BigTop

A few more details about each distribution:

Hortonworks

The last release in the 2.x series seems to be 2.6.5 (documentation for installing it manually is available). The repository seems to deny directory listing, so it is difficult to explore, but as far as I can see support only goes up to Debian 7 (for comparison, Debian Stretch is 9, so that is very old).

The latest release is 3.1.0, and it seems to support Debian Stretch.

It is very nice that Apache Ambari and Ranger are integrated with the distribution.

Apache BigTop

Version 1.4 supports Debian Stretch, and the upcoming 1.5 also supports Buster (but it jumps to Hadoop 3).

The Deb sources are available in https://github.com/apache/bigtop/tree/master/bigtop-packages/src/deb

In https://issues.apache.org/jira/browse/BIGTOP-3074 they (hopefully temporarily) removed the Oozie build support, since it does not work with Hive 2.X (it seems that upstream is working on it, see https://issues.apache.org/jira/browse/BIGTOP-3099).

CDH 6

The release notes are very interesting to read, and the package list makes it clear that Hadoop 3.0 is installed. From the requirements notes, though, it seems that Debian is not officially supported (at least in version 6.0.0), only Ubuntu Xenial.

From this post it seems that Cloudera will not support Debian for CDH 6. Moreover, Cloudera does not offer Debian source packages for 5.X (the distribution that we are currently running), which makes it difficult to patch things on the fly if needed (for example, for a critical CVE that doesn't have a new Cloudera package ready). Running CDH 6 on Debian would therefore mean rebuilding the Ubuntu Xenial deb packages for Stretch each time a release happens.

Note: it seems that Cloudera and Hortonworks will merge into one company very soon.

Note 2: CDH 6.3 seems to support Java 11, which is the default on Debian Buster.

Hopsworks Hadoop

Mentioning it for completeness, even if it is likely not a candidate for production (the project is new and of course not as battle-tested as Hadoop). More info at https://hops.readthedocs.io/en/latest/index.html. They have redesigned a lot of critical aspects of Hadoop, removing things like Zookeeper, Journal nodes, etc. and replacing them with more fault-tolerant and flexible solutions. They call themselves "Hadoop for humans"; definitely a project to keep an eye on!

Event Timeline

fdans created this task. Sep 6 2018, 4:50 PM
Restricted Application added a subscriber: Aklapper. Sep 6 2018, 4:50 PM
fdans renamed this task from "Update CDH to 6" to "Update CDH to 6 or alternatives". Sep 6 2018, 4:52 PM
nshahquinn-wmf renamed this task from "Update CDH to 6 or alternatives" to "Update to Cloudera CDH 6 or other Hadoop distribution". Sep 7 2018, 6:25 PM
nshahquinn-wmf renamed this task from "Update to Cloudera CDH 6 or other Hadoop distribution" to "Update to CDH 6 or other Hadoop distribution".
nshahquinn-wmf renamed this task from "Update to CDH 6 or other Hadoop distribution" to "Update to CDH 6 or other up-to-date Hadoop distribution".
elukey updated the task description. Sep 10 2018, 10:34 AM
elukey removed a subscriber: fdans.
elukey triaged this task as Medium priority. Sep 10 2018, 2:05 PM
elukey updated the task description.
elukey updated the task description.
elukey updated the task description.
elukey updated the task description. Sep 10 2018, 2:10 PM
elukey updated the task description. Sep 11 2018, 7:37 AM
elukey updated the task description. Sep 11 2018, 7:54 AM
elukey updated the task description. Sep 12 2018, 8:32 AM
elukey updated the task description. Sep 17 2018, 4:42 PM
elukey updated the task description. Oct 10 2018, 3:56 PM
As of 6.0 we (Cloudera) no longer support/build on debian:
https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_deprecated_items.html#concept_ylw_bc2_rbb
Sorry to disappoint.
We continue support for Debian on 5.x though.

-Ben
elukey updated the task description. Nov 12 2018, 10:48 AM
elukey updated the task description.
elukey updated the task description. Nov 14 2018, 7:41 AM
elukey updated the task description. Nov 26 2018, 2:45 PM
elukey updated the task description. Nov 26 2018, 2:56 PM
elukey moved this task from Backlog to Stalled on the User-Elukey board. Dec 5 2018, 2:10 PM
elukey moved this task from Stalled to Backlog on the User-Elukey board. Feb 27 2019, 9:15 AM

Had an interesting chat with Moritz today about CDH6 and Debian. The following procedure could be used to import all the CDH6 packages and rebuild them for buster-wikimedia:

* on boron set the proxy
* dget $PATH_TO_CLOUDERA/foo.dsc
* bump the build versions with "dch"
* DIST=buster pdebuild
* on install1002: rsync rsync://boron.eqiad.wmnet/pbuilder-result/buster-wikimedia/*foo* .

CDH 6.2 seems to support both Ubuntu 16.04 and 18.04. Debian Buster and Ubuntu 18.04 use Java 11, so from CDH 6.2 onward we should be able to rebuild for Buster without a lot of trouble.

The above procedure is also a good test to see whether we can rebuild the packages without any effort or whether it requires manual intervention.
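
The rebuild steps listed above could be wrapped in a small script along these lines. This is only a dry-run sketch under stated assumptions: the helper names (run, rebuild_for_buster), the PROXY_URL and VERSION_SUFFIX variables, the placeholder proxy address, and the use of "dch --local" to bump the version are illustrative and not taken from the chat with Moritz. Commands are printed rather than executed unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run sketch of the package rebuild steps described above.
# Hypothetical names (not from the original comment): run, rebuild_for_buster,
# PROXY_URL, VERSION_SUFFIX. Nothing is executed unless RUN=1 is set.

run() {
    # Print the command; execute it only when RUN=1.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

rebuild_for_buster() {
    dsc_url="$1"   # source package location, e.g. $PATH_TO_CLOUDERA/foo.dsc
    src_dir="$2"   # directory that dget unpacks the source into

    # Package name without version, used only for the rsync hint below.
    pkg="$(basename "$dsc_url" .dsc)"
    pkg="${pkg%%_*}"

    # 1. On the build host (boron), set the proxy (placeholder address).
    run export https_proxy="${PROXY_URL:-http://proxy.example.invalid:8080}"

    # 2. Download and unpack the source package.
    run dget "$dsc_url"

    # 3. Bump the build version (assumption: a local version suffix).
    run cd "$src_dir"
    run dch --local "${VERSION_SUFFIX:-+wmf}" "Rebuild for buster-wikimedia"

    # 4. Build in a clean Buster environment.
    run env DIST=buster pdebuild

    # 5. Reminder for the import step, which runs on another host.
    echo "# then, on install1002:"
    echo "# rsync rsync://boron.eqiad.wmnet/pbuilder-result/buster-wikimedia/*${pkg}* ."
}
```

Calling rebuild_for_buster "$PATH_TO_CLOUDERA/foo.dsc" foo-1.0 prints the commands for review; if the printed list matches what would be typed by hand, re-running with RUN=1 on boron executes steps 1 to 4.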

elukey updated the task description. Sep 12 2019, 9:02 AM
elukey updated the task description. Jan 3 2020, 4:56 PM

https://issues.apache.org/jira/browse/BIGTOP-3123 mentions the following upgrades:

hive                      2.3.3 => 3.1.2
hadoop                    2.8.5 => 3.2.1

It also mentions support for Debian 10 (Buster).

elukey updated the task description. Jan 14 2020, 1:02 PM

Summary of current thoughts:

  • From https://lists.apache.org/thread.html/r9b588c1c9f693bd78549e7f3251004bc114c754b8d16f4edd796b828%40%3Cuser.bigtop.apache.org%3E BigTop seems a nice choice, even if there is a possibility that they will not support Debian Buster (see the link for more info). We would have a way to interact and work with the dev community, which is a big plus compared to the current Cloudera model (basically, take what they offer and that's it). Moving to BigTop would bring immediate benefits, like moving to Hadoop 2.8 and Hive 2.x, but it might be tricky to keep compatibility with things like Hue (which is not provided by BigTop).
  • Hops is the new kid in town, and on paper it looks amazing: a lot of performance improvements, integration between GPUs and Hadoop, better automation to share datasets for users, etc. But they basically forked from Hadoop 2.8, so choosing something that young might mean risking a lot. We definitely need to keep an eye on it, and possibly create a little cluster to play with Hops, but I don't see it as a real candidate for the Analytics Hadoop cluster.
  • CDH 6 is also a good option: Hadoop 3 and no big changes from the current way of doing things. It will not support Debian, so we would need to either use the Ubuntu packages or rebuild them for Buster. Their long-term plan is not clear though, and I am not sure what direction the open-source version of CDH will take in the future.

The best candidate in my opinion is BigTop.

Aklapper removed a project: Analytics. Jul 4 2020, 7:59 AM