Upgrade Hive to ≥ 2.0
Open, Normal, Public

Description

As of September 2018, the analytics cluster is running Hive 1.1, but we in Product-Analytics are very much interested in a newer version of Hive so that we can access a number of UDFs that are currently not available.

These include:

  • substring_index (added in 2.0)
  • MD5 (added in 2.0)
  • SHA1 (added in 2.0)
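
For readers unfamiliar with these UDFs, here is a rough Python sketch of what they compute. It is only an approximation of the behaviour described in the Hive LanguageManual, not Hive's implementation: the Python function names simply mirror the UDF names, and the handling of an empty delimiter and of overlapping delimiters is an assumption.

```python
import hashlib


def substring_index(s, delim, count):
    """Approximate Hive's substring_index(str, delim, count).

    Positive count: everything left of the count-th delimiter (from the left).
    Negative count: everything right of the count-th delimiter (from the right).
    Assumption: an empty delimiter or a zero count yields an empty string.
    """
    if count == 0 or delim == "":
        return ""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])


def md5(s):
    """Hex MD5 digest of a string (UTF-8 encoded), 32 hex characters."""
    return hashlib.md5(s.encode("utf-8")).hexdigest()


def sha1(s):
    """Hex SHA-1 digest of a string (UTF-8 encoded), 40 hex characters."""
    return hashlib.sha1(s.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(substring_index("www.apache.org", ".", 2))   # www.apache
    print(substring_index("www.apache.org", ".", -1))  # org
    print(len(md5("ABC")), len(sha1("ABC")))           # 32 40
```

On Hive 1.1 this kind of logic has to be pieced together from other built-ins or a custom UDF, which is the main motivation for the request.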

When we met about it, @Nuria mentioned plans to upgrade from CDH5 to CDH6 once v6 leaves beta, but told us there is currently no release date. Since v6 has been in beta long past Cloudera's ETA, AE has considered switching to Hortonworks as the distribution.

This ticket is to formally track the status of the upgrade :)

Thanks!

mpopov created this task. Sep 4 2018, 7:31 PM
mpopov updated the task description. Sep 4 2018, 7:45 PM
Tbayer awarded a token. Sep 4 2018, 8:09 PM
fdans triaged this task as Normal priority.
elukey added a subscriber: elukey. Sep 7 2018, 11:21 AM

Hi! Thanks a lot for this task, it triggered some useful discussions. A couple of notes after checking the CDH release details:

  • CDH 6.0 community edition (we don't use the enterprise/paid one) seems to come with Hive 2.1, although it is a bit weird that Ubuntu is the only distribution listed on the download page.
  • CDH 5.15 (we are running 5.10 now) still offers Hive 1.1, so nothing would really change from the one we are running now (minus a couple of patches that Cloudera included).

The main issue I can see with CDH 6 is that it includes Hadoop 3, while we are still running 2.6.0 (plus some patches from Cloudera). The jump is big and would probably require a lot of testing and work before attempting the upgrade, and we'd be among the first to try it and run into bugs, things not working, etc. My experience with Cloudera community support is not the best, so if we choose to go for CDH 6 I'd wait a couple of releases to let the community version "stabilize" a bit more.

Packaging our own version of Hive to be deployed alongside the CDH packages might make sense, but we'd probably break stability among the CDH packages (since they are shipped as a whole). Sadly, Cloudera does not release the Debian source packages, so we cannot even try to work on the packages ourselves to backport important patches (like UDFs, etc.).

On a more general note, we are currently considering whether it would be worth changing distribution and moving away from Cloudera (more details will appear in the parent task, T203693), but it would of course require a ton of time :)

After this looong post, I just want to say that we support this request and that we'll try to do everything that we can to upgrade asap, but it might require a couple (or more) quarters before we'll be able to hit it.

Let me know your thoughts!

mpopov renamed this task from "Upgrade Hive to ≥1.3 or ≥2.1" to "Upgrade Hive to ≥1.13 or ≥2.1". Sep 7 2018, 3:21 PM
mpopov updated the task description.

Whoops, realized I was missing a digit in the version.

mpopov added a comment. Sep 7 2018, 3:24 PM

> Hi! Thanks a lot for this task, it triggered some useful discussions. A couple of notes after checking the CDH release details:

> On a more general note, we are currently considering whether it would be worth changing distribution and moving away from Cloudera (more details will appear in the parent task, T203693), but it would of course require a ton of time :)

> After this looong post, I just want to say that we support this request and that we'll try to do everything that we can to upgrade asap, but it might require a couple (or more) quarters before we'll be able to hit it.

That sounds good! Nuria mentioned switching to the Hortonworks distribution from Cloudera rather than installing individual components of our data ecosystem, which we understand would not be a trivial endeavor :) Glad to hear you folks are on board with upgrading!

> After this looong post, I just want to say that we support this request and that we'll try to do everything that we can to upgrade asap, but it might require a couple (or more) quarters before we'll be able to hit it.

That sounds totally reasonable to me too! Thanks for taking the time to think it through and explain your reasoning so clearly 😁

> Whoops, realized I was missing a digit in the version.

Sorry, you are right, I was tricked by the title! Thanks for amending :)

> That sounds good! Nuria mentioned switching to the Hortonworks distribution from Cloudera rather than installing individual components of our data ecosystem, which we understand would not be a trivial endeavor :) Glad to hear you folks are on board with upgrading!

We are currently considering various options:

  • Stick with CDH (no source package available, community support very limited, etc.)
  • Evaluate Hortonworks (but we don't know exactly what would change, we haven't done any deep dive/comparison yet).
  • Evaluate Apache Bigtop (Apache top-level project, fully open source, easy to work with upstream, and it seems that CDH uses it as a baseline, judging from the references to it that I can see in the files deployed to our Hadoop machines).

My personal preference would be the last option since it is fully open source, but it is still not clear what it would be missing compared to the other two.

FYI, the latest Bigtop release doesn't seem bad:

https://cwiki.apache.org/confluence/display/BIGTOP/Bigtop+1.2.1+Release

hive                      1.2.1
hadoop                    2.7.3

I'll write a summary in the parent task.

@mpopov, I'm actually confused now. I'm looking at the Hive downloads page, which has the best version history I could find, and neither Hive 1.3 nor 1.13 seems to exist. Are you sure you put the right version? 🤔

> @mpopov, I'm actually confused now. I'm looking at the Hive downloads page, which has the best version history I could find, and neither Hive 1.3 nor 1.13 seems to exist. Are you sure you put the right version? 🤔

I'm confused too! I was going off https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF where substring_index (one of the UDFs that we're hoping to get from the upgrade) is available "as of Hive 1.3.0"

Same with most of the crypto functions:

¯\_(ツ)_/¯

Neil_P._Quinn_WMF added a comment. Edited Sep 11 2018, 10:12 PM

> I'm confused too! I was going off https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF where substring_index (one of the UDFs that we're hoping to get from the upgrade) is available "as of Hive 1.3.0"

> Same with most of the crypto functions:

Huh, wow. I did some more sleuthing and it looks like the Hive 2.0 release notes actually mention adding all these functions. Also, if you browse the release notes dropdown in Hive's "configure release notes" page, you can see 1.3 listed under "unreleased versions". So it seems Hive 1.3 was planned, but it was eventually folded into 2.0.

Are there any more UDFs we want? If not, it sounds like we're actually requesting Hive 2.0. I'll update the task to reflect that.

Neil_P._Quinn_WMF renamed this task from "Upgrade Hive to ≥1.13 or ≥2.1" to "Upgrade Hive to ≥ 2.0". Sep 11 2018, 10:14 PM
Neil_P._Quinn_WMF updated the task description.

Let's also not forget the potential speedups. In particular, the LLAP thing introduced in Hive 2.0 ("Hive Interactive Query") sounds interesting:

https://medium.com/sqooba/hive-llap-brings-interactive-sql-queries-on-hadoop-8f876ef116d8
("The main problem with a normal Hive job, is that every time a SQL job is submitted to the Hive Server, a YARN application will be started. This overhead is added on top of the SQL query itself [...] when the query is big, the startup time is amortised by the processing time, but when the query is small (interactive BI is one use case), the overhead becomes a real pain to provide a interactive experience to the data analyst running the query from her SQL bench application ... Out of all these trials and many more comes LLAP, a shared, re-usable SQL query layer on top of YARN.")

https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/ ("Hive 2 with LLAP averages 26x faster than Hive 1")

Disclaimer: I don't know how well this would apply to our environment, e.g. whether / how much additional configuration work would be needed to get LLAP running.

LLAP seems to me (I am very ignorant on this front, so don't take this as an authoritative source :) very close to what Presto does to achieve its query speed, so it might be an interesting test to perform.

elukey added a comment. Edited Sep 20 2018, 8:15 PM

So after today's offsite hacking we may have an interim solution to deploy Hive 2.1.1. This of course needs to be tested very carefully in labs, but it may work.

These are the CDH dependencies for hive now:

elukey@analytics1003:~$ apt-cache show hive | grep Depends
Depends: adduser, hadoop-client, bigtop-utils (>= 0.7), zookeeper, hive-jdbc (= 1.1.0+cdh5.10.0+859-1.cdh5.10.0.p0.71~jessie-cdh5.10.0), avro-libs, parquet

And these are the Bigtop ones (for Hive 1.1):

Depends: adduser, hadoop-client, bigtop-utils (>= 0.7), zookeeper, hive-jdbc (= ${source:Version}), python

The Bigtop 1.3.1 release (should be out soon) will bump the Hive version to 2.1.1, but the above deps should mostly stay the same. So we could simply replace the hive* Debian packages on analytics1003 (the analytics coordinator where Hive runs) with the Bigtop ones, keeping the current hadoop-client, which should work nicely. This should allow us to upgrade Hive before moving to another distro, etc.
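
To make the comparison of the two Depends lines explicit, here is a small Python sketch that parses them and reports which dependencies differ. The two strings are copied from the output quoted above; the helper name parse_depends is made up for illustration, and the parsing is a simplification (it ignores alternatives written with "|").

```python
import re


def parse_depends(line):
    """Parse a Debian 'Depends:' line into {package: version constraint or None}.

    Hypothetical helper for illustration; alternatives ("a | b") are skipped.
    """
    _, _, body = line.partition("Depends:")
    deps = {}
    for item in body.split(","):
        item = item.strip()
        # Package name, optionally followed by a parenthesised constraint.
        match = re.match(r"^(\S+)\s*(?:\((.*)\))?$", item)
        if match:
            deps[match.group(1)] = match.group(2)
    return deps


# Copied from the apt-cache output and the Bigtop control file quoted above.
CDH_LINE = (
    "Depends: adduser, hadoop-client, bigtop-utils (>= 0.7), zookeeper, "
    "hive-jdbc (= 1.1.0+cdh5.10.0+859-1.cdh5.10.0.p0.71~jessie-cdh5.10.0), "
    "avro-libs, parquet"
)
BIGTOP_LINE = (
    "Depends: adduser, hadoop-client, bigtop-utils (>= 0.7), zookeeper, "
    "hive-jdbc (= ${source:Version}), python"
)

cdh = parse_depends(CDH_LINE)
bigtop = parse_depends(BIGTOP_LINE)

shared = sorted(set(cdh) & set(bigtop))
only_cdh = sorted(set(cdh) - set(bigtop))
only_bigtop = sorted(set(bigtop) - set(cdh))

if __name__ == "__main__":
    print("shared:     ", shared)
    print("only CDH:   ", only_cdh)
    print("only Bigtop:", only_bigtop)
```

The shared set is what makes the package swap look plausible; the packages that appear on only one side (avro-libs and parquet for CDH, python for Bigtop) are the ones that would need a closer look during testing.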

No promises but we'll try to do some testing next quarter :)

Oooh, exciting!!! :D

Yesterday T209407 drained a lot of our time: the upgrade to CDH 5.15 caused Spark issues that took a long time to fix. In this case Hive was not upgraded on paper (same major/minor version for old/new packages, 1.1), but Cloudera probably backported some new stuff from 1.2 that caused problems. The whole distribution was tested by Cloudera before the release, so we didn't have major surprises, but now I am wondering how much stability we'd sacrifice if we upgrade to Hive 2.1.1 as described above. It would be extremely useful to some people, but I fear that it might cause major headaches for all of us in the long term. Writing this note just to say that we didn't forget about this task, only that it might not be as easy as we thought.

> Now I am wondering how much stability we'd sacrifice if we upgrade to Hive 2.1.1 as described above. It would be extremely useful to some people, but I fear that it might cause major headaches for all of us in the long term. Writing this note just to say that we didn't forget about this task, only that it might not be as easy as we thought.

That's a totally reasonable consideration. We know that not everything we want is possible—we just like to understand the constraints and the prioritization. Because of the conversation here, I feel like on this topic we do. Thank you for that! 😁

Keeping the task updated: in https://issues.apache.org/jira/browse/BIGTOP-3074 Apache Bigtop removed the Oozie packaging since it does not seem to be compatible (yet) with Hive 2.x. The CDH6 distribution seems to have Hive 2.1 and Oozie 5.x beta though :)