
Upgrade Hive to ≥ 2.0
Open, Normal, Public

Description

As of September 2018, the analytics cluster is running Hive 1.1, but we in Product-Analytics are very much interested in a newer version of Hive so that we can get access to a bunch of new features.
These include (sketched in the example after this list):

  • support for DATE types in Parquet tables (added in 1.2)
  • substring_index() (added in 2.0)
  • MD5() (added in 2.0)
  • SHA1() (added in 2.0)
  • potential speedups and better interactive querying with LLAP
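
For illustration, here is roughly what these features would let us write (a sketch only: the table and column names below are made up, and none of it has been run against our cluster):

-- DATE columns in Parquet-backed tables (supported as of Hive 1.2):
CREATE TABLE sketch_events (
  user_hash STRING,
  event_dt  DATE    -- today we have to fall back to STRING or TIMESTAMP for Parquet
)
STORED AS PARQUET;

-- UDFs added in Hive 2.0:
SELECT
  substring_index('en.wikipedia.org', '.', 1) AS language_prefix,  -- returns 'en'
  md5('some identifier')  AS md5_hash,
  sha1('some identifier') AS sha1_hash;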

Event Timeline

mpopov created this task. Sep 4 2018, 7:31 PM
Restricted Application added a subscriber: Aklapper. Sep 4 2018, 7:31 PM
mpopov updated the task description. Sep 4 2018, 7:45 PM
fdans triaged this task as Normal priority. Sep 6 2018, 4:43 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
elukey added a subscriber: elukey. Sep 7 2018, 11:21 AM

Hi! Thanks a lot for this task, it triggered some useful discussions. A couple of notes after checking the CDH release details:

  • CDH 6.0 community edition (we don't use the enterprise/paid one) seems to come with Hive 2.1, though it is a bit odd that Ubuntu is the only distribution listed on the download page.
  • CDH 5.15 (we are running 5.10 now) still ships Hive 1.1.0, so nothing would really change from the version we are running now (minus a couple of patches that Cloudera included).

The main issue that I can see while checking CDH 6 is that it includes Hadoop 3, while we are still running 2.6.0 (plus some patches from Cloudera). The jump is big and it would probably require a lot of testing and work before attempting the upgrade, and we'd be among the first to try it and run into bugs, things not working, etc.. My experience with Cloudera community support is not the best, so if we choose to go for CDH 6 I'd wait a couple of releases to let the community version "stabilize" a bit more.

Packaging our own version of Hive to deploy alongside the CDH packages might make sense, but we'd probably compromise the stability of the CDH packages (since they are shipped as a whole). Cloudera sadly does not release Debian source packages, so we cannot even try to work on the packages ourselves to backport important patches (like UDFs, etc.).

On a more general note, we are currently considering whether it would be worth changing distribution and moving away from Cloudera (more details will appear in the parent task, T203693), but that would of course require a ton of time :)

After this looong post, I just want to say that we support this request and will try to do everything we can to upgrade asap, but it might take a couple of quarters (or more) before we're able to get to it.

Let me know your thoughts!

mpopov renamed this task from Upgrade Hive to ≥1.3 or ≥2.1 to Upgrade Hive to ≥1.13 or ≥2.1. Sep 7 2018, 3:21 PM
mpopov updated the task description.

Whoops, realized I was missing a digit in the version.

mpopov added a comment. Sep 7 2018, 3:24 PM

Hi! Thanks a lot for this task, it triggered some useful discussions. A couple of notes after checking the CDH release details:
On a more general note, we are currently considering whether it would be worth changing distribution and moving away from Cloudera (more details will appear in the parent task, T203693), but that would of course require a ton of time :)
After this looong post, I just want to say that we support this request and will try to do everything we can to upgrade asap, but it might take a couple of quarters (or more) before we're able to get to it.

That sounds good! Nuria mentioned switching to the Hortonworks distribution from Cloudera rather than installing individual components of our data ecosystem, which we understand would not be a trivial endeavor :) Glad to hear you folks are on board with upgrading!

After this looong post, I just want to say that we support this request and will try to do everything we can to upgrade asap, but it might take a couple of quarters (or more) before we're able to get to it.

That sounds totally reasonable to me too! Thanks for taking the time to think it through and explain your reasoning so clearly 😁

Whoops, realized I was missing a digit in the version.

Sorry, you are right, I was tricked by the title! Thanks for amending :)

That sounds good! Nuria mentioned switching to the Hortonworks distribution from Cloudera rather than installing individual components of our data ecosystem, which we understand would not be a trivial endeavor :) Glad to hear you folks are on board with upgrading!

We are currently considering various options:

  • Stick with CDH (no source package available, community support very limited, etc..)
  • Evaluate Hortonworks (but we don't know exactly what would change, we haven't done any deep dive/comparison yet).
  • Evaluate Apache BigTop (an Apache top-level project, fully open source, easy to work with upstream, and judging from the references I can see in the files deployed to our Hadoop machines, CDH seems to use it as a baseline).

My personal preference would be the last option since it is fully open source, but it is still not clear what we would miss compared to the other two..

FYI, the latest release of BigTop doesn't seem bad:

https://cwiki.apache.org/confluence/display/BIGTOP/Bigtop+1.2.1+Release

hive                      1.2.1
hadoop                    2.7.3

I'll write a summary in the parent task.

@mpopov, I'm actually confused now. I'm looking at the Hive downloads page, which has the best version history I could find, and neither Hive 1.3 nor 1.13 seems to exist. Are you sure you put the right version? 🤔

@mpopov, I'm actually confused now. I'm looking at the Hive downloads page, which has the best version history I could find, and neither Hive 1.3 nor 1.13 seems to exist. Are you sure you put the right version? 🤔

I'm confused too! I was going off https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF where substring_index (one of the UDFs that we're hoping to get from the upgrade) is available "as of Hive 1.3.0"

Same with most of the crypto functions:

¯\_(ツ)_/¯

Neil_P._Quinn_WMF added a comment (edited). Sep 11 2018, 10:12 PM

I'm confused too! I was going off https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF where substring_index (one of the UDFs that we're hoping to get from the upgrade) is available "as of Hive 1.3.0"
Same with most of the crypto functions:

Huh, wow. I did some more sleuthing and it looks like the Hive 2.0 release notes actually mention adding all these functions. Also, if you browse the release notes dropdown in Hive's "configure release notes" page, you can see 1.3 listed under "unreleased versions". So it seems Hive 1.3 was planned, but it was eventually folded into 2.0.

Are there any more UDFs we want? If not, it sounds like we're actually requesting Hive 2.0. I'll update the task to reflect that.

Neil_P._Quinn_WMF renamed this task from Upgrade Hive to ≥1.13 or ≥2.1 to Upgrade Hive to ≥ 2.0. Sep 11 2018, 10:14 PM
Neil_P._Quinn_WMF updated the task description.

Let's also not forget the potential speedups. In particular, the LLAP thing introduced in Hive 2.0 ("Hive Interactive Query") sounds interesting:

https://medium.com/sqooba/hive-llap-brings-interactive-sql-queries-on-hadoop-8f876ef116d8
("The main problem with a normal Hive job, is that every time a SQL job is submitted to the Hive Server, a YARN application will be started. This overhead is added on top of the SQL query itself [...] when the query is big, the startup time is amortised by the processing time, but when the query is small (interactive BI is one use case), the overhead becomes a real pain to provide a interactive experience to the data analyst running the query from her SQL bench application ... Out of all these trials and many more comes LLAP, a shared, re-usable SQL query layer on top of YARN.")

https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/ ("Hive 2 with LLAP averages 26x faster than Hive 1")

Disclaimer: I don't know how well this would apply to our environment, e.g. whether / how much additional configuration work would be needed to get LLAP running.
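
For reference, enabling it per session would presumably look something like the following (a rough sketch based on the Hive documentation; it assumes Tez and the LLAP daemons are already installed and running, which is not the case on our cluster, and I haven't tested any of it):

-- run on Tez instead of plain MapReduce (LLAP sits on top of Tez):
SET hive.execution.engine=tez;
-- route query fragments to the long-lived LLAP daemons:
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;

-- with that in place, a small interactive query like this one (table name made up)
-- should skip the per-query YARN application startup described in the article above:
SELECT COUNT(*) FROM some_small_table WHERE dt = '2018-09-01';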

LLAP seems to me (I am very ignorant on this front, so don't take this as an authoritative source :) very close to what Presto does to achieve its query speed, so it might be an interesting test to perform.

elukey added a comment (edited). Sep 20 2018, 8:15 PM

So after today's offsite hacking we may have an interim solution to deploy Hive 2.1.1. This of course needs to be tested very carefully in labs, but it may work.

These are the CDH dependencies for hive now:

elukey@analytics1003:~$ apt-cache show hive | grep Depends
Depends: adduser, hadoop-client, bigtop-utils (>= 0.7), zookeeper, hive-jdbc (= 1.1.0+cdh5.10.0+859-1.cdh5.10.0.p0.71~jessie-cdh5.10.0), avro-libs, parquet

And these are the Bigtop ones (for Hive 1.1):

Depends: adduser, hadoop-client, bigtop-utils (>= 0.7), zookeeper, hive-jdbc (= ${source:Version}), python

The Bigtop 1.3.1 release (should be out soon) will bump the Hive version to 2.1.1, but the above deps should mostly stay the same. So we could simply replace the hive* Debian packages on analytics1003 (the analytics coordinator where Hive runs) with the Bigtop ones, keeping the current hadoop-client, which should work nicely. This should allow us to upgrade Hive before upgrading to another distro, etc..

No promises but we'll try to do some testing next quarter :)

Oooh, exciting!!! :D

Yesterday T209407 drained a lot of our time: the upgrade to CDH 5.15 caused Spark issues that took a long time to fix. On paper Hive was not upgraded (same major/minor version, 1.1, for the old and new packages), but Cloudera probably backported some new stuff from 1.2 that caused problems. In that case the whole distribution was tested by Cloudera before the release, so we didn't have major surprises, but now I am wondering how much stability we'd sacrifice if we upgrade to Hive 2.1.1 as described above. It would be extremely useful to some people, but I fear it might cause major headaches for all of us in the long term. Writing this note just to say that we haven't forgotten about this task, only that it might not be as easy as we thought.

Now I am wondering how much stability we'd sacrifice if we upgrade to Hive 2.1.1 as described above. It would be extremely useful to some people, but I fear it might cause major headaches for all of us in the long term. Writing this note just to say that we haven't forgotten about this task, only that it might not be as easy as we thought.

That's a totally reasonable consideration. We know that not everything we want is possible—we just like to understand the constraints and the prioritization. Because of the conversation here, I feel like on this topic we do. Thank you for that! 😁

Keeping the task updated: in https://issues.apache.org/jira/browse/BIGTOP-3074 the Apache BigTop distribution removed the Oozie packaging, since it does not seem compatible (yet) with Hive 2.x. The CDH6 distribution seems to have Hive 2.1 and an Oozie 5.x beta though :)

Neil_P._Quinn_WMF added a comment (edited). Mar 5 2019, 11:18 AM

It looks like CDH 6.1, which includes Hive 2.1.1, was released in December.

@elukey, what's the current thinking about deploying this? I'm sure there are many complexities: going from CDH 5 to 6 sounds like a tricky upgrade, I know there's been discussion of switching from CDH to Hortonworks or BigTop, and I've heard the larger plan is to move away from Hive and towards Presto anyway.

It looks like CDH 6.1, which includes Hive 2.1.1, was released in December.
@elukey, what's the current thinking about deploying this? I'm sure there are many complexities: going from CDH 5 to 6 sounds like a tricky upgrade, I know there's been discussion of switching from CDH to Hortonworks or BigTop, and I've heard the larger plan is to move away from Hive and towards Presto anyway.

Is that indeed the plan? (The linked page doesn't mention whether we intend to actually abandon Hive altogether, or just to add Presto as an alternative for certain use cases like the Public Data Lake.)
If yes, what is the anticipated timeframe for this? Depending on how long we are still going to use Hive, an upgrade would still seem worthwhile.

BTW, regarding this (entirely correct) observation given on the Presto page as our motivation for not using Hive for the Public Data Lake:

Hive has significant issues when it comes to performance, there is a significant time-overhead for launching jobs and relying on MapReduce for computation makes the ratio of job-duration to data-size very bad for small-ish data.

... I'm also wondering how much the improvements promised for this very issue in Hive 2.0 (T203498#4575743 ) could mitigate the need for migration.

I've heard the larger plan is to move away from Hive and towards Presto

Is that indeed the plan?

Not quite! We'd love to be able to use Presto in production one day, and it may well work better than Hive, but Presto actually uses the Hive metastore (i.e. the database and table definitions), so we wouldn't get rid of Hive. Presto would just give us potentially better performance for some queries on Hive tables.
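
To make that concrete: the same table definition in the Hive metastore would be readable from both engines, roughly like this (illustrative only, with a made-up table name, and assuming Presto is set up with a "hive" catalog pointing at our metastore):

-- in Hive, executed as a YARN job:
SELECT COUNT(*) FROM wmf.some_table WHERE year = 2019;

-- in Presto, reading the same metastore table through its hive connector:
SELECT COUNT(*) FROM hive.wmf.some_table WHERE year = 2019;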

elukey added a comment. Mar 6 2019, 8:39 AM

It looks like CDH 6.1, which includes Hive 2.1.1, was released in December.
@elukey, what's the current thinking about deploying this? I'm sure there are many complexities: going from CDH 5 to 6 sounds like a tricky upgrade, I know there's been discussion of switching from CDH to Hortonworks or BigTop, and I've heard the larger plan is to move away from Hive and towards Presto anyway.

Hi! So we currently have no defined plan for CDH6, since we have a lot more pressing projects to complete:

  • Public Data Lake in cloud/labs - this is essentially a new Hadoop cluster (only with HDFS, no Yarn) on which we'll deploy Presto.
  • Improve Authentication and encryption of sensitive data in the Analytics Hadoop cluster (this seems easy, but it is basically a year-long project and will probably keep going for another two quarters).

The tricky part of migrating to CDH6 (assuming that we choose this distribution, see T203693) is that Hadoop is upgraded from 2.x to 3.x, and a ton of things will change (config files, settings, assumptions, etc..). We now have a Hadoop testing cluster in production that will surely help, but it is a project that will need a ton of hours of testing. Last but not least, Cloudera does not support Debian for CDH6, only Ubuntu. In theory the Ubuntu deb packages should work fine on Debian; in practice we could end up in a state where we are broken and there is no community support :)

To summarize, I see migrating to CDH6 as long term goal/plan. What we could do, if Presto turns out to be super fast/good in the Public Data Lake (and as flexible as Hive from the Data Analysts' point of view), is to deploy Presto in the Hadoop Analytics Cluster and offer it as an alternative to Hive. This would help in comparing performance, but probably not with other concerns like more UDFs etc..

Hope that the answer makes sense, we can discuss more how to prioritize this, and I'll need to have a longer chat with my team to get their opinion/thoughts too :)

I see migrating to CDH6 as long term goal/plan.

CDH6 might be the goal, but if it ends up being a brand new install (no clear upgrade path), I think it would still be worth considering other distributions, e.g. Bigtop or Hortonworks or even that cool new Hadoop distribution with better security primitives whose name I can't remember.

Anyway ya this is a humongo project indeed :)

elukey added a comment. Mar 6 2019, 2:24 PM

I see migrating to CDH6 as long term goal/plan.

CDH6 might be the goal, but if it ends up being a brand new install (no clear upgrade path), I think it would still be worth considering other distributions, e.g. Bigtop or Hortonworks or even that cool new Hadoop distribution with better security primitives whose name I can't remember.
Anyway ya this is a humongo project indeed :)

Yep yep, I agree, T203693 will eventually have a winner :) Hops is the brand new distribution (https://www.hops.io/), but I am not sure about its community/stability/etc.. Hortonworks has been bought by Cloudera, and after following BigTop's dev mailing list for some months I am not sure that the project is better than CDH. My initial opposition to CDH was the absence of deb sources, but I was wrong :)

Oh interesting right, I think you told me all of ^. Cool. :)

@Ottomata and @elukey, thanks for the great context! Everything you said makes sense, so I'll keep in mind not to expect this anytime soon 😁

In T203498#5004073, @elukey wrote:
What we could do, if Presto turns out to be super fast/good in the Public Data Lake (and as flexible as Hive from the Data Analysts' point of view), is to deploy Presto in the Hadoop Analytics Cluster and offer it as an alternative to Hive. This would help in comparing performance, but probably not with other concerns like more UDFs etc..

Well, as I understand it, it might actually help with those concerns! Since Presto isn't packaged with CDH, our choice of version wouldn't be as constrained as with Hive. So hopefully we would get Presto's latest set of functions, which from a quick browse look like a significant improvement on the Hive ones we currently have.
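
A couple of examples of what I mean (illustrative only: these are functions from the Presto documentation that Hive 1.1 doesn't have, and the table name is made up):

-- approximate distinct counts without an exact, expensive aggregation:
SELECT approx_distinct(user_hash) FROM hive.wmf.some_table;

-- built-in URL handling instead of regexp gymnastics:
SELECT url_extract_host('https://en.wikipedia.org/wiki/Apache_Hive') AS host;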

Tbayer updated the task description. Mar 19 2019, 1:23 AM
Tbayer moved this task from Triage to Tracking on the Product-Analytics board.