Page MenuHomePhabricator

Hive join fails when using a HiveServer2 client
Closed, ResolvedPublic3 Estimated Story Points

Description

Attempting to run the following query using a HiveServer2 client (such as Beeline, Hue, or Impyla) fails with the message "Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask". However, it works perfectly using the hive client.

select
    ug_user,
    log_type
from wmf_raw.mediawiki_user_groups ug
join wmf_raw.mediawiki_logging log
on ug_user = log_user
where
    log.wiki_db = "guwiki" and
    ug.wiki_db = "guwiki" and
    log.snapshot = "2018-08" and
    ug.snapshot = "2018-08"

The full output when using beeline in verbose mode is:

INFO  : Compiling command(queryId=hive_20181005005555_e6e1c3cc-e854-4cf2-8fed-8a5f990189b6): 
{{query}}
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:ug_user, type:bigint, comment:null), FieldSchema(name:log_type, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20181005005555_e6e1c3cc-e854-4cf2-8fed-8a5f990189b6); Time taken: 0.137 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20181005005555_e6e1c3cc-e854-4cf2-8fed-8a5f990189b6): 
{{query}}
INFO  : Query ID = hive_20181005005555_e6e1c3cc-e854-4cf2-8fed-8a5f990189b6
INFO  : Total jobs = 1
INFO  : Starting task [Stage-4:MAPREDLOCAL] in serial mode
ERROR : Execution failed with exit status: 1
ERROR : Obtaining error information
ERROR : 
Task failed!
Task ID:
  Stage-4

Logs:

ERROR : /var/log/hive/hive-server2.log
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
INFO  : Completed executing command(queryId=hive_20181005005555_e6e1c3cc-e854-4cf2-8fed-8a5f990189b6); Time taken: 1.069 seconds
Getting log thread is interrupted, since query is done!
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask (state=08S01,code=1)
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:284)
	at org.apache.hive.beeline.Commands.executeInternal(Commands.java:986)
	at org.apache.hive.beeline.Commands.execute(Commands.java:1158)
	at org.apache.hive.beeline.Commands.sql(Commands.java:1072)
	at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1172)
	at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1003)
	at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:915)
	at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:511)
	at org.apache.hive.beeline.BeeLine.main(BeeLine.java:494)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

As @Ottomata suggested, running SET hive.auto.convert.join=false; (as described here) before the query makes it work properly.

It would be good to know what exactly went wrong and that setting fixes it so we can avoid problems in the future.

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald TranscriptOct 5 2018, 1:08 AM

Some more info on the issue as I understand it: Given the data size to work and the fact that hive.auto.convert.join is set to true by default, Hive converts the join to happen into a Map side join, and even more, into a LOCAL map side join. This means that the join happens locally on the client instead of happening onto a cluster container. When using the hive client, local memory can be configured using HADOOP_HEAPSIZE en variable. However when using hue or beeline, local actually means in HiveServer2.
The solution provided above (SET hive.auto.convert.join=false;) will work in any case, but it could interesting to try to raise HiveServer2 available memory and see if it solves the issue.

@Neil_P._Quinn_WMF, can you please use hive client instead of beeline.
We know it says it's deprecated in favor of beeline, but we still believe that hive is better for now.
At least better than the beeline version that we have installed.
We'll follow up with that.

I will decline this task, please reopen if your query continues to fail on the hive client. Thanks!

@Neil_P._Quinn_WMF, can you please use hive client instead of beeline.
We know it says it's deprecated in favor of beeline, but we still believe that hive is better for now.
At least better than the beeline version that we have installed.
We'll follow up with that.

@mforns unfortunately, the problems affects not just beeline but all HiveServer2 clients.

So this is also a problem for querying through SWAP or Hue (which I use much more than either command line client), and it would be ideal to fix it at the source. If that's not possible, then you can decline it again and we can use the workaround, but just be aware that the cost is non-trivial.

Some more info on the issue as I understand it: Given the data size to work and the fact that hive.auto.convert.join is set to true by default, Hive converts the join to happen into a Map side join, and even more, into a LOCAL map side join. This means that the join happens locally on the client instead of happening onto a cluster container. When using the hive client, local memory can be configured using HADOOP_HEAPSIZE en variable. However when using hue or beeline, local actually means in HiveServer2.
The solution provided above (SET hive.auto.convert.join=false;) will work in any case, but it could interesting to try to raise HiveServer2 available memory and see if it solves the issue.

Thanks @JAllemandou! Do you think you could add this information to wikitech:Analytics/Systems/Cluster/Hive/Queries#Troubleshooting?

FWIW, I encountered the same kind of error in beeline around the same time last week (October 2 I believe). Below is the query and the log (reproduced today). The same query works fine in Hive, and SET hive.auto.convert.join=false;fixes it in beeline as well.

This error surprised me a bit, considering that the query didn't seem to involve a very extreme amount of data: one partition from each table, with just 40 rows for the first table and 16 million rows for the second. But perhaps the JOIN was still expensive?

SELECT rd.year, rd.month, rd.day, rd.hour,
pi.pageToken, rd.PageToken FROM (
  SELECT year, month, day, hour, event.pageToken AS pageToken
  FROM event.pageissues 
  WHERE year = 2018 AND month = 9 AND day = 30 AND hour = 0
  AND event.action = 'pageLoaded') AS pi
JOIN (
  SELECT year, month, day, hour, event.pageToken AS pageToken
  FROM event.readingdepth
  WHERE year = 2018 AND month = 9 AND day = 30 AND hour = 0
  AND event.action = 'pageLoaded'
  AND ( event.page_issues_a_sample OR event.page_issues_b_sample )) AS rd
ON (pi.hour = rd.hour 
AND pi.day = rd.day
AND pi.month = rd.month
AND pi.year = rd.year
AND pi.pageToken = rd.PageToken)
LIMIT 20;

INFO  : Compiling command(queryId=hive_20181012024646_e7d46866-2abb-492b-9692-096d74109307): SELECT rd.year, rd.month, rd.day, rd.hour,
pi.pageToken, rd.PageToken FROM (
  SELECT year, month, day, hour, event.pageToken AS pageToken
  FROM event.pageissues 
  WHERE year = 2018 AND month = 9 AND day = 30 AND hour = 0
  AND event.action = 'pageLoaded') AS pi
JOIN (
  SELECT year, month, day, hour, event.pageToken AS pageToken
  FROM event.readingdepth
  WHERE year = 2018 AND month = 9 AND day = 30 AND hour = 0
  AND event.action = 'pageLoaded'
  AND ( event.page_issues_a_sample OR event.page_issues_b_sample )) AS rd
ON (pi.hour = rd.hour 
AND pi.day = rd.day
AND pi.month = rd.month
AND pi.year = rd.year
AND pi.pageToken = rd.PageToken)
LIMIT 20
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:rd.year, type:bigint, comment:null), FieldSchema(name:rd.month, type:bigint, comment:null), FieldSchema(name:rd.day, type:bigint, comment:null), FieldSchema(name:rd.hour, type:bigint, comment:null), FieldSchema(name:pi.pagetoken, type:string, comment:null), FieldSchema(name:rd.pagetoken, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20181012024646_e7d46866-2abb-492b-9692-096d74109307); Time taken: 0.113 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20181012024646_e7d46866-2abb-492b-9692-096d74109307): SELECT rd.year, rd.month, rd.day, rd.hour,
pi.pageToken, rd.PageToken FROM (
  SELECT year, month, day, hour, event.pageToken AS pageToken
  FROM event.pageissues 
  WHERE year = 2018 AND month = 9 AND day = 30 AND hour = 0
  AND event.action = 'pageLoaded') AS pi
JOIN (
  SELECT year, month, day, hour, event.pageToken AS pageToken
  FROM event.readingdepth
  WHERE year = 2018 AND month = 9 AND day = 30 AND hour = 0
  AND event.action = 'pageLoaded'
  AND ( event.page_issues_a_sample OR event.page_issues_b_sample )) AS rd
ON (pi.hour = rd.hour 
AND pi.day = rd.day
AND pi.month = rd.month
AND pi.year = rd.year
AND pi.pageToken = rd.PageToken)
LIMIT 20
INFO  : Query ID = hive_20181012024646_e7d46866-2abb-492b-9692-096d74109307
INFO  : Total jobs = 1
INFO  : Starting task [Stage-4:MAPREDLOCAL] in serial mode
ERROR : Execution failed with exit status: 1
ERROR : Obtaining error information
ERROR : 
Task failed!
Task ID:
  Stage-4

Logs:

ERROR : /var/log/hive/hive-server2.log
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
INFO  : Completed executing command(queryId=hive_20181012024646_e7d46866-2abb-492b-9692-096d74109307); Time taken: 0.766 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask (state=08S01,code=1)

Info added in wikitech for future reference!

fdans subscribed.

@fdans thank you! Is it worth pursuing @joal's suggestion? ("it could interesting to try to raise HiveServer2 available memory and see if it solves the issue.")

Otherwise, now that it's documented, I don't know of anything else to do, so this could probably be closed.

Neil, today we added (finally! \o/) JVM metrics to the Hive server, you can see them in https://grafana.wikimedia.org/dashboard/db/hive?orgId=1

It doesn't seem that we are suffering a JVM heap exhaustion, so Joseph is digging into more in depth Hadoop settings that might cause the problem.

Reopening because we have an idea:

We will set SET hive.auto.convert.join=false; in hive-server2 only, (not for hive CLI clients). This should make the default for clients that use hive-server (like notebooks) to have this turned off.

Reopening because we have an idea:

We will set SET hive.auto.convert.join=false; in hive-server2 only, (not for hive CLI clients). This should make the default for clients that use hive-server (like notebooks) to have this turned off.

Tis is the prefered solution as of now. I have tracked down the problem as being a hive-internal issue:
When issuing an MAPREDLOCAL task (a mapreduce task processed locally on the machine instead of on the cluster, for performance optimization), the hive CLI logs shows:

2018-10-25 19:28:46,993 INFO  mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(318)) - Executing: /usr/lib/hadoop/bin/hadoop jar /usr/lib/hive/lib/hive-exec-1.1.0-cdh5.10.0.jar org.apache.hadoop.hive.ql.exec.mr.ExecDriver -localtask -plan  ......

while the HiveServer2 logs shows:

2018-10-25 11:33:40,559 INFO  mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(318)) - Executing: /usr/lib/hadoop/bin/hadoop jar /usr/lib/hive/lib/hive-common-1.1.0-cdh5.10.0.jar org.apache.hadoop.hive.ql.exec.mr.ExecDriver -localtask  -plan .......

The thing to notice is the difference in jar file: CLI uses hive-exec-1.1.0-cdh5.10.0.jar while HiveServer2 uses hive-common-1.1.0-cdh5.10.0.jar.
The problem for HiveServer2 is that the stated jar doesn't contain the class it tries to run.
It would be interesting to track the problem further, but the simple solution of disabling join auto-convert is really a no-brainer for today.

JAllemandou moved this task from Next Up to Done on the Analytics-Kanban board.
JAllemandou set the point value for this task to 3.