
Refine should accept principal name for hive2 jdbc connection for DDL
Closed, ResolvedPublic5 Estimated Story Points

Description

We need to pass the Kerberos principal to DataFrameToHive.prepareHiveTable in order to get Refine working with Kerberos.

e.g.

jdbc:hive2://analytics1030.eqiad.wmnet:10000/default;principal=hive/_HOST@WIKIMEDIA
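
For illustration, a minimal Scala sketch of opening a DDL-capable connection over such a URL. It assumes the Hive JDBC driver is on the classpath and a valid Kerberos ticket already exists (kinit done); the hostname and realm are just the example values above, not a prescribed setup.

// Minimal sketch, not the refinery code: open a hive2 JDBC connection whose
// URL carries the Kerberos principal, then run a statement over it.
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")  // register the Hive JDBC driver

val url = "jdbc:hive2://analytics1030.eqiad.wmnet:10000/default;principal=hive/_HOST@WIKIMEDIA"
val connection = DriverManager.getConnection(url)
try {
  val statement = connection.createStatement()
  // DDL (CREATE TABLE / ALTER TABLE ...) would be issued through this statement.
  statement.execute("SHOW TABLES")
} finally {
  connection.close()
}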

Event Timeline

Ottomata created this task. Jul 17 2019, 3:43 PM
Restricted Application added a subscriber: Aklapper. Jul 17 2019, 3:43 PM
Ottomata updated the task description. Jul 17 2019, 3:43 PM
Milimetric triaged this task as High priority. Jul 18 2019, 4:32 PM
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

@elukey I was going to test this today... but I can't remember my Kerberos password! Can you reset it?

@Ottomata I deleted and recreated your principal; you should have an email with the temporary password to reset it :)

Change 526670 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] DataFrameToHive: Set hive_server_url using full URL

https://gerrit.wikimedia.org/r/526670

Change 526742 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/scap@master] Deploy refinery to an-tool1006 in Hadoop test cluster

https://gerrit.wikimedia.org/r/526742

Change 526743 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Deploy refinery to an-tool1006 in Hadoop test cluster

https://gerrit.wikimedia.org/r/526743

Change 526742 merged by Ottomata:
[analytics/refinery/scap@master] Deploy refinery to an-tool1006 in Hadoop test cluster

https://gerrit.wikimedia.org/r/526742

Change 526743 merged by Ottomata:
[operations/puppet@production] Deploy refinery to an-tool1006 in Hadoop test cluster

https://gerrit.wikimedia.org/r/526743

Ottomata set the point value for this task to 5.
elukey added a comment. Aug 6 2019, 9:52 AM

I was finally able to test Refine on analytics1030. The missing bit was --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native passed to spark-submit, but I am still not sure why it is needed (note that adding the same setting to the workers' spark-defaults does not seem to work if the client doesn't specify it).
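
For reference, a minimal sketch of the programmatic equivalent of that flag, assuming the SparkConf is built in code rather than passed via --conf on the command line (the application name below is made up):

// Hypothetical sketch: set the executor environment so executors can load the
// native Hadoop (Snappy) libraries, same effect as the spark-submit flag above.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executorEnv.LD_LIBRARY_PATH", "/usr/lib/hadoop/lib/native")

val spark = SparkSession.builder()
  .appName("refine-kerberos-test")  // hypothetical name
  .config(conf)
  .getOrCreate()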

Anyway, the refinery version deployed on the host leads to:

Failure(org.wikimedia.analytics.refinery.job.refine.RefineTargetException: Failed refinement of hdfs://analytics-test-hadoop/wmf/data/raw/eventlogging/eventlogging_NavigationTiming/hourly/2019/08/02/09 -> `event`.`NavigationTiming` (year=2019,month=8,day=2,hour=9). Original exception: java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://analytics1030.eqiad.wmnet:10000/default;user=analytics;password=: Peer indicated failure: Unsupported mechanism type PLAIN)

That is expected (but it was a good sanity check in my opinion). I tried to use the refinery jar in /home/otto, but I only get:

19/08/06 09:20:48 ERROR Refine: Failed refinement of dataset hdfs://analytics-test-hadoop/wmf/data/raw/eventlogging/eventlogging_NavigationTiming/hourly/2019/08/02/09 -> `event`.`NavigationTiming` (year=2019,month=8,day=2,hour=9).
java.lang.NullPointerException
        at org.wikimedia.analytics.refinery.spark.connectors.DataFrameToHive$.prepareHiveTable(DataFrameToHive.scala:246)
        at org.wikimedia.analytics.refinery.spark.connectors.DataFrameToHive$.apply(DataFrameToHive.scala:131)
        at org.wikimedia.analytics.refinery.job.refine.Refine$$anonfun$refineTargets$1.apply(Refine.scala:498)
        at org.wikimedia.analytics.refinery.job.refine.Refine$$anonfun$refineTargets$1.apply(Refine.scala:493)

I am probably not using the right jar, so I'll wait for instructions :)
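
As a side note, a purely hypothetical sketch (not the actual refinery code) of the kind of fail-fast check that would turn this NPE into a clearer error, assuming prepareHiveTable ends up dereferencing an unset hive server URL:

// Hypothetical: validate the hive server URL up front instead of letting a null
// reach the JDBC call and surface as a NullPointerException deep in the stack.
def requireHiveServerUrl(hiveServerUrl: String): String = {
  require(
    hiveServerUrl != null && hiveServerUrl.nonEmpty,
    "hive server URL must be set, e.g. jdbc:hive2://host:10000/default;principal=hive/_HOST@REALM"
  )
  hiveServerUrl
}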

I've been using an-tool1006, so the most recent .jar is there.

Without kinit, I get stuff like:

19/08/06 13:41:21 WARN Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
19/08/06 13:41:21 INFO ConfiguredRMFailoverProxyProvider: Failing over to analytics1029-eqiad-wmnet
19/08/06 13:41:21 INFO RetryInvocationHandler: Exception while invoking getClusterMetrics of class ApplicationClientProtocolPBClientImpl over analytics1029-eqiad-wmnet after 1 fail over attempts. Trying to fail over after sleeping for 2962ms.
java.net.ConnectException: Call From an-tool1006/10.64.5.32 to analytics1029.eqiad.wmnet:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Without spark.executorEnv.LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native, I get (as you did):

java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z

With LD_LIBRARY_PATH set, I get

19/08/06 13:46:46 WARN TransportChannelHandler: Exception in connection from /10.64.36.132:55320
java.lang.IllegalArgumentException: Frame length should be positive: -3237550063764322596
	at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
	at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
...
19/08/06 13:46:46 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 7201341704792560703
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
...
	at org.apache.spark.rpc.netty.RequestMessage$.readRpcAddress(NettyRpcEnv.scala:593)
...
	at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:85)
...

19/08/06 13:46:46 ERROR TransportRequestHandler: Error sending result RpcFailure{requestId=7543663174060120626, errorString=java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
...
} to /10.64.36.133:55882; closing connection
java.io.IOException: Broken pipe
...

when the job transitions from ACCEPTED to RUNNING. The job does eventually succeed:

19/08/06 13:49:31 INFO Refine: Finished refinement of dataset hdfs://analytics-test-hadoop/wmf/data/raw/eventlogging/eventlogging_NavigationTiming/hourly/2019/08/05/13 -> `otto`.`NavigationTiming` (year=2019,month=8,day=5,hour=13). (# refined records: 50847)
19/08/06 13:49:31 INFO Refine: Successfully refined 1 of 1 dataset partitions into table `otto`.`NavigationTiming` (total # refined records: 50847)

... but only after what looks like a long loop in which executors are spawned, sit idle, and are removed, with messages like:

19/08/06 13:47:37 INFO ExecutorAllocationManager: New executor 45 has registered (new total is 42)
...
19/08/06 13:48:51 INFO ExecutorAllocationManager: Removing executor 45 because it has been idle for 60 seconds (new desired total will be 164)
...
19/08/06 13:48:51 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 6730307001701133966
java.io.EOFException

I think maybe in YARN the executors can't do RPC with the master (driver) process?

If I disable RPC auth with --conf "spark.authenticate=false" --conf "spark.shuffle.service.enabled=false" --conf "spark.dynamicAllocation.enabled=false" --conf "spark.network.crypto.enabled=false" --conf "spark.authenticate.enableSaslEncryption=false", everything works*!

19/08/06 14:04:08 INFO Refine: Successfully refined 1 of 1 dataset partitions into table `otto`.`NavigationTiming` (total # refined records: 53482)

*I did get a "Container killed by YARN for exceeding memory limits. 2.0 GB of 2 GB physical memory used." error along the way, but the executor was relaunched and the job finished. I think this is just a small-test-cluster problem, not a functionality problem.

Ah, but those confs disable dynamicAllocation and the shuffle service... which we don't really want to do. If I only set --conf "spark.authenticate=false" --conf "spark.network.crypto.enabled=false" --conf "spark.authenticate.enableSaslEncryption=false", I get:

19/08/06 14:06:34 ERROR YarnScheduler: Lost executor 1 on analytics1040.eqiad.wmnet: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)

and the job fails.

elukey added a comment. Aug 6 2019, 2:09 PM

The last failure is expected: in the test cluster the shuffle service requires authentication (Spark's native RPC auth or SASL), so it fails when that is disabled.

Can you add the command in here so I can repro?

/usr/bin/spark2-submit \
--name otto_refine0 \
--master yarn \
--class org.wikimedia.analytics.refinery.job.refine.Refine \
--conf "spark.authenticate=false" --conf "spark.shuffle.service.enabled=false" --conf "spark.dynamicAllocation.enabled=false" --conf "spark.network.crypto.enabled=false" --conf "spark.authenticate.enableSaslEncryption=false" \
--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native \
--conf spark.driver.extraClassPath=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:/srv/deployment/analytics/refinery/artifacts/hive-jdbc-1.1.0-cdh5.10.0.jar:/srv/deployment/analytics/refinery/artifacts/hive-service-1.1.0-cdh5.10.0.jar \
--driver-java-options='-Dhttp.proxyHost=webproxy.eqiad.wmnet -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy.eqiad.wmnet -Dhttps.proxyPort=8080 -Drefinery.log.level=DEBUG' \
/home/otto/refinery-source/refinery-job/target/refinery-job-0.0.97-SNAPSHOT.jar \
--database=otto \
--input_path=/wmf/data/raw/eventlogging \
--input_path_regex='eventlogging_(.+)/hourly/(\d+)/(\d+)/(\d+)/(\d+)' \
--input_path_regex_capture_groups='table,year,month,day,hour' \
--output_path=/user/otto/external/eventlogging14 \
--schema_base_uri=eventlogging  \
--table_whitelist_regex='^NavigationTiming$' \
--transform_functions='org.wikimedia.analytics.refinery.job.refine.deduplicate_eventlogging,org.wikimedia.analytics.refinery.job.refine.geocode_ip,org.wikimedia.analytics.refinery.job.refine.eventlogging_filter_is_allowed_hostname' \
--since=24 --until 4 \
--limit 1 \
--ignore_failure_flag=true --ignore_success_flag=true

After https://gerrit.wikimedia.org/r/c/operations/puppet/+/528483, YARN client, YARN cluster, and local mode all work great!

Change 526670 merged by Ottomata:
[analytics/refinery/source@master] Refine: infer hiveServerUrl from config

https://gerrit.wikimedia.org/r/526670
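
For context on that change, a hypothetical Scala sketch of inferring a hive2 JDBC URL from the configuration already available to the Spark job. The config keys, fallbacks, and port below are assumptions for illustration, not the actual logic of the merged patch.

// Hypothetical sketch: derive a HiveServer2 URL from Hadoop/Hive config,
// appending the Kerberos principal when one is configured.
import org.apache.spark.sql.SparkSession

def inferHiveServerUrl(spark: SparkSession): String = {
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  // e.g. "thrift://analytics1030.eqiad.wmnet:9083" -> "analytics1030.eqiad.wmnet"
  val host = hadoopConf.get("hive.metastore.uris", "thrift://localhost:9083")
    .stripPrefix("thrift://").split(":").head
  val principal = Option(hadoopConf.get("hive.server2.authentication.kerberos.principal"))
  val base = s"jdbc:hive2://$host:10000/default"
  principal.fold(base)(p => s"$base;principal=$p")
}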

Nuria closed this task as Resolved. Aug 25 2019, 8:54 AM

Change 554473 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::refinery::job::spark_job: allow to pass a keytab

https://gerrit.wikimedia.org/r/554473

Change 554473 merged by Elukey:
[operations/puppet@production] profile::refinery::job::spark_job: allow to pass a keytab

https://gerrit.wikimedia.org/r/554473