
No queries run in Hue
Closed, Resolved · Public

Description

@elukey

The exact same problem that was recently fixed in T238560: Doubts and questions about Kerberos and Hadoop has started happening again in Hue, and I am unable to run any queries.

Error while compiling statement: FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Event Timeline

Sigh :( Let's re-analyze logs then!

The error starts on the HiveServer2 side:

2020-01-09 02:17:20,476 WARN  security.UserGroupInformation (UserGroupInformation.java:doAs(1927)) - PriviledgedActionException as:mmiller (auth:PROXY) via hive/an-coord1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) cause:org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
2020-01-09 02:17:20,476 WARN  thrift.ThriftCLIService (ThriftCLIService.java:ExecuteStatement(510)) - Error executing statement:
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
[..]
Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: Peer indicated failure: DIGEST-MD5: IO error acquiring password
[..]

The error on the Metastore side is:

2020-01-09 02:17:19,471 ERROR transport.TSaslTransport (TSaslTransport.java:open(315)) - SASL negotiation failure
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: token expired or does not exist: HIVE_DELEGATION_TOKEN owner=mmiller, renewer=mmiller, realUser=hive/an-coord1001.eqiad.wmnet@WIKIMEDIA, issueDate=1578413485670, maxDate=1579018285670, sequenceNumber=184, masterKeyId=2]
        at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
        at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
        at org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283)
        at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
        at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge.java:794)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge.java:791)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1904)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingTransportFactory.getTransport(HadoopThriftAuthBridge.java:791)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: token expired or does not exist: HIVE_DELEGATION_TOKEN owner=mmiller, renewer=mmiller, realUser=hive/an-coord1001.eqiad.wmnet@WIKIMEDIA, issueDate=1578413485670, maxDate=1579018285670, sequenceNumber=184, masterKeyId=2
        at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager.retrievePassword(TokenStoreDelegationTokenSecretManager.java:114)
        at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager.retrievePassword(TokenStoreDelegationTokenSecretManager.java:56)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$SaslDigestCallbackHandler.getPassword(HadoopThriftAuthBridge.java:608)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$SaslDigestCallbackHandler.handle(HadoopThriftAuthBridge.java:639)
        at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
        ... 15 more
2020-01-09 02:17:19,471 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(297)) - Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: DIGEST-MD5: IO error acquiring password
        at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge.java:794)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge.java:791)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1904)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingTransportFactory.getTransport(HadoopThriftAuthBridge.java:791)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.thrift.transport.TTransportException: DIGEST-MD5: IO error acquiring password
        at org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
        at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
        at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
        ... 10 more

This bit is interesting:

issueDate=1578413485670, maxDate=1579018285670

issueDate = Tuesday, January 7, 2020 4:11:25.670 PM
maxDate = Tuesday, January 14, 2020 4:11:25.670 PM
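
For reference, those values are epoch timestamps in milliseconds; a quick way to double-check the conversion is a minimal Python sketch like the following (not part of the original debugging):

from datetime import datetime, timezone

# Delegation token timestamps from the Metastore error, in epoch milliseconds.
for label, ms in (("issueDate", 1578413485670), ("maxDate", 1579018285670)):
    # Convert milliseconds to seconds and render in UTC.
    print(label, datetime.fromtimestamp(ms / 1000, tz=timezone.utc))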

The token should not have expired, but I suspect this falls under the issue highlighted in https://mapr.com/support/s/article/Intermittent-Hue-Hiv-connectivity-issue-in-MapR-SASL-secure-cluster?language=en_US:

By default Hue doesn't close HS2 sessions. It's because Hue will not be able to get query result after session get closed (as a result, Hue will not be able to get query results for long running queries). When Hive metastore is restarted, the delegation token information will be lost for existing connection by Hue-Hs2 connection as delegation token is stored in memory by default. Hue only can restart session if it gets from HS2 message that session is expired ("Invalid SessionHandle", "Invalid session" or "Client session expired") and will fail with exception mentioned when old session is being used.
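
To make that failure mode concrete: the Hue→HiveServer2 session holds a Metastore delegation token, and the Metastore later has to look that token up again to validate the connection. A toy Python sketch of why a Metastore restart breaks this when the token store lives in memory (purely illustrative; these are not Hive's actual classes):

class InvalidToken(Exception):
    """Stand-in for SecretManager$InvalidToken ('token expired or does not exist')."""

class MemoryTokenStore:
    """Default behaviour: delegation tokens live only in the Metastore's memory."""
    def __init__(self):
        self.tokens = {}

    def add(self, ident, password):
        self.tokens[ident] = password

    def retrieve_password(self, ident):
        # A database-backed store would do this lookup against a persistent
        # table instead, so it would survive a Metastore restart.
        if ident not in self.tokens:
            raise InvalidToken("token expired or does not exist: " + ident)
        return self.tokens[ident]

token = "HIVE_DELEGATION_TOKEN owner=mmiller sequenceNumber=184"

store = MemoryTokenStore()
store.add(token, "secret")        # issued while Hue's HS2 session was opened

store = MemoryTokenStore()        # the Metastore restarts, in-memory state is wiped

try:
    store.retrieve_password(token)  # Hue's still-open session re-presents the old token
except InvalidToken as exc:
    print("SASL negotiation fails:", exc)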

In theory the change in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/562474/3/modules/cdh/templates/hive/hive-site.xml.erb should have fixed the problem; no idea why this is still happening.

To remove variables, I am reverting https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/563144/

This was added last week to limit huge Hue/Hive queries and prevent memory exhaustion on the Hue host, but the timeline suspiciously matches the start of the issue reported in this task.

Mentioned in SAL (#wikimedia-analytics) [2020-01-09T10:30:00Z] <elukey> revert hue's hive query limit and restart hue - T242306

@MMiller_WMF Hue now works for me, but as we discovered before, the errors affect only some users, so let me know whether it works for you now. I reverted a change that might be related; let's see if that resolves it.

elukey triaged this task as High priority.
elukey added a project: User-Elukey.

@elukey -- thanks for working on this. I am still receiving the same error.

I can see the following in an-coord1001's database:

MariaDB [hive_metastore]> select count(*) from DELEGATION_TOKENS;
+----------+
| count(*) |
+----------+
|     1927 |
+----------+
1 row in set (0.00 sec)

So, in theory, delegation tokens are now correctly stored in the database (as opposed to in memory).

@MMiller_WMF I found a way in Hue's UI to force the re-creation of the Hive session, which seemed to work for me (I was able to reproduce the issue):

Screen Shot 2020-01-10 at 11.29.09 AM.png (526×926 px, 33 KB)

You'd need to find the three dots at the top-right corner of Hue's Hive Editor panel (I am using Hue 3's UI; if you are using Hue 4 it will be displayed in a similar place) and hit "recreate" (or even "Close"). This should force Hue to request another Hive delegation token, making the query work.
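
For anyone who wants to sanity-check outside of Hue: a freshly opened HiveServer2 session negotiates its own credentials rather than reusing a stale one. A hypothetical sketch using PyHive (assuming the package and its SASL dependencies are installed, a valid Kerberos ticket from kinit, and HiveServer2 on an-coord1001 listening on the default port 10000):

from pyhive import hive

# A brand-new connection opens its own HS2 session, so it cannot reuse Hue's
# cached (and possibly stale) session or its delegation token.
conn = hive.connect(
    host="an-coord1001.eqiad.wmnet",  # HiveServer2 host referenced in the logs above
    port=10000,                       # default HiveServer2 port (assumption)
    auth="KERBEROS",
    kerberos_service_name="hive",
)
cur = conn.cursor()
cur.execute("SHOW DATABASES")
print(cur.fetchall())
conn.close()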

I have no idea whether this needs to be done only once or every time, but it is a workaround for the moment. Any feedback from you would be appreciated to help debug further :)

@elukey -- clicking "recreate" makes it work! Thank you.

Great, please report in this task if the problem re-occurs! I hope it will not, but I'm not sure.

Yesterday I restarted Hue to add (again) some Hive query limits to avoid out-of-memory issues; let me know if the error re-appears.

Adding @Seddon to this task since he reported the issue in #wikimedia-analytics :)

@elukey -- no re-occurrence. I am still doing fine in Hue. Thanks!

elukey set Final Story Points to 5.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.

FYI this happened again to me today, but I recreated the session using the instructions above and now it is working again.