PyHive ignores SET statements with a leading newline
Closed, Declined (Public)

Description

Test case

Try running this snippet on one of the analytics clients (make sure to kinit first):

import os
import pwd

from pyhive import hive

connect_kwargs = {
    "host": "analytics-hive.eqiad.wmnet",
    "auth": "KERBEROS",
    "username": pwd.getpwuid(os.getuid()).pw_name,
    "kerberos_service_name": "hive",
}

with hive.connect(**connect_kwargs) as conn:
    cursor = conn.cursor()
    
    cursor.execute("""
    SET a = 2
    """)
    
    cursor.execute("""
    SELECT '${hiveconf:a}'
    """)
    
    result = cursor.fetchall()
    
print(result)

The result is [('${hiveconf:a}',)], as if the SET statement never happened.

On the other hand, try the same thing but change the SET statement to remove the leading newline:

cursor.execute("""SET a = 2
""")

Now the result is [('2',)].

Solution

The most straightforward solution would be for Wmfdata to strip unnecessary whitespace from queries before submitting them.
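
A minimal sketch of that whitespace-stripping step, assuming it would be applied to the query string just before it reaches cursor.execute. The helper name clean_query is hypothetical and is not part of Wmfdata or PyHive:

```python
import textwrap


def clean_query(sql: str) -> str:
    """Return the query without common indentation or surrounding whitespace.

    Hypothetical helper (not an existing Wmfdata or PyHive function).
    Removing the leading newline is the part that matters for the
    SET statement in the test case above.
    """
    # dedent() removes indentation shared by all non-blank lines;
    # strip() then removes leading and trailing newlines and spaces.
    return textwrap.dedent(sql).strip()


query = """
    SET a = 2
    """

# The cleaned string starts directly with the SET keyword.
print(repr(clean_query(query)))
```

With a helper like this, the failing call in the test case would become cursor.execute(clean_query(...)), so the statement sent to Hive would match the form that worked.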

PyHive is no longer maintained, so it could be broadly useful to reimplement Wmfdata's Hive module using another library (avoiding pandas.read_sql in order to address T324135). Alternatively, we could simply deprecate the Hive module altogether. Hive's SQL-on-MapReduce functionality is officially deprecated, and it's likely that Presto and Spark are sufficient for our needs.

Event Timeline

Alternatively, we could simply deprecate the Hive module altogether. Hive's SQL-on-MapReduce functionality is officially deprecated, and it's likely that Presto and Spark are sufficient for our needs.

+1 to this approach. We are encouraging folks to move out of Hive and into Spark, so this is a good way to incentivize that further.

mpopov moved this task from Triage to Backlog on the Product-Analytics board.

We will be deprecating Hive soon (T384541).