Test case
Try running this snippet on one of the analytics clients (make sure to kinit first):
    import os
    import pwd

    from pyhive import hive

    connect_kwargs = {
        "host": "analytics-hive.eqiad.wmnet",
        "auth": "KERBEROS",
        "username": pwd.getpwuid(os.getuid()).pw_name,
        "kerberos_service_name": "hive",
    }

    with hive.connect(**connect_kwargs) as conn:
        cursor = conn.cursor()
        # Note the newline immediately after the opening triple quote.
        cursor.execute("""
            SET a = 2
        """)
        cursor.execute("""
            SELECT '${hiveconf:a}'
        """)
        result = cursor.fetchall()

    result
The result is [('${hiveconf:a}',)], as if the SET statement never happened.
On the other hand, try the same thing but change the SET statement to remove the leading newline:
    cursor.execute("""SET a = 2 """)
Now the result is [('2',)].
Solution
The most straightforward solution would be for Wmfdata to strip unnecessary whitespace from queries before submitting them.
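A minimal sketch of that approach, assuming a hypothetical helper named normalize_query (it is not an existing Wmfdata or PyHive function) that would be applied to each query string before it is passed to cursor.execute. It only trims leading and trailing whitespace, which is enough to remove the leading newline from the test case; whether any other whitespace triggers the problem has not been checked here.

```python
def normalize_query(query: str) -> str:
    """Return the query with leading and trailing whitespace removed.

    Hypothetical helper for illustration; the name and placement are
    assumptions, not part of Wmfdata's current API.
    """
    return query.strip()


# The failing statement from the test case, with its leading newline.
original = """
    SET a = 2
"""

cleaned = normalize_query(original)
print(repr(cleaned))  # 'SET a = 2'
```

With this in place, the call in the test case would become cursor.execute(normalize_query(...)), so the statement reaches Hive in the same form as the variant without the leading newline.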
PyHive is no longer maintained, so it could be broadly useful to reimplement Wmfdata's Hive module using another library (avoiding pandas.read_sql in order to address T324135). Alternatively, we could simply deprecate the Hive module altogether. Hive's SQL-on-MapReduce functionality is officially deprecated, and it's likely that Presto and Spark are sufficient for our needs.