
Kerberize Superset to allow Presto queries
Closed, Resolved (Public)

Description

We are using Superset to display data quality metrics via Presto. Since Presto will be kerberized, like Hive, we'll need to kerberize Superset to allow Presto queries.

PyHive is the PyPI package used by Superset to query Hive and Presto, and luckily version 0.6.2 contains the fix to allow Kerberos with Presto. On the other hand, the last PyPI release seems to be 0.6.1, and the project appears to have been dead since then. For example, see https://github.com/dropbox/PyHive/issues/288

There is https://pypi.org/project/HerePyHive, which looks like a fork of the project, but all its links go back to Dropbox's PyHive, so it is probably a one-off build. This seems to be a long-term problem for our usage of Superset with Presto.

Event Timeline

We could ping Dropbox to see if they want to update PyPI?
We could also ping Superset to see if they want to change library? <-- Maybe a better bet.

mforns triaged this task as High priority. Dec 5 2019, 5:56 PM

In theory Superset should be able to move to https://github.com/prestodb/presto-python-client, we might be able to fix it ourselves and propose a pull request?

Oh! Cool.
Thanks for looking into this, @elukey!

Also opened https://github.com/prestodb/presto/issues/13910 to see if Presto upstream has any advice about how to proceed.

Sent an email to dev@superset: https://lists.apache.org/thread.html/rfd9d61e017bc643c898fba6add57c13e85037e1102414ecfb9df7a49%40%3Cdev.superset.apache.org%3E

It seems that a Superset committer works at Dropbox, so hopefully we'll see something moving in the next few days.

Somebody from Dropbox is willing to revive the repository! He asked for some help, so I filed a pull request for https://github.com/dropbox/PyHive/issues/288 to unblock the status of PyHive.

Things are moving: my PR to fix the Travis CI has been merged. Adding here what is needed, in my opinion:

PyHive

Superset

The following works even on Notebooks:

# !pip install git+https://github.com/dropbox/PyHive.git@437eefa7bceda1fd27051e5146e66cb8e4bdfea1
# !pip install requests-kerberos

import os
import socket
from pyhive import presto

def get_presto_cursor(
    host="an-coord1001.eqiad.wmnet",
    port=8281,
    catalog='analytics_hive',
    protocol='https',
    source='Presto Jupyter - {}@{}'.format(os.geteuid(), socket.gethostname()),
    ca_crt='/etc/presto/ca.crt.pem',
    KerberosConfigPath="/etc/krb5.conf",
    KerberosPrincipal="{}@WIKIMEDIA".format(os.getenv('USER')),
    KerberosRemoteServiceName="presto",
    KerberosCredentialCachePath='/tmp/krb5cc_{}'.format(os.geteuid())
):

    requests_kwargs = {
        'proxies': {
            'http': None, 'https': None
        }
    }
    if ca_crt:
        requests_kwargs['verify'] = ca_crt


    presto_cursor = presto.Cursor(
        host=host,
        port=port,
        catalog=catalog,
        protocol=protocol,
        source=source,
        requests_kwargs=requests_kwargs,
        KerberosConfigPath=KerberosConfigPath,
        KerberosPrincipal=KerberosPrincipal,
        KerberosRemoteServiceName=KerberosRemoteServiceName,
        KerberosCredentialCachePath=KerberosCredentialCachePath
    )

    return presto_cursor

cursor = get_presto_cursor()
cursor.execute('SHOW TABLES from event')
for row in cursor.fetchall():
    print(row)

PyHive 0.6.2 is not released yet but it looks promising.

SAML support does not seem to have gained traction in FAB (Flask-AppBuilder, used by Superset): https://github.com/dpgaspar/Flask-AppBuilder/issues/1028

Change 576861 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add kerberos credentials/config for Superset Staging

https://gerrit.wikimedia.org/r/576861

Change 576861 merged by Elukey:
[operations/puppet@production] Add kerberos credentials/config for Superset Staging

https://gerrit.wikimedia.org/r/576861

Change 576863 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add Presto Kerberos settings to Superset Staging

https://gerrit.wikimedia.org/r/576863

Change 576863 merged by Elukey:
[operations/puppet@production] Add Presto Kerberos settings to Superset Staging

https://gerrit.wikimedia.org/r/576863

Change 576907 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::superset: allow to deploy Presto TLS CA

https://gerrit.wikimedia.org/r/576907

Change 576907 merged by Elukey:
[operations/puppet@production] profile::superset: allow to deploy Presto TLS CA

https://gerrit.wikimedia.org/r/576907

I was able to force the latest version of the PyHive package into the Superset Staging venv on an-tool1005, and with the following config Superset seems to work fine with Presto + Kerberos + HTTPS with self-signed certs:

presto://an-coord1001.eqiad.wmnet:8281/analytics_hive?protocol=https

{
    "metadata_params": {},
    "engine_params": {
        "connect_args": {
            "KerberosConfigPath": "/etc/krb5.conf",
            "KerberosKeytabPath": "/etc/security/keytabs/superset/superset.keytab",
            "KerberosPrincipal": "superset/an-tool1005.eqiad.wmnet@WIKIMEDIA",
            "KerberosRemoteServiceName": "presto",
            "requests_kwargs": {
                "verify": "/etc/superset/presto_ca/ca.crt.pem"
            }
        }
    },
    "metadata_cache_timeout": {},
    "schemas_allowed_for_csv_upload": []
}
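
A JSON blob like this is easy to get subtly wrong when edited by hand, so it may help to generate it instead. The following is only a minimal sketch using the standard json module; the helper name build_presto_extra is hypothetical (it is not part of Superset or PyHive), and the keys and paths are simply the ones quoted in the config above:

```python
import json


def build_presto_extra(keytab, principal, ca_bundle,
                       krb5_conf="/etc/krb5.conf",
                       remote_service="presto"):
    """Return the database extra settings as a JSON string.

    Hypothetical helper: the key names mirror the config quoted above,
    nothing here is an official Superset or PyHive API.
    """
    extra = {
        "metadata_params": {},
        "engine_params": {
            "connect_args": {
                "KerberosConfigPath": krb5_conf,
                "KerberosKeytabPath": keytab,
                "KerberosPrincipal": principal,
                "KerberosRemoteServiceName": remote_service,
                # CA bundle used to verify the Presto TLS certificate.
                "requests_kwargs": {"verify": ca_bundle},
            }
        },
        "metadata_cache_timeout": {},
        "schemas_allowed_for_csv_upload": [],
    }
    # json.dumps never emits trailing commas, so the output always parses.
    return json.dumps(extra, indent=4)


extra_json = build_presto_extra(
    keytab="/etc/security/keytabs/superset/superset.keytab",
    principal="superset/an-tool1005.eqiad.wmnet@WIKIMEDIA",
    ca_bundle="/etc/superset/presto_ca/ca.crt.pem",
)
print(extra_json)
```

The printed string can then be pasted into the database configuration as-is.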

There is also an option in Superset, when configuring the Datasource, that forces Presto/Hive to impersonate users, and it seems to work as expected. I tailed the HDFS audit log on an-master1001 while enabling/disabling the option, and saw that the user proxied to Presto changes accordingly (my username when enabled, Superset's own user when disabled).

Once https://github.com/dropbox/PyHive/issues/288 is completed and PyHive 0.6.2 is released, it will be a matter of rebuilding the Superset venv with the new dependency and deploying!
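
When rebuilding the venv, a quick sanity check that the installed PyHive is at least 0.6.2 (the release that carries the Kerberos fix for Presto) could look like the sketch below. The function names are hypothetical, and the comparison deliberately ignores pre-release suffixes such as "dev0", which is an assumption that may not suit every case:

```python
import re


def parse_version(version):
    """Turn a release string such as '0.6.2' into a tuple of ints.

    Parsing stops at the first component that does not start with a
    digit, so suffixes like 'dev0' are ignored (a simplification).
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_presto_kerberos_fix(installed, minimum="0.6.2"):
    """Return True if the installed PyHive version is >= minimum."""
    # Tuples compare numerically, so 0.10.0 is correctly newer than 0.6.2.
    return parse_version(installed) >= parse_version(minimum)


for candidate in ("0.6.1", "0.6.2", "0.10.0"):
    print(candidate, has_presto_kerberos_fix(candidate))
```

Inside the venv, the installed version string could be obtained with importlib.metadata.version("PyHive") on Python 3.8 or later.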

Nuria added a subscriber: cchen.

Pinging @cchen: when this ticket is done, let's get together so we can show you how to build dashboards on top of data on Hadoop/Hive (without it having to be on superset). This will not work for gigantic tables, but it would work for most.

Change 580359 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::superset: add kerberos settings

https://gerrit.wikimedia.org/r/580359

Change 580359 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::superset: add kerberos settings

https://gerrit.wikimedia.org/r/580359

Change 580363 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/superset/deploy@master] Update dependencies with PyHive 0.6.2

https://gerrit.wikimedia.org/r/580363

Change 580363 merged by Elukey:
[analytics/superset/deploy@master] Update dependencies with PyHive 0.6.2

https://gerrit.wikimedia.org/r/580363

Everything is deployed on Superset, and the dashboards that were broken are now working fine!

I also tried the SQL Lab and it works really well. It could easily become a replacement for Hue on this front.

Authentication flow:

  • User authenticates via LDAP to superset.wikimedia.org
  • User is proxied by Superset to Presto (first impersonation)
  • User is proxied by Presto to HDFS (second impersonation)

Hence the user needs to be in a group like 'analytics-privatedata-users' to be able to display PII data on Superset.

Verified on HDFS logs:

ugi=elukey (auth:PROXY) via presto/an-presto1002.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
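
To check the impersonation chain across many audit log entries rather than by eye, the ugi field can be parsed with a regular expression. This is a rough sketch that assumes the field always has the shape of the line above; the pattern and the parse_ugi helper are hypothetical, and the non-proxied example used here is made up for illustration:

```python
import re

# Matches "ugi=<user> (auth:<method>)", optionally followed by
# " via <principal> (auth:<method>)" when the user is proxied.
UGI_PATTERN = re.compile(
    r"ugi=(?P<user>\S+) \(auth:(?P<user_auth>\w+)\)"
    r"(?: via (?P<proxy>\S+) \(auth:(?P<proxy_auth>\w+)\))?"
)


def parse_ugi(line):
    """Return a dict describing the ugi field, or None if it is absent."""
    match = UGI_PATTERN.search(line)
    if match is None:
        return None
    return match.groupdict()


sample = ("ugi=elukey (auth:PROXY) via "
          "presto/an-presto1002.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)")
info = parse_ugi(sample)
print(info["user"], "proxied by", info["proxy"])

# Hypothetical non-proxied entry: the proxy fields come back as None.
direct = parse_ugi("ugi=superset (auth:KERBEROS)")
print(direct["user"], "proxied by", direct["proxy"])
```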

@MoritzMuehlenhoff adding you to let you know about this use case; eventually it would be great to move to CAS.

elukey moved this task from In Progress to Done on the Analytics-Kanban board.