
Access to HUE for Mayakpwiki
Closed, Resolved | Public | 1 Estimated Story Points

Description

Wikitech username: Mayakpwiki
preferred shell username: Mayakpwiki
developer access username / Instance shell account name in preferences: Mayakpwiki
Full name: Maya Kampurath

REQUEST : I would like to get access to HUE to be able to explore and query our Data Lake. I am a contractor working as a Data Quality Analyst in the Product Analytics team and Kate Zimmerman is my manager.

I have signed the NDA with Legal and have also been added to the NDA group.

Reference task where my initial access was set up: T227633: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003, and notebook1004] and groups for Mayakpwiki

Event Timeline

@Mayakp.wiki the NDA group will give you access to Hue. The best place to do your work is probably Jupyter notebooks, as they are intended as a repository of queries and work to share with others.

@Nuria: It was worked out on IRC that they probably need their Hue account created, since they already have NDA LDAP access, see: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Access#HTTP_Access

@Mayakp.wiki just a comment on hue. It might not be the best tool for querying the data lake. We (as in the analytics team) prefer using either hive/beeline directly or jupyter notebooks.

Hue's UI is so bad.

fdans moved this task from Incoming to Operational Excellence on the Analytics board.
fdans moved this task from Operational Excellence to Ops Week on the Analytics board.

This seems to be something that the Analytics team needs to handle directly, rather than ops clinic duty, as the directions for HUE require someone who is already an admin on it to grant access to others.

(If this isn't the case, and it should be handled by clinic duty, please state such!)

@fdans: Yes, I will be using Jupyter notebooks for the most part, but I would like HUE access for simple queries, such as validating the metric values shown on Turnilo/Superset dashboards. I also find the Hue UI good for viewing sample data in a table, so it would be useful for these small checks.
Thanks and please let me know where we stand on the access.

@Mayakp.wiki Hue has no ability to connect to Druid (the datastore that powers both Superset and Turnilo); it can only connect to the Hive datastore.

To see sample data from a table, this is all the code needed in a Jupyter notebook to connect to Hive:

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

# Reuse the notebook's Spark context if one is already running
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# "blah" is a placeholder; replace it with a real table name
df = spark.sql("select * from blah")
# Print the first 20 rows without truncating long values
df.show(20, False)
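
For repeated spot checks, the snippet above could be wrapped in a small helper. The following is only a sketch under stated assumptions: `build_sample_query` and `show_sample` are hypothetical names, not part of any Wikimedia tooling, and `some_db.some_table` is a placeholder. The query builder runs without Spark; `show_sample` expects an existing `SparkSession` such as the `spark` object created above.

```python
import re

# Hypothetical helper names; nothing here is an existing Wikimedia or PySpark API
# except the spark.sql(...).show(...) call inside show_sample.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def build_sample_query(table, limit=20):
    """Return a SELECT statement that reads at most `limit` rows from `table`."""
    # Accept only "table" or "database.table" so the name cannot smuggle in extra SQL
    if not _TABLE_NAME.fullmatch(table):
        raise ValueError("not a valid table name: %r" % (table,))
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return "SELECT * FROM %s LIMIT %d" % (table, limit)


def show_sample(spark, table, limit=20):
    """Print a sample of `table` using an already created SparkSession."""
    # False disables truncation of long column values
    spark.sql(build_sample_query(table, limit)).show(limit, False)
```

In a notebook this would be called as `show_sample(spark, "some_db.some_table")`, with the placeholder replaced by a table you have access to.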

@Mayakp.wiki please give Jupyter a try, and I will check on my end what is needed for access.

Thanks @Nuria for the query and suggestion. I will use Jupyter and Beeline in the meantime. Please let me know whenever my HUE access is granted. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Access#HTTP_Access
Thanks for your help with this.

Following up on this request : I have been able to use Jupyter notebooks for some of my work. However, I would still like to get access to HUE for running small, simple queries on hive tables. Thanks!

Let me just support Maya's request here. I work primarily in JupyterLab, but I still use Hue frequently for various things:

  • Running quick queries or exploring the Data Lake (since Hue has a nice graphical table explorer, autocompletion, and a query history)
  • Checking Oozie workflows and jobs

From a security standpoint, there is no difference since Maya already has full data access via Jupyter/SSH.

Action has been taken that should have granted access to shell username Mayakpwiki.
@Mayakp.wiki can you test please? :)

Checked connection and ran queries against mediawiki history. Access is working as expected. Thanks @JAllemandou and @Nuria for your help !

JAllemandou set the point value for this task to 1.