Integrate Superset with DataHub
Closed, ResolvedPublic3 Estimated Story Points

Assigned To
Authored By
BTullis
Apr 26 2022, 2:11 PM

Description

It would be really useful to be able to ingest from Superset into our DataHub MVP.

https://datahubproject.io/docs/metadata-ingestion/source_docs/superset

This creates chart and dashboard objects in DataHub with associated metadata.

Event Timeline

EChetty set the point value for this task to 3. (May 5 2022, 4:16 PM)

I did a little work on this, but it is complicated somewhat by the way that we have authentication configured for Superset.

With Superset we set AUTH_TYPE = AUTH_REMOTE_USER, which means that we trust the gunicorn web server to authenticate the user and then set the REMOTE_USER header to the shell login name.

gunicorn itself offloads authentication to Apereo CAS for single sign-on.

So our users are created in the Superset database, but have no password assigned.

The Superset plugin for DataHub appears to support only the db provider (username/password) or the ldap provider. I can't find a way either to bypass the authentication requirements or to select something that is compatible with our setup.

I tried manually modifying the configuration of the staging environment on an-tool1005 (in /etc/superset/superset_config.py) in order to make it work, by:

  • Adding from flask_appbuilder.security.manager import AUTH_DB to the top of the file
  • Changing AUTH_TYPE = AUTH_REMOTE_USER to AUTH_TYPE = AUTH_DB
  • Removing ENABLE_PROXY_FIX = True and ADDITIONAL_MIDDLEWARE = [RemoteUserMiddleware, ]

I then restarted the superset service and used back-to-back SSH tunnels so that my ingestion recipe could contact an-tool1005 on port 9080.
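
For reference, the relevant part of the modified /etc/superset/superset_config.py looked roughly like this (a sketch of the changed lines only, not the full file):

# /etc/superset/superset_config.py on an-tool1005 (temporary staging change - sketch only)
from flask_appbuilder.security.manager import AUTH_DB

# Previously: AUTH_TYPE = AUTH_REMOTE_USER, trusting the web server to set REMOTE_USER
AUTH_TYPE = AUTH_DB

# Removed for the test:
# ENABLE_PROXY_FIX = True
# ADDITIONAL_MIDDLEWARE = [RemoteUserMiddleware, ]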

There is an admin user in the database and I even tried resetting the password for that user, but I couldn't log in.


I then tried an alternative approach, which was to modify the superset plugin to add an X-Remote-User header set to btullis.

Once again this didn't work. There might be a way, but I would need to experiment a little more.
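
For the record, the kind of change I was experimenting with amounted to forcing the header onto the plugin's HTTP session, something like this (hypothetical sketch only; the real plugin code is structured differently):

# Hypothetical sketch: pretend to be an already-authenticated user by setting the header
# that gunicorn/CAS would normally add for us.
import requests

session = requests.Session()
session.headers.update({"X-Remote-User": "btullis"})
# e.g. listing dashboards via the Superset REST API on the tunnelled staging instance
response = session.get("http://localhost:9080/api/v1/dashboard/")
print(response.status_code)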

I can also ask on the DataHub Slack to see if anyone has already tackled this.

Good progress on this now. I have performed a manual ingestion from Superset. We now have metadata about 2,001 charts and 221 dashboards, as well as references to presto sources and lineage information.

image.png (868×975 px, 94 KB)

image.png (639×840 px, 83 KB)

Clicking through to a dashboard shows which charts are included.
image.png (898×615 px, 80 KB)

We also have lineage information, showing the presto->chart->dashboard chain.
image.png (850×1 px, 96 KB)

There were a few 400 errors on import, which I will log for analysis.
I will also note the procedure so that we can work out how best to automate it.

The approach I took to carry out the ingestion was this:

Run a local superset instance
  • activate my stacked conda environment on stat1008 - this is the same environment where I was using the datahub CLI, but it needn't be
  • configure the http proxy as per the instructions
  • enable the temporary ca-certificates workaround as per the instructions (export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt)
  • install superset with: pip install apache-superset==1.4.2
  • add the mysqlclient dependency with: pip install mysqlclient
  • create a ~/superset_config.py in my home directory, containing the variables: SECRET_KEY and SQLALCHEMY_DATABASE_URI that were copied from an-tool1010:/etc/superset/superset_config.py
  • execute superset run -p 8088 --with-threads and leave this running in the foreground until the ingestion is complete

This instance of Superset connects to the same database, but uses AUTH_TYPE=AUTH_DB and therefore allows us to connect as the admin user.
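
The local ~/superset_config.py only needs a handful of settings. Roughly (values redacted, sketch only):

# ~/superset_config.py - minimal config for the local, ingestion-only Superset instance (sketch)
from flask_appbuilder.security.manager import AUTH_DB

# Copied from an-tool1010:/etc/superset/superset_config.py (values redacted here)
SECRET_KEY = '<redacted>'
SQLALCHEMY_DATABASE_URI = 'mysql://<user>:<password>@<host>/<database>'

# Not strictly needed: AUTH_DB is the Flask-AppBuilder default, but it makes the intent explicit.
AUTH_TYPE = AUTH_DB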

Run the ingestion

The recipe for ingestion was as follows.

source:
  type: "superset"
  config:
    connect_uri: "http://localhost:8088"
    username: admin
    password: <obtained from private repo /srv/private/hieradata/role/common/analytics_cluster/ui/superset.yaml>
    provider: db

sink:
  type: "datahub-rest"
  config:
    server: 'https://datahub-gms.discovery.wmnet:30443'

One notable issue with this method is that the View in Superset link doesn't work correctly. It references the connect_uri of http://localhost:8088 instead of https://superset.wikimedia.org - I will work on fixing this.

BTullis triaged this task as Medium priority.May 26 2022, 10:39 AM
BTullis moved this task from Next Up to Done on the Data-Engineering-Kanban board.

I have created a paste with the errors that occurred during ingestion: P28592

There were some 500 errors with invalid character sequences, but there were also some 400 errors.

The 400 errors reference the charts and dashboards that failed to ingest.

Most of these exhibit an error similar to this:
urn:li:dataset:(urn:li:dataPlatform:presto,presto_analytics_hive.blahblah,PROD)" is invalid: Failed to convert urn to entity key: urns parts and key fields do not have same length

From a quick look, the common factor seems to be a comma somewhere in the metadata (most likely in a dataset name), which results in an invalid urn.
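
To illustrate what I think is going on: the dataset key has three fields and the urn tuple is split on commas, so a comma inside the table name produces too many parts. A simplified sketch (not the actual DataHub parsing code):

# Simplified illustration of the urn failure; the real parsing lives in DataHub's Urn classes.
key_fields = ["platform", "name", "origin"]   # dataset key: three fields

good = "urn:li:dataPlatform:presto,presto_analytics_hive.some_table,PROD"
bad = "urn:li:dataPlatform:presto,presto_analytics_hive.map_col['a','b'],PROD"  # hypothetical comma in the name

for urn_body in (good, bad):
    parts = urn_body.split(",")
    # good -> 3 parts, matches the key; bad -> 4 parts, hence
    # "urns parts and key fields do not have same length"
    print(len(parts), len(key_fields))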

Idea:

What about running a second instance of Superset on the same node, accessible only via the backend network, but with AUTH_TYPE = AUTH_DB in its superset_config.py? This instance could be used for internal API access only.

Change 812023 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/datahub@wmf] Add base_uri config parameter to Superset source

https://gerrit.wikimedia.org/r/812023

Ah, it wasn't showing up 'cause it's closed :)

Idea:

What about running a second instance of Superset on the same node, accessible only via the backend network, but with AUTH_TYPE = AUTH_DB in its superset_config.py? This instance could be used for internal API access only.

Thanks, yes I think that this is a possible solution. That might be easier than spinning up a local superset process along with the ingester each time.

I think we will still need the base_uri option that @Milimetric has been working on, because the connect_uri for this private service would be different from https://superset.wikimedia.org

We've decided that we will test the functionality with a manual ingestion run from a stat box, using the patch that @Milimetric has made: https://gerrit.wikimedia.org/r/c/812023

If that works, rather than merging it to our fork, we will send a PR upstream to DataHub. The reason is that it will be much easier for us to use the Python module if DataHub themselves merge it to master and release a new module version to PyPI.

The benefit is that we will still have a fully ingested, up-to-date view of Superset with working URLs, and it will be easier for us to set up the datahub CLI in the Airflow conda-dist environment.

In the meantime, I will work on the deployment of Superset that will be used for the ingestion.

I tried in vain to get this working. Our environment is very painful; I'll explain a bit below.

  • I have a patch, and the way to build it is with gradle, which I can do locally on my machine or on stat1008.
    • On stat1008, gradlew fails because it launches a java process and I couldn't find a way to hack it into passing through the web proxy, so it can't build.
    • On my local machine, I can install whatever I need and build.
  • Once I build it on my local machine, I would need to access both Superset (running on stat1008) and DataHub (through https://datahub-gms.discovery.wmnet:30443).
    • Superset is fine, I can tunnel and access it.
    • I can't figure out any way to tunnel or get to https://datahub-gms.discovery.wmnet:30443.
      • I'm frustrated and I want to stop.
      • I figure maybe if I can replicate the same environment I have on my machine, I can zip up the already-built venv and try to run it on stat1008?! Kind of silly, but it just feels like a bunch of closed doors. I tried different things for a while here and couldn't make any progress. I can't figure out how to install a different version of Python with conda (I need 3.9), so maybe I should have built it with a different version on my local machine, but I don't know if I'd have to hack the gradle build for that... etc.

This is all just too painful. If we're going to develop in these ecosystems we need simple foolproof ways to do it. I think we spend most of our time fighting with stuff like this instead of getting work done.

On stat1008, gradlew fails because it launches a java process and I couldn't find a way to hack it to pass the web proxy through, so it can't build.

With mvn, I set the usual web proxy env vars, and then pass -Djava.net.useSystemProxies=true. According to https://stackoverflow.com/a/22666646, it looks like this should work for gradle too, and if not, you might be able to set JAVA_OPTS=-Djava.net.useSystemProxies=true as well?

If we're going to develop in these ecosystems we need simple foolproof ways to do it. I think we spend most of our time fighting with stuff like this instead of getting work done.

You're not wrong, but what you are encountering is, I think, Python env issues. Python is nice to program in but painful in many ways when it comes to deploying and standardizing environments. Think of how much time we've put into conda stuff!

k8s/docker would help, but we have the kerbarrier. The pain we often feel is a consequence of the safety and privacy constraints we put on ourselves for philosophical reasons.

I think that there is an easier way to test this patch, given that we don't need the whole toolchain. We just need to patch the python script that we are using to run the test.

In my case the superset ingestion plugin was installed into my conda environment on stat1008 with the command:

pip install 'acryl-datahub[superset]'

The resulting files are in this directory:

~/.conda/envs/2022-04-14T15.32.43_btullis/lib/python3.7/site-packages/datahub/ingestion/source

I'll go into that directory and take a backup of the superset.py file:

btullis@stat1008:~$ cd ~/.conda/envs/2022-04-14T15.32.43_btullis/lib/python3.7/site-packages/datahub/ingestion/source
btullis@stat1008:~/.conda/envs/2022-04-14T15.32.43_btullis/lib/python3.7/site-packages/datahub/ingestion/source$ cp superset.py ~/

Then I can download the patch from Gerrit and apply it directly to the superset.py file:

curl -s https://gerrit.wikimedia.org/r/changes/analytics%2Fdatahub~812023/revisions/1/patch?download|base64 -d|patch superset.py -

Output from the command looks good:

patching file superset.py

I can diff this against the backup to make sure that it looks right.

(2022-04-14T15.32.43_btullis) btullis@stat1008:~/.conda/envs/2022-04-14T15.32.43_btullis/lib/python3.7/site-packages/datahub/ingestion/source$ diff superset.py ~/superset.py
7c7
< from pydantic.class_validators import validator, root_validator
---
> from pydantic.class_validators import validator
62d61
<     base_uri: str = Field(default=None, description="optional URL to use in links (if `connect_uri` is only for ingestion)")
76c75
<     @validator("connect_uri", "base_uri")
---
>     @validator("connect_uri")
80,85d78
<     @root_validator
<     def default_base_uri_to_connect_uri(cls, values):
<         base = values.get('base_uri')
<         if base is None:
<             values.set('base_uri', values.get('connect_uri'))
<
218c211
<         dashboard_url = f"{self.config.base_uri}{dashboard_data.get('url', '')}"
---
>         dashboard_url = f"{self.config.connect_uri}{dashboard_data.get('url', '')}"
290c283
<         chart_url = f"{self.config.base_uri}{chart_data.get('url', '')}"
---
>         chart_url = f"{self.config.connect_uri}{chart_data.get('url', '')}"

Seems legit, so I'm going to try adding the base_uri parameter to my recipe and ingesting again. I wonder if I should delete all previous superset entities first, or try a stateful ingestion?
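
One thing I noticed while reading the diff above: the new root_validator calls values.set(), which isn't a dict method, and it doesn't return the values dict, so I suspect it needs a small tweak. A working default would look roughly like this (a sketch in pydantic v1 style, not necessarily what ends up in a later patch set):

# Sketch of a working base_uri default via a pydantic (v1) root validator.
from typing import Optional
from pydantic import BaseModel, root_validator

class ExampleConfig(BaseModel):
    connect_uri: str = "http://localhost:8088"
    base_uri: Optional[str] = None

    @root_validator
    def default_base_uri_to_connect_uri(cls, values):
        # Fall back to connect_uri when base_uri isn't given explicitly.
        if values.get("base_uri") is None:
            values["base_uri"] = values.get("connect_uri")
        return values  # a root validator must return the values dict

print(ExampleConfig().base_uri)  # -> http://localhost:8088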

I have tried a stateful ingestion but it fails validation:

btullis@stat1008:~/src/datahub/ingestion$ datahub ingest -c superset.yml
[2022-07-15 14:16:23,916] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.38
1 validation error for SupersetConfig
stateful_ingestion
  extra fields not permitted (type=value_error.extra)

The recipe looked like this:

btullis@stat1008:~/src/datahub/ingestion$ cat superset.yml | grep -v password
source:
  type: "superset"
  config:
    connect_uri: "http://localhost:8088"
    base_uri: "https://superset.wikimedia.org"
    username: admin
    provider: db
    stateful_ingestion:
        enabled: True
        remove_stale_metadata: True

sink:
  type: "datahub-rest"
  config:
    server: 'https://datahub-gms.discovery.wmnet:30443'

I will try deleting the superset data and ingesting again.

I deleted the entities associated with Superset:

(2022-04-14T15.32.43_btullis) btullis@stat1008:~/src/datahub/ingestion$ datahub delete --entity_type dashboard --platform superset --hard
This will permanently delete data from DataHub. Do you want to continue? [y/N]: y
[2022-07-15 14:19:58,653] INFO     {datahub.cli.delete_cli:234} - datahub configured with https://datahub-gms.discovery.wmnet:30443
[2022-07-15 14:19:58,756] INFO     {datahub.cli.delete_cli:248} - Filter matched 221 entities. Sample: ['urn:li:dashboard:(superset,32)', 'urn:li:dashboard:(superset,276)', 'urn:li:dashboard:(superset,270)', 'urn:li:dashboard:(superset,115)', 'urn:li:dashboard:(superset,190)']
This will delete 221 entities. Are you sure? [y/N]: y
100% (221 of 221) |#######################################################################################################################################################| Elapsed Time: 0:00:03 Time:  0:00:03
Took 11.786 seconds to hard delete 884 rows for 221 entities
(2022-04-14T15.32.43_btullis) btullis@stat1008:~/src/datahub/ingestion$ datahub delete --entity_type chart --platform superset --hard
This will permanently delete data from DataHub. Do you want to continue? [y/N]: y
[2022-07-15 14:20:17,714] INFO     {datahub.cli.delete_cli:234} - datahub configured with https://datahub-gms.discovery.wmnet:30443
[2022-07-15 14:20:18,062] INFO     {datahub.cli.delete_cli:248} - Filter matched 2001 entities. Sample: ['urn:li:chart:(superset,1206)', 'urn:li:chart:(superset,945)', 'urn:li:chart:(superset,1171)', 'urn:li:chart:(superset,1550)', 'urn:li:chart:(superset,1502)']
This will delete 2001 entities. Are you sure? [y/N]: y
100% (2001 of 2001) |#####################################################################################################################################################| Elapsed Time: 0:00:32 Time:  0:00:32
Took 38.028 seconds to hard delete 8004 rows for 2001 entities

Looking good. After one small modification to the patch (https://gerrit.wikimedia.org/r/c/analytics/datahub/+/812023/2..1) the ingestion worked and the link to Superset is working.

image.png (693×978 px, 58 KB)

I think that this is good enough to call our MVP integration done, but we know that we still want to productionize this and schedule it.

@Milimetric is going to upstream the patch to DataHub. I'll create a follow-up ticket to work on the puppet code to run an instance of Superset specifically for this ingestion.

Change 812023 abandoned by Milimetric:

[analytics/datahub@wmf] Add base_uri config parameter to Superset source

Reason:

upstreamed instead

https://gerrit.wikimedia.org/r/812023