Done is:
- Airflow scheduled job updates Superset lineage in DataHub (previously ingested as a one-off)
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T369756 [EPIC] Datahub Improvements |
| Resolved | | brouberol | T309622 Create Airflow Pipeline for Ingesting/Updating Superset data into DataHub |
| Resolved | | BTullis | T316336 Upgrade DataHub to v0.8.43 |
Change 811986 had a related patch set uploaded (by Ottomata; author: Ottomata):
[operations/puppet@production] Add support for airflow filesystem backend variables
For context: we're pausing this until https://github.com/datahub-project/datahub/pull/5408 gets merged and deployed by upstream.
Dear Ben, they released our patch! Could you please upgrade to https://github.com/datahub-project/datahub/releases/tag/v0.8.42 or https://github.com/datahub-project/datahub/releases/tag/v0.8.43? Then I can help set up the ingestion.
Yes, it would be great to get this done.
There is still one part of this task that is on the more difficult side.
The main problem we had was that we couldn't interact with https://superset.wikimedia.org itself, because of the CAS/SSO authentication and the Apache reverse proxy configuration.
I ended up getting around this by using a local instance of superset, running in a conda environment on a stat server, but crucially connecting to the production instance's database.
That approach is documented here: T306903#7959985
That's the reason that @Milimetric had to write the patch to the datahub ingester for superset, which allowed us to query superset via localhost but substitute the URL with superset.wikimedia.org.
So that is where it was left. We can run another manual ingestion of Superset data at any time, but making it an automated pipeline requires working out how, when, and where to run this additional instance.
To make matters a little more complicated, we have also decided to do T347710: Migrate the Analytics Superset instances to our DSE Kubernetes cluster this quarter, in which we plan to migrate superset (and superset-next) from an-tool1010 (and an-tool1005) to the dse-k8s cluster.
On the plus side, this would be a great way to run a dedicated instance of superset that bypasses CAS/SSO and is used for metadata updating.
On the downside, this would potentially make it more difficult for the existing airflow-scheduler running on an-launcher1002 to launch a datahub CLI process to talk to it.
@BTullis @Gehel - could I ask for this to be prioritized, please? Or let me know if this is something someone from DE could do. We discussed this in the DPE sync, and it seems this might be possible now that Superset is migrated to k8s.
It’s not urgent but end of calendar year would be good (lower priority than MP).
This would be in support of our APP KR SDS1.3 around lineage:
cc - @Ahoelzl
One additional request: can we please update the description of this task so it's clear what is being done here?
IIUC, isn't Superset already integrated with Datahub? I can see the list of charts (but not dashboards) created on Superset. What will this pipeline help to do, exactly?
@Mayakp.wiki - updated the description; the previous work was done as a one-off load. We will now look at automating this load on a more regular basis.
Ohh ok. got it. thanks @lbowmaker
Will this enable us to get dashboard info as well, more regularly?
@Mayakp.wiki I think so. I chatted to Ben briefly about this, and we'll probably need to play around with the job when we pick this up, as lots of things have changed, but ideally we would want charts and dashboards ingested with table lineage. Something like this (but hopefully with the tables used to create the charts):
Chiming in on this after a good year of inactivity. I've run a quick test that, I believe, should allow us to deliver on this quite quickly. The crux of the difficulty here is getting Datahub to use the Superset API, since Superset login relies on OAuth and CAS.
Superset supports two types of user identity providers for logging into its API: db and ldap (source).
I simply created a user with the Alpha role in the Superset-next DB and was able to log into the Superset API with it:
runuser@superset-staging-6cdf896c85-vbw6x:/app$ superset fab create-user \
--role Alpha \
--username testapi \
--firstname testapi \
--lastname testapi \
--email superset-next.testapi@wikimedia.org \
--password [REDACTED]
Loaded your LOCAL configuration at [/etc/superset/superset_config.py]
logging was configured successfully
2025-08-14 07:50:14,065:INFO:superset.utils.logging_configurator:logging was configured successfully
2025-08-14 07:50:14,072:INFO:root:Configured event logger of type <class 'superset.utils.log.DBEventLogger'>
User testapi created.
runuser@superset-staging-6cdf896c85-vbw6x:/app$ python3
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, requests
>>> os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
>>> requests.post("https://superset-next.discovery.wmnet:30443/api/v1/security/login", json={"provider": "db", "refresh": True, "username": "testapi", "password": "[REDACTED]"})
<Response [200]>
So in theory, datahub could use that identity to ingest data from Superset.
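For illustration, here is a minimal sketch of what that ingestion could look like programmatically, using the datahub library's Pipeline API. The hostnames and the testapi credentials are the ones from the test above; the superset source config keys (connect_uri, display_uri, provider) reflect my understanding of the source's options, so treat this as a sketch rather than a final recipe:

# Hedged sketch: a programmatic Superset -> DataHub ingestion pipeline,
# equivalent to a YAML recipe run via `datahub ingest run -c <recipe>`.
# The source config keys below are assumptions based on the superset source docs.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "superset",
            "config": {
                # Talk to the internal endpoint, bypassing CAS/SSO...
                "connect_uri": "https://superset-next.discovery.wmnet:30443",
                # ...but emit user-facing URLs (what @Milimetric's patch made possible).
                "display_uri": "https://superset.wikimedia.org",
                "provider": "db",
                "username": "testapi",
                "password": "[REDACTED]",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "https://datahub-gms.discovery.wmnet:30443"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()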
Change #1178828 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow: make it possible to inject datahub ingestion config files in secrets
Change #1178829 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow-main: define a custom superset datahub ingestion configuration
I've run a Superset -> Datahub import DAG in an airflow devenv, and it works... until it fails 😄
We seem to be hitting a bug in the datahub CLI itself:
airflow@airflow-dev-brouberol-hadoop-shell-5b9685b7ff-twm8l:/opt/airflow$ yarn logs -appOwner brouberol -applicationId application_1754906949114_86953
25/08/14 13:38:37 INFO ZlibFactory: Successfully loaded & initialized native-zlib library
25/08/14 13:38:37 INFO CodecPool: Got brand-new decompressor [.deflate]
Container: container_e139_1754906949114_86953_01_000001 on an-worker1158.eqiad.wmnet_8041_1755178500371
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:container-localizer-syslog
LogLastModifiedTime:Thu Aug 14 13:35:00 +0000 2025
LogLength:184
LogContents:
2025-08-14 13:34:13,910 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
End of LogType:container-localizer-syslog
*******************************************************************************************
Container: container_e139_1754906949114_86953_01_000001 on an-worker1158.eqiad.wmnet_8041_1755178500371
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:application.driver.log
LogLastModifiedTime:Thu Aug 14 13:35:00 +0000 2025
LogLength:11464
LogContents:
[2025-08-14 13:34:46,191] INFO {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.4
[2025-08-14 13:34:46,299] INFO {datahub.ingestion.run.pipeline:210} - Sink configured successfully. DataHubRestEmitter: configured to talk to https://datahub-gms.discovery.wmnet:30443
[2025-08-14 13:34:46,921] INFO {datahub.ingestion.run.pipeline:227} - Source configured successfully.
[2025-08-14 13:34:46,922] INFO {datahub.cli.ingest_cli:129} - Starting metadata ingestion
[2025-08-14 13:34:48,693] INFO {datahub.cli.ingest_cli:135} - Source (superset) report:
{'aspects': {'dashboard': {'dashboardInfo': 147, 'status': 147}},
'entities': {'dashboard': ['urn:li:dashboard:(superset,562)',
'urn:li:dashboard:(superset,501)',
'urn:li:dashboard:(superset,230)',
'urn:li:dashboard:(superset,166)',
'urn:li:dashboard:(superset,344)',
'urn:li:dashboard:(superset,232)',
'urn:li:dashboard:(superset,219)',
'urn:li:dashboard:(superset,165)',
'urn:li:dashboard:(superset,70)',
'urn:li:dashboard:(superset,49)',
'... sampled of 147 total elements']},
'events_produced': 147,
'events_produced_per_sec': 68,
'failures': {},
'running_time': '2.14 seconds',
'soft_deleted_stale_entities': [],
'start_time': '2025-08-14 13:34:46.558125 (2.14 seconds ago)',
'warnings': {}}
[2025-08-14 13:34:48,694] INFO {datahub.cli.ingest_cli:138} - Sink (datahub-rest) report:
{'current_time': '2025-08-14 13:34:48.693863 (now)',
'failures': [],
'gms_version': '',
'pending_requests': 0,
'records_written_per_second': 60,
'start_time': '2025-08-14 13:34:46.277835 (2.42 seconds ago)',
'total_duration_in_seconds': 2.42,
'total_records_written': 147,
'warnings': []}
[2025-08-14 13:34:58,706] ERROR {datahub.entrypoints:199} - Command failed: 'NoneType' object has no attribute 'startswith'
Traceback (most recent call last):
File "/var/lib/hadoop/data/a/yarn/local/usercache/brouberol/appcache/application_1754906949114_86953/container_e139_1754906949114_86953_01_000001/environment/lib/python3.7/site-packages/datahub/entrypoints.py", line 186, in main
sys.exit(datahub(standalone_mode=False, **kwargs))
File "/var/lib/hadoop/data/a/yarn/local/usercache/brouberol/appcache/application_1754906949114_86953/container_e139_1754906949114_86953_01_000001/environment/lib/
...
File "/var/lib/hadoop/data/a/yarn/local/usercache/brouberol/appcache/application_1754906949114_86953/container_e139_1754906949114_86953_01_000001/environment/lib/python3.7/site-packages/datahub/ingestion/source/sql/sql_common.py", line 107, in <lambda>
return platform, lambda x: x.startswith(
AttributeError: 'NoneType' object has no attribute 'startswith'
End of LogType:application.driver.log
***************************************************************************************
Container: container_e139_1754906949114_86953_01_000001 on an-worker1158.eqiad.wmnet_8041_1755178500371
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:application.master.log
LogLastModifiedTime:Thu Aug 14 13:35:00 +0000 2025
LogLength:1794
LogContents:
25/08/14 13:34:24 INFO skein.ApplicationMaster: Starting Skein version 0.8.2
25/08/14 13:34:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/14 13:34:24 INFO skein.ApplicationMaster: Running as user brouberol@WIKIMEDIA
25/08/14 13:34:24 INFO conf.Configuration: resource-types.xml not found
25/08/14 13:34:24 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
25/08/14 13:34:24 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
25/08/14 13:34:24 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
25/08/14 13:34:24 INFO skein.ApplicationMaster: Application specification successfully loaded
25/08/14 13:34:25 INFO skein.ApplicationMaster: gRPC server started at an-worker1158.eqiad.wmnet:33885
25/08/14 13:34:25 INFO skein.ApplicationMaster: WebUI server started at an-worker1158.eqiad.wmnet:34817
25/08/14 13:34:25 INFO skein.ApplicationMaster: Registering application with resource manager
25/08/14 13:34:25 INFO skein.ApplicationMaster: Starting application driver
25/08/14 13:34:58 INFO skein.ApplicationMaster: Shutting down: Application driver failed with exit code 1, see logs for more information.
25/08/14 13:34:58 INFO skein.ApplicationMaster: Unregistering application with status FAILED
25/08/14 13:34:58 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
25/08/14 13:34:58 INFO skein.ApplicationMaster: Deleted application directory hdfs://analytics-hadoop/user/brouberol/.skein/application_1754906949114_86953
25/08/14 13:34:58 INFO skein.ApplicationMaster: WebUI server shut down
25/08/14 13:34:58 INFO skein.ApplicationMaster: gRPC server shut down
End of LogType:application.master.log
***************************************************************************************
End of LogType:prelaunch.err
******************************************************************************
Container: container_e139_1754906949114_86953_01_000001 on an-worker1158.eqiad.wmnet_8041_1755178500371
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Thu Aug 14 13:35:00 +0000 2025
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container
End of LogType:prelaunch.out
******************************************************************************

I'd like to see how we could rely on a datahub CLI matching the current version of datahub we're running (0.13.3).
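For reference, the failing line in the traceback above builds a prefix-matching closure that is later called with None. A minimal reproduction of that failure mode (the function and variable names here are hypothetical, not datahub's actual internals):

def platform_matcher(platform, prefix):
    # sql_common.py builds a closure like this one (line 107 in the traceback)
    return platform, lambda x: x.startswith(prefix)

platform, matches = platform_matcher("superset", "some-prefix")
matches(None)  # AttributeError: 'NoneType' object has no attribute 'startswith'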
Huh, we _already_ have the right version of the library/CLI in the airflow image itself. The whole thing could be refactored into BashOperator tasks, I think (see the sketch after the session below).
airflow@airflow-dev-brouberol-hadoop-shell-5b9685b7ff-twm8l:/opt/airflow$ pip3 list | grep datahub
acryl-datahub                 0.13.3
acryl-datahub-airflow-plugin  0.13.3
airflow@airflow-dev-brouberol-hadoop-shell-5b9685b7ff-twm8l:/opt/airflow$ which datahub
/tmp/pyenv/shims/datahub
airflow@airflow-dev-brouberol-hadoop-shell-5b9685b7ff-twm8l:/opt/airflow$ datahub ingest --help
Usage: datahub ingest [OPTIONS] COMMAND [ARGS]...

  Ingest metadata into DataHub.

Options:
  --help  Show this message and exit.

Commands:
  run*       Ingest metadata into DataHub.
  deploy     Deploy an ingestion recipe to your DataHub instance.
  list-runs  List recent ingestion runs to datahub
  mcps       Ingest metadata from a mcp json file or directory of files.
  rollback   Rollback a provided ingestion run to datahub
  show       Describe a provided ingestion run to datahub
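A rough sketch of what such a BashOperator task could look like (the DAG id, schedule, and retry policy are hypothetical; the recipe path matches the one used in the dry run below):

# Hypothetical sketch of the BashOperator-based ingestion task.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="datahub_ingest_superset",  # hypothetical DAG id
    start_date=datetime(2025, 8, 1),
    schedule="@daily",  # assumed cadence
    catchup=False,
) as dag:
    BashOperator(
        task_id="ingest_superset_metadata",
        # Shell out to the datahub CLI already present in the airflow image
        bash_command="datahub ingest run -c /opt/airflow/secrets/datahub_ingest_superset.yaml",
        retries=2,
        retry_delay=timedelta(minutes=10),
    )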
And both services are reachable:
>>> import os
>>> import requests
>>> os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
>>> requests.get("https://superset.discovery.wmnet:30443", timeout=1)
<Response [200]>
>>> requests.get("https://datahub-gms.discovery.wmnet:30443", timeout=1)
<Response [404]>
That seems to be working!
airflow@airflow-dev-brouberol-scheduler-5b589b7cd8-rz6pp:/opt/airflow$ datahub ingest run -c /opt/airflow/secrets/datahub_ingest_superset.yaml --dry-run
[2025-08-14 14:46:06,315] INFO {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.3
[2025-08-14 14:46:06,499] INFO {datahub.ingestion.run.pipeline:257} - Sink configured successfully. DataHubRestEmitter: configured to talk to https://datahub-gms.discovery.wmnet:30443
[2025-08-14 14:46:08,400] INFO {datahub.ingestion.run.pipeline:281} - Source configured successfully.
[2025-08-14 14:46:08,401] INFO {datahub.cli.ingest_cli:128} - Starting metadata ingestion
/
Cli report:
{'cli_version': '0.13.3',
'cli_entry_location': '/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/datahub/__init__.py',
'models_version': 'bundled',
'py_version': '3.10.15 (main, Jun 11 2025, 12:33:45) [GCC 10.2.1 20210110]',
'py_exec_path': '/tmp/pyenv/versions/3.10.15/bin/python3.10',
'os_details': 'Linux-6.1.0-37-amd64-x86_64-with-glibc2.31',
'mem_info': '95.32 MB',
'peak_memory_usage': '95.32 MB',
'disk_info': {'total': '216.1 GB', 'used': '22.56 GB', 'used_initally': '22.56 GB', 'free': '182.49 GB'},
'peak_disk_usage': '22.56 GB',
'thread_count': 3,
'peak_thread_count': 3}
Source (superset) report:
{'events_produced': 362,
'events_produced_per_sec': 39,
'entities': {'dashboard': ['urn:li:dashboard:(superset,618)',
'urn:li:dashboard:(superset,562)',
'urn:li:dashboard:(superset,552)',
'urn:li:dashboard:(superset,334)',
'urn:li:dashboard:(superset,526)',
'urn:li:dashboard:(superset,495)',
'urn:li:dashboard:(superset,344)',
'urn:li:dashboard:(superset,161)',
'urn:li:dashboard:(superset,32)',
'urn:li:dashboard:(superset,22)',
'... sampled of 148 total elements'],
'chart': ['urn:li:chart:(superset,3358)',
'urn:li:chart:(superset,407)',
'urn:li:chart:(superset,231)',
'urn:li:chart:(superset,636)',
'urn:li:chart:(superset,601)',
'urn:li:chart:(superset,577)',
'urn:li:chart:(superset,586)',
'urn:li:chart:(superset,541)',
'urn:li:chart:(superset,3777)',
'urn:li:chart:(superset,3786)',
'... sampled of 214 total elements']},
'aspects': {'dashboard': {'status': 148, 'dashboardInfo': 148}, 'chart': {'status': 214, 'chartInfo': 214}},
'aspect_urn_samples': {'dashboard': {'status': ['urn:li:dashboard:(superset,601)',
'urn:li:dashboard:(superset,450)',
'urn:li:dashboard:(superset,369)',
'urn:li:dashboard:(superset,408)',
'urn:li:dashboard:(superset,232)',
'urn:li:dashboard:(superset,161)',
'urn:li:dashboard:(superset,138)',
'urn:li:dashboard:(superset,68)',
'urn:li:dashboard:(superset,16)',
'urn:li:dashboard:(superset,11)',
'... sampled of 148 total elements'],
'dashboardInfo': ['urn:li:dashboard:(superset,568)',
'urn:li:dashboard:(superset,318)',
'urn:li:dashboard:(superset,501)',
'urn:li:dashboard:(superset,373)',
'urn:li:dashboard:(superset,391)',
'urn:li:dashboard:(superset,75)',
'urn:li:dashboard:(superset,344)',
'urn:li:dashboard:(superset,345)',
'urn:li:dashboard:(superset,217)',
'urn:li:dashboard:(superset,22)',
'... sampled of 148 total elements']},
'chart': {'status': ['urn:li:chart:(superset,3933)',
'urn:li:chart:(superset,3901)',
'urn:li:chart:(superset,3916)',
'urn:li:chart:(superset,230)',
'urn:li:chart:(superset,491)',
'urn:li:chart:(superset,501)',
'urn:li:chart:(superset,547)',
'urn:li:chart:(superset,3662)',
'urn:li:chart:(superset,3878)',
'urn:li:chart:(superset,3864)',
'... sampled of 214 total elements'],
'chartInfo': ['urn:li:chart:(superset,66)',
'urn:li:chart:(superset,82)',
'urn:li:chart:(superset,397)',
'urn:li:chart:(superset,677)',
'urn:li:chart:(superset,555)',
'urn:li:chart:(superset,543)',
'urn:li:chart:(superset,3778)',
'urn:li:chart:(superset,3786)',
'urn:li:chart:(superset,3753)',
'urn:li:chart:(superset,3756)',
'... sampled of 214 total elements']}},
'warnings': {},
'failures': {},
'soft_deleted_stale_entities': [],
'last_state_non_deletable_entities': [],
'start_time': '2025-08-14 14:46:07.913375 (9.12 seconds ago)',
'running_time': '9.12 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 2,
'records_written_per_second': 0,
'warnings': [],
'failures': [],
'start_time': '2025-08-14 14:46:06.331379 (10.7 seconds ago)',
'current_time': '2025-08-14 14:46:17.034133 (now)',
'total_duration_in_seconds': 10.7,
'max_threads': 15,
'gms_version': 'v0.13.3',
'pending_requests': 0,
'main_thread_blocking_timer': '0.005 seconds'}
⏳ Pipeline running successfully so far; produced 2 events in 9.12 seconds.
I've been able to run the task in a devenv, which inserted a large number of MAEs (Metadata Audit Events) into Kafka, which datahub is currently re-processing.
For example, https://datahub.wikimedia.org/dashboard/urn:li:dashboard:(superset,148)/Charts?is_lineage_mode=false represents a Superset dashboard created <3 months ago, and we have a correct link to https://superset.wikimedia.org/superset/dashboard/148/, thanks to @Milimetric's patch.
Having redeployed datahub with
global:
  managed_ingestion:
    enabled: true
we can now see CLI ingestion runs in the UI:
Change #811986 abandoned by Ottomata:
[operations/puppet@production] Add support for airflow filesystem backend variables
Reason:
not needed in k8s
Change #1180077 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] Enable visibility of ingestion runs in the datahub UI
Change #1178828 merged by jenkins-bot:
[operations/deployment-charts@master] airflow: make it possible to inject custom files in secrets
Change #1178829 merged by jenkins-bot:
[operations/deployment-charts@master] airflow-main: define a custom superset datahub ingestion configuration
Change #1180077 merged by jenkins-bot:
[operations/deployment-charts@master] Enable visibility of ingestion runs in the datahub UI
brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1605
main: ingest Superset data into datahub
@brouberol or others, can you please show us how this would look on Superset?
Please ignore this comment; I was able to see the visual lineage on datahub. Thanks for this!