
Evaluate DataHub as a Data Catalog
Closed, Resolved · Public

Description

Conclusions

DataHub is our leading candidate.

Pros
  • Good UX, with many of the features we want, such as data quality reporting and social features
  • Push-based ingestion architecture; Kafka is useful for deployment flexibility (e.g. a public catalog)
  • Most robust experience so far in setting up and running ingestion
Cons
  • Dependency on Confluent Schema Registry; see T299703#7671668
  • Possible tendency to lean towards other Confluent ecosystem pieces
  • Neo4j or Elasticsearch can be the graph backend, but Neo4j Community is limited to a single node, and Elasticsearch may not perform as well as we need for this use case, even with a small graph. Good migration guide here.

Run

  • Start all the services mentioned in T299703#7660150
  • Tunnel with ssh -N -L9002:localhost:9002 stat1008.eqiad.wmnet
  • Browse to http://localhost:9002/ (username and password are both: datahub)

Steps to Reproduce Installation

Detailed below in comments from Ben. DataHub runs in Docker by default, but we don't have Docker on the test cluster, so we'll have to try setting up the individual dependency services ourselves:

  • Elasticsearch (we'll use the latest OpenSearch)
  • MySQL (we already have this)
  • Kafka (we have a test Kafka cluster, so connecting to that may be easier than spinning up a new one)

Ingestion

(see also this thread from the DataHub Slack)

Details

Other Assignee
razzi

Event Timeline

As mentioned in this issue: https://github.com/linkedin/datahub/issues/3504

...Datahub does not officially support a non-docker based installation but I would recommend taking a look at how the docker image is running the datahub containers. https://github.com/linkedin/datahub/blob/master/docker/datahub-gms/Dockerfile You can use gradlew to build the war executable and run it directly in your local environment.

I have started this process by building a war file using Gradle, as demonstrated here:
https://github.com/linkedin/datahub/blob/master/docker/datahub-gms/Dockerfile#L33

./gradlew :metadata-service:war:build -x test

I had to use Java 1.8 in order to build it.
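
For reference, pinning the JDK for the build just means pointing JAVA_HOME at a Java 8 install before invoking gradlew; the path below is an assumption about where openjdk-8 would live on these hosts.

# Assumed openjdk-8 location; adjust to wherever Java 8 is actually installed.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
./gradlew :metadata-service:war:build -x test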

I've copied the resulting war.war file to an-test-coord1001 in case you would like to work with it, @razzi.

The next step would be to set up Jetty with this war, following the other steps in the Dockerfile and the start.sh script.
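
As a rough, untested sketch of what that might look like: the Dockerfile serves the war with jetty-runner, and the environment variable names below are taken from docker/datahub-gms/env/docker.env, with values that are assumptions for this test setup.

# Untested sketch only; variable names from docker/datahub-gms/env/docker.env,
# values are assumptions for the test cluster.
export EBEAN_DATASOURCE_HOST=an-test-coord1001.eqiad.wmnet:3306
export EBEAN_DATASOURCE_USERNAME=datahub
export KAFKA_BOOTSTRAP_SERVER=localhost:9092
export ELASTICSEARCH_HOST=localhost
export ELASTICSEARCH_PORT=9200

# jetty-runner.jar is the standalone runner from the Jetty project (fetched separately);
# it can serve a war on a chosen port:
java -jar jetty-runner.jar --port 8080 war.war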

I have now realised that this build process is considerably more convoluted than I had anticipated, but it is progressing.

I've switched my development host from an-test-coord1001 to stat1008, because one part of the build process requires the python3-env module, which is not installed on an-test-coord1001.

  • Currently I have cloned the datahub source from GitHub
  • Checked out the v0.8.24 tag
  • Executed the gradle build, skipping tests and two problematic tasks.
git clone https://github.com/linkedin/datahub.git && cd datahub
git checkout v0.8.24
./gradlew -Dhttp.proxyHost=webproxy -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy -Dhttps.proxyPort=8080 "-Dhttp.nonProxyHosts=127.0.0.1|localhost|*.wmnet" build -x test -x docs-website:yarnGenerate -x docs-website:yarnBuild

I am currently working to interpret the guidance on the following documentation pages:

...along with the Dockerfile for each relevant service:

We also need to be able to run the required services, as mentioned in the docker-compose.yaml file.

The most problematic element might be the Schema Registry, since it is a required component but is not available under a suitable license: https://docs.confluent.io/platform/current/schema-registry/index.html#license

I have asked the DataHub team about this on Slack, and the most helpful suggestion has been to point me to Karapace, which claims to be a drop-in replacement for Schema Registry and is licensed under Apache 2.0. I will give this a go.

Karapace Setup

Build egg

cd src/datahub
git clone https://github.com/aiven/karapace.git && cd karapace
git checkout 2.0.1
python3 setup.py bdist_egg

Create virtual environment and install

python3 -mvenv venv
source venv/bin/activate
pip install wheel
python setup.py bdist_wheel
easy_install dist/karapace-2.0.1-py3.7.egg
pip install -r requirements-dev.txt

Running

venv/bin/karapace ./karapace.config.json
It is currently exiting with an error, but it might not be too far away.
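
For reference, a minimal karapace.config.json might look something like the following. The key names are from the example config in the Karapace README (2.x series) and should be checked against the tag in use; the values here are assumptions for this test setup, not the config actually in place.

# Illustrative only; key names from the Karapace README example,
# values are assumptions for this test setup.
cat > karapace.config.json <<'EOF'
{
  "bootstrap_uri": "localhost:9092",
  "host": "127.0.0.1",
  "port": 8081,
  "group_id": "karapace-schema-registry",
  "topic_name": "_schemas",
  "replication_factor": 1,
  "security_protocol": "PLAINTEXT",
  "log_level": "INFO"
}
EOF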

As per the previous update (T299703#7648663), I have started using stat1008 for development, but I am doing all work as my own unprivileged user account.
No root. No installing packages. No containers.

I have checked out the datahub source to: /home/btullis/datahub/datahub and I am working on tag 0.8.24
I built the code with:

./gradlew -Dhttp.proxyHost=webproxy -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy -Dhttps.proxyPort=8080 "-Dhttp.nonProxyHosts=127.0.0.1|localhost|*.wmnet" build -x test -x docs-website:yarnGenerate -x docs-website:yarnBuild

The only additional configuration required was to set the following in gradle.properties

# Configure heap
org.gradle.jvmargs=-Xmx8g

I have now succeeded in getting all of the required dependencies running as well.
The following are running as systemd user services:

  • opensearch.service - Running opensearch 1.2.4 - from /home/btullis/src/datahub/opensearch/
  • neo4j.service - Running neo4j 4.0.6 - from /home/btullis/src/datahub/neo4j/
  • kafka-broker.service - Running kafka-broker from confluent-platform 5.4 - from /home/btullis/src/datahub/confluent/
  • zookeeper.service - Running zookeeper from confluent-platform 5.4 - from /home/btullis/src/datahub/confluent/
  • schema-registry.service - Running schema-registry from confluent-platform 5.4 - from /home/btullis/src/datahub/confluent/
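
The unit files themselves aren't shown here, but a user unit along these lines is enough; the ExecStart path below is an assumption based on the directories listed above.

# Hypothetical example of one of the user units; adjust ExecStart to the real install path.
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/opensearch.service <<'EOF'
[Unit]
Description=OpenSearch server

[Service]
ExecStart=/home/btullis/src/datahub/opensearch/bin/opensearch
Restart=on-failure

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now opensearch.service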

I have created a database on an-test-coord1001 called datahub with a user account for access: datahub@%
The password for this mysql user is in /home/btullis/src/datahub/datahub/docker/datahub-gms/start-local.sh
I have loaded the initial database schema from: https://github.com/linkedin/datahub/blob/master/docker/mysql-setup/init.sql
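
For the record, the equivalent manual steps look roughly like this; the connection/admin details are placeholders, the real datahub password lives in start-local.sh, and the CREATE DATABASE may be redundant if init.sql already creates it.

# Sketch of the database setup; <password> is a placeholder.
mysql --host=an-test-coord1001.eqiad.wmnet <<'SQL'
CREATE DATABASE IF NOT EXISTS datahub;
CREATE USER IF NOT EXISTS 'datahub'@'%' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON datahub.* TO 'datahub'@'%';
SQL

# Load the initial schema from the DataHub repo:
mysql --host=an-test-coord1001.eqiad.wmnet datahub < docker/mysql-setup/init.sql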

I have configured the password for Neo4j user neo4j to be datahub - with the command JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 bin/neo4j-admin set-initial-password datahub

The next tasks are:

  • Preconfiguring Kafka - which is normally carried out by the kafka-setup container.
  • Preconfiguring Opensearch - which is normally carried out by the elasticsearch-setup container.
  • Getting the datahub-gms service to run.
  • Getting the datahub-frontend service to run.

The kafka topic names are as follows, from: https://github.com/linkedin/datahub/blob/master/docker/kafka-setup/Dockerfile

ENV METADATA_AUDIT_EVENT_NAME="MetadataAuditEvent_v4"
ENV METADATA_CHANGE_EVENT_NAME="MetadataChangeEvent_v4"
ENV FAILED_METADATA_CHANGE_EVENT_NAME="FailedMetadataChangeEvent_v4"
ENV DATAHUB_USAGE_EVENT_NAME="DataHubUsageEvent_v1"
ENV METADATA_CHANGE_LOG_VERSIONED_TOPIC="MetadataChangeLog_Versioned_v1"
ENV METADATA_CHANGE_LOG_TIMESERIES_TOPIC="MetadataChangeLog_Timeseries_v1"
ENV METADATA_CHANGE_PROPOSAL_TOPIC="MetadataChangeProposal_v1"
ENV FAILED_METADATA_CHANGE_PROPOSAL_TOPIC="FailedMetadataChangeProposal_v1"

Creating those topics as per the steps in: https://github.com/linkedin/datahub/blob/master/docker/kafka-setup/kafka-setup.sh

bin/kafka-topics --create --if-not-exists --zookeeper localhost --partitions 1 --replication-factor 1 --topic MetadataAuditEvent_v4
bin/kafka-topics --create --if-not-exists --zookeeper localhost --partitions 1 --replication-factor 1 --topic MetadataChangeEvent_v4
bin/kafka-topics --create --if-not-exists --zookeeper localhost --partitions 1 --replication-factor 1 --topic FailedMetadataChangeEvent_v4
bin/kafka-topics --create --if-not-exists --zookeeper localhost --partitions 1 --replication-factor 1 --topic MetadataChangeLog_Versioned_v1
bin/kafka-topics --create --if-not-exists --zookeeper localhost --partitions 1 --replication-factor 1 --config retention.ms=7776000000 --topic MetadataChangeLog_Timeseries_v1
bin/kafka-topics --create --if-not-exists --zookeeper localhost --partitions 1 --replication-factor 1 --topic MetadataChangeProposal_v1
bin/kafka-topics --create --if-not-exists --zookeeper localhost --partitions 1 --replication-factor 1 --topic FailedMetadataChangeProposal_v1
bin/kafka-topics --create --if-not-exists --zookeeper localhost --partitions 1 --replication-factor 1 --topic DataHubUsageEvent_v1

I can verify that they have all been created with the following:

btullis@stat1008:~/src/datahub/confluent$ kafkacat -L -b localhost:9092
Metadata for all topics (from broker -1: localhost:9092/bootstrap):
 1 brokers:
  broker 0 at stat1008.eqiad.wmnet:9092 (controller)
 11 topics:
  topic "MetadataChangeLog_Timeseries_v1" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "MetadataAuditEvent_v4" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "MetadataChangeEvent_v4" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "_schemas" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "MetadataChangeProposal_v1" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "FailedMetadataChangeEvent_v4" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "FailedMetadataChangeProposal_v1" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "MetadataChangeLog_Versioned_v1" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "DataHubUsageEvent_v1" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "__confluent.support.metrics" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "MetadataAuditEvent" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0

I have pre-configured OpenSearch, but ran into one small issue that I hope will not cause a problem at this stage.

The setup script has two parts:

  • Install an ILM (index lifecycle management) policy.
  • Install an index template

The command to install the ILM policy fails because the ILM feature is not available in OpenSearch.

btullis@stat1008:~/src/datahub/datahub/metadata-service/restli-servlet-impl/src/main/resources/index/usage-event$ curl -XPUT http://localhost:9200/_ilm/policy/datahub_usage_event_policy -H 'Content-Type: application/json' --data @policy.json 
{"error":{"root_cause":[{"type":"invalid_index_name_exception","reason":"Invalid index name [_ilm], must not start with '_', '-', or '+'","index":"_ilm","index_uuid":"_na_"}],"type":"invalid_index_name_exception","reason":"Invalid index name [_ilm], must not start with '_', '-', or '+'","index":"_ilm","index_uuid":"_na_"},"status":400}

It's possible that this would work with the ISM feature of OpenSearch, but I thought it easier to skip this requirement for the moment.
The policy simply rolls the indices over after 7 days and keeps them for at least 60 days.

btullis@stat1008:~/src/datahub/datahub/metadata-service/restli-servlet-impl/src/main/resources/index/usage-event$ cat policy.json | jq .
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "7d"
          }
        }
      },
      "delete": {
        "min_age": "60d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
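
If we later decide we do want rollover and retention, a policy of roughly the same shape could be installed via OpenSearch's _plugins/_ism API. The following is an untested sketch that mirrors the ILM policy above.

# Untested sketch of an equivalent ISM policy (OpenSearch 1.x).
curl -XPUT http://localhost:9200/_plugins/_ism/policies/datahub_usage_event_policy \
  -H 'Content-Type: application/json' --data '{
  "policy": {
    "description": "Roll over after 7 days, delete after 60 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [ { "rollover": { "min_index_age": "7d" } } ],
        "transitions": [ { "state_name": "delete", "conditions": { "min_index_age": "60d" } } ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ]
  }
}'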

Installing the index template required a small modification of the JSON file, because it referred to the non-existent ILM policy. This was the original template:

btullis@stat1008:~/src/datahub/datahub/metadata-service/restli-servlet-impl/src/main/resources/index/usage-event$ jq . index_template.json 
{
  "index_patterns": [
    "*PREFIXdatahub_usage_event*"
  ],
  "data_stream": {},
  "priority": 500,
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "type": {
          "type": "keyword"
        },
        "timestamp": {
          "type": "date"
        },
        "userAgent": {
          "type": "keyword"
        },
        "browserId": {
          "type": "keyword"
        }
      }
    },
    "settings": {
      "index.lifecycle.name": "PREFIXdatahub_usage_event_policy"
    }
  }
}

I simply removed the settings component and uploaded it.

curl -XPUT http://localhost:9200/_index_template/datahub_usage_event_index_template -H 'Content-Type: application/json' --data @index_template_local.json
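
A quick sanity check is to read the template back:

curl -s http://localhost:9200/_index_template/datahub_usage_event_index_template | jq .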

The datahub-gms service is up and running.

I've now got all services up and running, so DataHub is accessible on port 9002 of stat1008 via SSH tunnelling:
ssh -N -L9002:localhost:9002 stat1008.eqiad.wmnet
followed by http://localhost:9002/
username and password are both: datahub

They are all running as systemd user services.

btullis@stat1008:~/src/datahub$ systemctl --user -t service --state=running --no-legend --no-pager
datahub-frontend.service         loaded active running DataHub Frontend Server        
datahub-gms.service              loaded active running DataHub Metadata Server        
datahub-mae-consumer-job.service loaded active running DataHub MAE Consumer Job Server
datahub-mce-consumer-job.service loaded active running DataHub MCE Consumer Job Server
kafka-broker.service             loaded active running Kafka Broker server            
neo4j.service                    loaded active running Neo4J server                   
opensearch.service               loaded active running OpenSearch server              
schema-registry.service          loaded active running Schema Registry server         
zookeeper.service                loaded active running Zookeeper server

(screenshot attached: image.png)

I am now moving on to ingestion.

To begin with, I have installed the datahub client in a conda environment with:

source conda-activate-stacked
pip install 'acryl-datahub[datahub-rest]'

This gives me the command-line client.

btullis@stat1008:~/src/datahub$ datahub
Usage: datahub [OPTIONS] COMMAND [ARGS]...

Options:
  --debug / --no-debug
  --version             Show the version and exit.
  --help                Show this message and exit.

Commands:
  check      Helper commands for checking various aspects of DataHub.
  delete     Delete metadata from datahub using a single urn or a combination of filters
  docker     Helper commands for setting up and interacting with a local DataHub instance using Docker.
  get        Get metadata for an entity with an optional list of aspects to project
  ingest     Ingest metadata into DataHub.
  init       Configure which datahub instance to connect to
  put        Update a single aspect of an entity
  telemetry  Toggle telemetry.
  version    Print version number and exit.

I have installed the hive connector and run the first ingestion.

First I had to install pyhive in my conda environment by following the guidelines here:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Anaconda#Installing_packages_that_need_compilation

conda install -c conda-forge cyrus-sasl
export CPPFLAGS="${CPPFLAGS} -isystem ${CONDA_PREFIX}/include"
pip install pyhive[hive]

Then I could install the datahub hive connector by following the guidelines here: https://datahubproject.io/docs/metadata-ingestion/source_docs/hive#setup

pip install 'acryl-datahub[hive]'

Then I made a recipe file for Hive containing this:

source:
  type: "hive"
  config:
    host_port: analytics-hive.eqiad.wmnet:10000
    options:
      connect_args:
        auth: 'KERBEROS'
        kerberos_service_name: hive
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'

I executed the ingestion like this:

datahub ingest -c hive.yml

It's still importing, but it's currently showing 502 datasets and counting.

(screenshot attached: image.png)

I have:

  • installed the kafka plugin with: pip install 'acryl-datahub[kafka]'
  • created a basic recipe for importing topics from the kafka-jumbo cluster:
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "kafka-jumbo1001.eqiad.wmnet:9092"
      schema_registry_url: http://localhost:8081

sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
  • Ran an import with: datahub ingest -c kafka.yml

I have also ingested from both the analytics and public Druid clusters.

btullis@stat1008:~/src/datahub/ingestion$ cat druid.yml an-druid.yml 
source:
  type: "druid"
  config:
    host_port: "druid1004.eqiad.wmnet:8082"

sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
source:
  type: "druid"
  config:
    host_port: "an-druid1001.eqiad.wmnet:8082"

sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'

(screenshot attached: image.png)

@BTullis has gotten this working nicely; the only thing that isn't going to translate directly to production, as I understand it, is that we're currently using the Confluent Schema Registry. We have two paths forward: wait for the requirement to be dropped entirely via https://feature-requests.datahubproject.io/b/Developer-Experience/p/remove-required-dependency-on-confluent-schema-registry, or use Karapace.

BTullis triaged this task as High priority.
Milimetric renamed this task from "Run Datahub on test cluster" to "Evaluate DataHub as a Data Catalog". Feb 9 2022, 9:21 PM
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "kafka-jumbo1001.eqiad.wmnet:9092"
      schema_registry_url: http://localhost:8081

sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'

For the archives: this turned out to be useful in source.config above, to keep topics with names like -l and -L from being ingested and messing things up:

topic_patterns:
  deny:
    - "^-.*"