
Evaluate OpenMetadata as a Data Catalog
Closed, Resolved, Public

Description

Conclusions

OpenMetadata is a relatively new offering from former members of Uber's data team. It emphasizes a concise feature set and good UX, along with connectors for more modern data stacks.

Pros
  • Incredibly responsive community, on Slack and GitHub
  • Fast and simple UI
  • Very easy to install
  • Built on good standards like JSON Schema, whereas DataHub uses Pegasus
Cons
  • Hive connector is in a very early alpha stage
  • Requires MySQL 8+, which is hard for us to set up here
  • The UI allows too much manual metadata wrangling

Run

To run this, make sure the three services below are running, then open an SSH tunnel and visit http://localhost:8585

# everything was set up under user milimetric, but should be easily copy-able
ssh an-test-client1001.eqiad.wmnet
systemctl --user start mysql8
systemctl --user start opensearch
systemctl --user start openmetadata

ssh -N an-test-client1001.eqiad.wmnet -L 8585:127.0.0.1:8585
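Before opening the browser, it can help to confirm that the tunnel is actually forwarding. The following is a minimal sketch, not part of the original setup: the helper name is_port_open is hypothetical, and it only checks that something accepts TCP connections on the local end of the tunnel (port 8585 from the ssh command above), not that OpenMetadata itself is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8585 is the local end of the SSH tunnel opened above.
    if is_port_open("127.0.0.1", 8585):
        print("Something is listening on 8585; try http://localhost:8585")
    else:
        print("Port 8585 is closed: check the tunnel and the three services")
```

If the port is closed, the usual suspects are a tunnel that was never started or one of the three systemd user services not running on an-test-client1001.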

Steps to Reproduce Installation

  • MySQL 8+ (the versions of MariaDB that we support are not compatible)
    • find an archive like mysql-server_8.0.28-1debian10_amd64.deb-bundle.tar
    • untar the bundle, then extract the .deb packages inside it into a directory with dpkg-deb -x
    • create a database named "openmetadata" and a user with access to it
    • create a systemd unit, could look like:
[Unit]
Description=MySQL server, version 8
[Service]
Type=simple
Environment=LD_LIBRARY_PATH=/srv/data-catalog-tmp/mysql-server-chroot/usr/lib/x86_64-linux-gnu
ExecStart=/srv/data-catalog-tmp/mysql-server-chroot/usr/sbin/mysqld --defaults-file=/srv/data-catalog-tmp/mysql-server-chroot/etc/mysql/mysql.cnf
[Install]
WantedBy=multi-user.target
  • OpenSearch (to satisfy the Elasticsearch requirement)
    • download and extract
    • in ./config/opensearch.yml add plugins.security.disabled: true
    • create a systemd unit, ours looks like:
[Unit]
Description=OpenSearch server, running with security plugin disabled (pw admin:admin)
[Service]
Type=simple
Environment=LD_LIBRARY_PATH=/srv/data-catalog-tmp/mysql-server-chroot/usr/lib/x86_64-linux-gnu
ExecStart=/home/milimetric/opensearch-1.2.4/bin/opensearch
[Install]
WantedBy=multi-user.target
  • download and install OpenMetadata
    • use Java 11: export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    • with MySQL running and the database and user created, configure OpenMetadata to point to it
    • run ./bootstrap/bootstrap_storage.sh migrate to create tables
    • point to OpenSearch from the OpenMetadata config (I never dug into this)
    • systemd unit:
[Unit]
Description=OpenMetadata service
[Service]
Type=forking
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ExecStart=/home/milimetric/openmetadata-0.8.0/bin/openmetadata.sh start
ExecStop=/home/milimetric/openmetadata-0.8.0/bin/openmetadata.sh stop
[Install]
WantedBy=multi-user.target
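The three units above share one shape, so they could be generated from a small template rather than copied by hand. This is only a sketch under that assumption: render_unit is a hypothetical helper, not something from OpenMetadata or systemd, and it reproduces just the keys used in the units above. Note that the services are started with systemctl --user; user managers normally install units under default.target, so WantedBy only matters if you also enable the units.

```python
from typing import Dict, Optional


def render_unit(
    description: str,
    exec_start: str,
    service_type: str = "simple",
    environment: Optional[Dict[str, str]] = None,
    exec_stop: Optional[str] = None,
    wanted_by: str = "multi-user.target",  # matches the units above
) -> str:
    """Render a minimal systemd unit with the same keys as the units above."""
    lines = [
        "[Unit]",
        f"Description={description}",
        "[Service]",
        f"Type={service_type}",
    ]
    # One Environment= line per variable, in insertion order.
    for key, value in (environment or {}).items():
        lines.append(f"Environment={key}={value}")
    lines.append(f"ExecStart={exec_start}")
    if exec_stop:
        lines.append(f"ExecStop={exec_stop}")
    lines += ["[Install]", f"WantedBy={wanted_by}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the OpenMetadata unit shown above.
    base = "/home/milimetric/openmetadata-0.8.0/bin/openmetadata.sh"
    print(render_unit(
        "OpenMetadata service",
        f"{base} start",
        service_type="forking",
        environment={"JAVA_HOME": "/usr/lib/jvm/java-11-openjdk-amd64"},
        exec_stop=f"{base} stop",
    ))
```

Copying the setup to another user then means changing only the paths passed in, which is the "easily copy-able" part mentioned in the Run section.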

Metadata Ingestion

conda create -n airflow_py_38 python=3.8
conda activate airflow_py_38
export https_proxy=http://webproxy.eqiad.wmnet:8080
pip install wheel
pip install hmsclient
pip install 'apache-airflow[hdfs,hive,kerberos]'
pip install flask-admin==1.4.0
pip install pyarrow

pip install 'openmetadata-ingestion[hive,data-profiler]'
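After the installs, a quick check that the packages are importable in the active conda environment can save a confusing failure later. A minimal sketch: missing_modules is a hypothetical helper, and the module names in the example (hmsclient, airflow, pyarrow, metadata) are my assumption about what the pip packages above install, so adjust them if your versions differ.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed top-level module names for the packages installed above.
    wanted = ["hmsclient", "airflow", "pyarrow", "metadata"]
    missing = missing_modules(wanted)
    if missing:
        print("Not importable in this environment:", ", ".join(missing))
    else:
        print("All ingestion dependencies found")
```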

I hit a bug in the Hive connector. I filed it and the maintainers fixed it; in the meantime I hacked around it.
Configuring the connector raised more problems, so I opened an issue and hacked around the limitations.
Our final ingestion config looked like this:

import os
import pwd

# Kerberos username: the name of the user running the ingestion
current_user = pwd.getpwuid(os.getuid()).pw_name

config = """
{
  "source": {
    "type": "hive",
    "config": {
      "database": "wmf",
      "host_port": "analytics-hive.eqiad.wmnet",
      "service_name": "hive_test_cluster",
      "generate_sample_data": "true",
      "scheme": "hive",
      "query": "select * from {}.{} where year=2022 limit 50",
      "data_profiler_enabled": "false",
      "data_profiler_offset": "0",
      "data_profiler_limit": "50000",
      "connect_args": {
        "auth": "KERBEROS",
        "username": """ + '"' + current_user + '"' + """,
        "kerberos_service_name": "hive"
      }
    }
  },
  "sink": {
    "type": "metadata-rest",
    "config": {}
  },
  "metadata_server": {
    "type": "metadata-server",
    "config": {
      "api_endpoint": "http://localhost:8585/api",
      "auth_provider_type": "no-auth"
    }
  }
}
"""

(For the record, the initial notes, along with a failed attempt to compile MySQL 8, are here: https://app.slack.com/docs/T024KLHS4/F02V248G4BV?origin_team=T024KLHS4&origin_channel=C02UCDD7FKK . The relevant parts were copied above.)