
Evaluate Amundsen as a Data Catalog
Closed, Resolved · Public

Description

Conclusions

Amundsen is a good contender, but ultimately relies on Atlas for good Hive integration, and that would complicate our deployment quite a bit.

Pros
  • simple architecture: 3 Flask services, all in Python (as opposed to DataHub, which uses both Java and Python)
  • simple ingestion architecture: Python scripts or Airflow DAGs that make HTTP API requests
  • "social" UI features, like frequent users and owners
  • loose coupling means you can use a relational database as the data store rather than Neo4j (https://github.com/amundsen-io/amundsenrds)
Cons
  • the community seems to be losing steam: https://github.com/amundsen-io/amundsen#blog-posts-and-interviews has a flurry of events in 2019/2020 but nothing in 2021
  • only supports polling for data updates, unless we also deploy Atlas; a push ingest API is on their roadmap
  • documentation is somewhat lacking: few ingestion examples, and broken links in the docs
  • some dependencies are getting out of date: Elasticsearch 6 (v7 was released in 2019), Node.js 12 (v13 was released in 2019)

Run

Steps to Reproduce

Installation

  • Elasticsearch (we use OpenSearch here as well)
  • Neo4j, with some trouble setting up SSL: T300756#7677142
  • Configure and launch all the services as described in the documentation and T300756#7673715
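
Configuration mostly amounts to pointing each service at its backends. A hypothetical sketch of local config overrides for the metadata service (attribute names follow metadata_service/config.py; the host, port, and credentials below are placeholders, not our real values):

```python
# Hypothetical overrides; attribute names follow metadata_service/config.py.
# The host, port, and credentials are placeholders for a local test setup.
PROXY_HOST = 'bolt://localhost'   # Neo4j Bolt endpoint
PROXY_PORT = 7687
PROXY_USER = 'neo4j'
PROXY_PASSWORD = 'test'
```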

Ingestion

(see slack thread)

Event Timeline

As can be seen in the architecture diagram, there are 6 components:

image.png (759×600 px, 64 KB)

  • Databuilder ingestion framework (amundsen)
  • Elasticsearch (3rdparty)
  • Neo4j (3rdparty)
  • Search service (amundsen)
  • Frontend service (amundsen)
  • Metadata service (amundsen)

It looks like it uses OpenID Connect (OAuth) for authentication.

Here are the instructions for the metadata service: https://www.amundsen.io/amundsen/metadata/

I followed the steps for the metadata service:

python3 -m venv venv
export https_proxy=http://webproxy:8080
source venv/bin/activate
pip3 install amundsen-metadata
python3 metadata_service/metadata_wsgi.py

However, I got an SSL error when it tried to connect to Bolt, Neo4j's binary protocol:

INFO:metadata_service.proxy.neo4j_proxy:NEO4J endpoint: bolt://0.0.0.0:7687
DEBUG:neobolt:[#0000]  C: <RESOLVE> Address(host='0.0.0.0', port=7687)
DEBUG:neobolt:[#0000]  C: <OPEN> ('0.0.0.0', 7687)
DEBUG:neobolt:[#BC1E]  C: <SECURE> 0.0.0.0
INFO:werkzeug:127.0.0.1 - - [03/Feb/2022 06:53:29] "GET /healthcheck HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/app.py", line 2309, in __call__
    return self.wsgi_app(environ, start_response)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/app.py", line 2295, in wsgi_app
    response = self.handle_exception(e)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask_restful/__init__.py", line 271, in error_router
    return original_handler(e)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/app.py", line 1741, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/_compat.py", line 34, in reraise
    raise value.with_traceback(tb)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask_restful/__init__.py", line 271, in error_router
    return original_handler(e)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/_compat.py", line 34, in reraise
    raise value.with_traceback(tb)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask_restful/__init__.py", line 467, in wrapper
    resp = resource(*args, **kwargs)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/flask/views.py", line 87, in view
    self = view.view_class(*class_args, **class_kwargs)
  File "/srv/home/razzi/amundsen/metadata/metadata_service/api/healthcheck.py", line 18, in __init__
    self.client = get_proxy_client()
  File "/srv/home/razzi/amundsen/metadata/metadata_service/proxy/__init__.py", line 47, in get_proxy_client
    client_kwargs=client_kwargs)
  File "/srv/home/razzi/amundsen/metadata/metadata_service/proxy/neo4j_proxy.py", line 89, in __init__
    trust=trust)  # type: Driver
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/neo4j/__init__.py", line 120, in driver
    return Driver(uri, **config)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/neo4j/__init__.py", line 161, in __new__
    return subclass(uri, **config)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/neo4j/__init__.py", line 235, in __new__
    pool.release(pool.acquire())
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/neobolt/direct.py", line 715, in acquire
    return self.acquire_direct(self.address)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/neobolt/direct.py", line 608, in acquire_direct
    connection = self.connector(address, error_handler=self.connection_error_handler)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/neo4j/__init__.py", line 232, in connector
    return connect(address, **dict(config, **kwargs))
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/neobolt/direct.py", line 972, in connect
    raise last_error
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/neobolt/direct.py", line 963, in connect
    s, der_encoded_server_certificate = _secure(s, host, security_plan.ssl_context, **config)
  File "/srv/home/razzi/amundsen/metadata/venv/lib/python3.7/site-packages/neobolt/direct.py", line 854, in _secure
    s = ssl_context.wrap_socket(s, server_hostname=host if HAS_SNI and host else None)
  File "/usr/lib/python3.7/ssl.py", line 412, in wrap_socket
    session=session
  File "/usr/lib/python3.7/ssl.py", line 853, in _create
    self.do_handshake()
  File "/usr/lib/python3.7/ssl.py", line 1117, in do_handshake
    self._sslobj.do_handshake()
OSError: [Errno 0] Error

The Neo4j instance it is using is the same one that @BTullis set up for DataHub. I'm not sure how SSL works with Neo4j / Bolt, but perhaps the Amundsen client needs SSL disabled, as it's just a local, unencrypted Neo4j to my knowledge.
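
For what it's worth, an OSError during do_handshake is the classic symptom of a TLS client talking to a plaintext port. A stdlib-only sketch of that failure mode, using a throwaway local TCP server as a stand-in for an unencrypted Bolt port (everything here is illustrative, not Amundsen code):

```python
# Reproduce the failure mode: attempt a TLS handshake against a port
# that only speaks plaintext (as an unencrypted Neo4j Bolt port would).
import socket
import ssl
import threading

server = socket.socket()
server.bind(('127.0.0.1', 0))      # pick a free port
server.listen(1)
port = server.getsockname()[1]

def plaintext_server():
    conn, _ = server.accept()
    conn.sendall(b'not tls')       # reply in plaintext, as a non-TLS service would
    conn.close()

threading.Thread(target=plaintext_server, daemon=True).start()

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
raw = socket.create_connection(('127.0.0.1', port))
try:
    ctx.wrap_socket(raw)
    result = 'handshake succeeded'
except (ssl.SSLError, OSError):
    result = 'handshake failed'    # what the metadata service hit
finally:
    raw.close()
print(result)
```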

One other interesting thing about Amundsen is its documentation section on deploying the Amundsen metadata service on top of Atlas: https://www.amundsen.io/amundsen/metadata/#apache-atlas:

Apache Atlas is so far the only proxy in Amundsen supporting both push and pull for collecting metadata: - Push method by leveraging Apache Atlas Hive Hook. It’s an event listener running alongside Hive Metastore, translating Hive Metastore events into Apache Atlas entities and pushing them to Kafka topic, from which Apache Atlas ingests the data by internal processes

Indeed, having changes pushed (event-driven) rather than pulled (polling) would be useful. As I understand it, we have only proven pull-based metadata ingestion, with DataHub.

The SSL error has been fixed. Here's the current diff of the Amundsen repository at stat1008.eqiad.wmnet:/srv/home/razzi/amundsen showing the config changes:

razzi@stat1008:/srv/home/razzi/amundsen$ git diff
diff --git a/frontend/amundsen_application/config.py b/frontend/amundsen_application/config.py
index 13bf82fa..5455ecc0 100644
--- a/frontend/amundsen_application/config.py
+++ b/frontend/amundsen_application/config.py
@@ -52,7 +52,7 @@ class Config:
     POPULAR_RESOURCES_PERSONALIZATION = False  # type: bool

     # Request Timeout Configurations in Seconds
-    REQUEST_SESSION_TIMEOUT_SEC = 3
+    REQUEST_SESSION_TIMEOUT_SEC = 30

     # Frontend Application
     FRONTEND_BASE = ''
@@ -154,7 +154,8 @@ class LocalConfig(Config):
     METADATA_PORT = '5002'

     # If installing using the Docker bootstrap, this should be modified to the docker host ip.
-    LOCAL_HOST = '0.0.0.0'
+    LOCAL_HOST = '127.0.0.1'
+    # LOCAL_HOST = 'localhost'

     FRONTEND_BASE = os.environ.get('FRONTEND_BASE',
                                    'http://{LOCAL_HOST}:{PORT}'.format(
diff --git a/metadata/metadata_service/config.py b/metadata/metadata_service/config.py
index 98ff2b07..3d3ded07 100644
--- a/metadata/metadata_service/config.py
+++ b/metadata/metadata_service/config.py
@@ -50,7 +50,7 @@ class Config:
     PROXY_USER = os.environ.get('CREDENTIALS_PROXY_USER', 'neo4j')
     PROXY_PASSWORD = os.environ.get('CREDENTIALS_PROXY_PASSWORD', 'test')

-    PROXY_ENCRYPTED = True
+    PROXY_ENCRYPTED = False
     """Whether the connection to the proxy should use SSL/TLS encryption."""

     # Prior to enable PROXY_VALIDATE_SSL, you need to configure SSL.

I've now got Amundsen up and running, with an imported collection of Hive tables.
If you'd like to check it out you can do:

ssh -N -L 5000:localhost:5000 stat1008.eqiad.wmnet

Followed by http://localhost:5000 (the frontend port, which the tunnel forwards; 5002 is the metadata service, not the frontend)

There is no user management at the moment, so all access shares a single user.
The interface is almost entirely based around search, so entering a term into the search box is the best way to start exploring.

image.png (886×1 px, 144 KB)

The import was done with an ad-hoc Python script based on the example here: https://github.com/amundsen-io/amundsen/blob/main/databuilder/example/scripts/sample_data_loader.py
...with modifications to include the sample Hive metadata extractor here: https://www.amundsen.io/amundsen/databuilder/#hivetablemetadataextractor
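
For reference, the modification mostly amounts to swapping the sample loader's extractor config keys for the Hive ones. A minimal stdlib-only sketch of just the key construction (the literal key names are my reading of databuilder's HiveTableMetadataExtractor and SQLAlchemyExtractor constants; the connection string is a placeholder):

```python
# Sketch only: builds the config keys the Hive extractor reads.
# 'where_clause_suffix' and 'conn_string' are my reading of databuilder's
# constants; the connection string passed in below is a placeholder.
def hive_extractor_conf(conn_string: str, where_clause_suffix: str = '') -> dict:
    scope = 'extractor.hive_table_metadata'
    return {
        f'{scope}.where_clause_suffix': where_clause_suffix,
        f'{scope}.extractor.sqlalchemy.conn_string': conn_string,
    }

conf = hive_extractor_conf('mysql://hive:<password>@metastore-host/hive_metastore')
```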

The resulting script is at /home/razzi/amundsen/databuilder/hive_metadata_extractor.py on stat1008.
I used the MySQL password for the hive user, which I obtained from the hive-site.xml file on the coordinators, but I have since removed the password from the script.
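
Extracting that property can be scripted with the stdlib XML parser; a hedged sketch (javax.jdo.option.ConnectionPassword is the standard metastore password property, and the file path in the comment is an example):

```python
# Sketch: read one property out of a Hadoop-style XML config file such as
# hive-site.xml. The property name in the example below is the standard
# metastore password setting; the path is illustrative.
import xml.etree.ElementTree as ET

def hadoop_property(path: str, name: str) -> str:
    root = ET.parse(path).getroot()
    for prop in root.iter('property'):
        if prop.findtext('name') == name:
            return prop.findtext('value')
    raise KeyError(name)

# e.g. hadoop_property('/etc/hive/conf/hive-site.xml',
#                      'javax.jdo.option.ConnectionPassword')
```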

BTullis triaged this task as High priority. Feb 4 2022, 5:43 PM

I will try a Druid metadata import, to see how well that works.

I have successfully imported datasets from the analytics-druid cluster.

image.png (896×1 px, 107 KB)

The job configuration is like this:

from pyhocon import ConfigFactory

from databuilder.extractor.druid_metadata_extractor import DruidMetadataExtractor
from databuilder.extractor.sql_alchemy_extractor import SQLAlchemyExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.publisher.neo4j_csv_publisher import Neo4jCsvPublisher
from databuilder.task.task import DefaultTask

# neo4j_endpoint, neo4j_user and neo4j_password are defined earlier in the script.

def conn_string():
    # Druid's SQL-over-HTTP endpoint on the analytics cluster
    return 'druid+http://an-druid1001.eqiad.wmnet:8082/druid/v2/sql/'

def create_druid_publisher_job():
    tmp_folder = '/home/razzi/amundsen/databuilder/tmp'
    node_files_folder = f'{tmp_folder}/nodes'
    relationship_files_folder = f'{tmp_folder}/relationships'

    where_clause_suffix = ''
    job_config = ConfigFactory.from_dict({
        'extractor.druid_metadata.{}'.format(DruidMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
        'extractor.druid_metadata.extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): conn_string(),
        'loader.filesystem_csv_neo4j.node_dir_path': node_files_folder,
        'loader.filesystem_csv_neo4j.relationship_dir_path': relationship_files_folder,
        'publisher.neo4j.node_files_directory': node_files_folder,
        'publisher.neo4j.relation_files_directory': relationship_files_folder,
        'publisher.neo4j.neo4j_endpoint': neo4j_endpoint,
        'publisher.neo4j.neo4j_user': neo4j_user,
        'publisher.neo4j.neo4j_password': neo4j_password,
        'publisher.neo4j.neo4j_encrypted': False,  # local Neo4j, no TLS
        'publisher.neo4j.job_publish_tag': 'druid_analytics'})
    job = DefaultJob(
        conf=job_config,
        task=DefaultTask(
            extractor=DruidMetadataExtractor(),
            loader=FsNeo4jCSVLoader()),
        publisher=Neo4jCsvPublisher())
    job.launch()

Milimetric renamed this task from Technical evaluation of Amundsen to Evaluate Amundsen as a Data Catalog. Feb 9 2022, 9:20 PM
Milimetric updated the task description.