
Evaluate Atlas as a Data Catalog
Closed, Resolved, Public

Description

Conclusions

Apache Atlas seems like the kind of product you struggle to integrate, only to regret choosing it. It's complicated and not maintained in a way that makes it easy to deploy. We looked into it because of its reputedly strong integration with the Hadoop ecosystem, and especially with the Hive metastore. But the current version drops support for Hive 2.x and older branches, which is a major blocker for us. Beyond that, dropping that support doesn't seem like a good strategy, since so many people still run Hive 2.x.

Pros

  • Oldest project in this space, with integrations from many of the other candidates
  • Integration with Apache Ranger for fine-grained access control

Cons

  • Unresponsive community: we sent several messages to the mailing lists and nobody answered
  • Out-of-date and incorrect documentation
  • No backwards compatibility for Hive ingestion: we would have to migrate to Hadoop 3+, which is on our roadmap but would block this project for too long

Run

  • Tunnel with ssh -NL 21000:data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud:21000 data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud
  • Then open http://localhost:21000

Event Timeline

Currently adding Hive as a Docker container.

Found a Dockerfile for Hive: https://github.com/IBM/docker-hive/

Built a docker image:

razzi@data-catalog-evaluation:~/mnt/docker-hive$ docker build . -t hive-metastore:upstream

Ran the docker image:

docker run -d -p 9093:9093 hive-metastore:upstream (produced container id 8d66326230ead4)

Connected to the running container:

razzi@data-catalog-evaluation:~/mnt/docker-hive$ docker exec -it 8d66326230ead4 /bin/bash

Checked if port was open with curl:

hive@8d66326230ea:~$ curl localhost:9083
curl: (52) Empty reply from server

It gives an error, but it connects (good). In the Hive logs, I see an error indicating that it received the request (but it doesn't speak HTTP):

2022-01-13T22:46:54,844 ERROR [pool-8-thread-6] server.TThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client?
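
Because the metastore speaks Thrift rather than HTTP, curl can only show that something accepts the connection. A minimal sketch of a TCP-level probe that sends no HTTP request, assuming bash (for its /dev/tcp redirection) and coreutils `timeout` are available; `port_open` is a hypothetical helper name, not something from the Hive image:

```shell
# port_open HOST PORT: succeed if a TCP connection can be opened.
# Relies on bash's built-in /dev/tcp redirection; nothing is written
# to the socket, so no HTTP request reaches the server.
port_open() {
  probe_host=$1
  probe_port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${probe_host}/${probe_port}" 2>/dev/null
}

# Example: probe the metastore port used in this task.
if port_open localhost 9083; then
  echo "9083 accepts TCP connections"
else
  echo "9083 is not reachable"
fi
```

Run inside the container and again on the host, this separates "the service is down" from "the port is not reachable from here" without involving HTTP.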

However, from the host machine, curl did not connect:

razzi@data-catalog-evaluation:~$ curl localhost:9083
curl: (7) Failed to connect to localhost port 9083: Connection refused

This is strange because I have port forwarding from 9083. Since the host machine can't even connect to Hive, Atlas won't be able to either, since it's running in a separate Docker container.
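
One possible explanation, going only by the commands quoted in this comment: the `docker run` invocation publishes 9093:9093, whereas the metastore answered on 9083 inside the container. With that mapping nothing listens on host port 9083, which would account for the refused connection. The check can be sketched as below; `publishes_port` is a hypothetical helper, and it assumes the simple `[IP:][HOST:]CONTAINER[/proto]` form of the `-p` value (no port ranges):

```shell
# publishes_port MAPPING PORT: succeed if a value passed to `docker run -p`
# exposes the given container port. Hypothetical helper for illustration.
publishes_port() {
  mapping=$1
  wanted=$2
  container_port=${mapping##*:}          # text after the last colon
  container_port=${container_port%%/*}   # drop an optional /tcp or /udp suffix
  [ "$container_port" = "$wanted" ]
}

publishes_port 9093:9093 9083 || echo "9083 is not published by -p 9093:9093"
publishes_port 9083:9083 9083 && echo "9083 is published by -p 9083:9083"

# On the host, `docker port <container>` lists the mappings actually in effect.
# A run that publishes the metastore port would look like:
#   docker run -d -p 9083:9083 hive-metastore:upstream
```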

Incorporating the Hive Docker image into the Atlas docker-compose.yml (data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud:/home/razzi/apache-atlas-docker/docker-compose.yml) would probably work.
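
A rough sketch of what that addition might look like, as a config fragment rather than a tested setup. It assumes the Compose file has a top-level `services` key; the service name `hive-metastore` is made up for illustration, and the existing Atlas service definition is left out because its contents are not reproduced in this task. Services in the same Compose project share a default network and can resolve each other by service name, so Atlas could address the metastore as hive-metastore:9083 without going through the host.

```yaml
services:
  # ... existing Atlas service(s) stay as they are ...

  hive-metastore:                     # hypothetical service name
    image: hive-metastore:upstream    # the image built earlier in this comment
    ports:
      - "9083:9083"                   # optional: also expose the Thrift port on the host
```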

Atlas evaluation is complete. In summary, the main blocker to using Atlas is that the current version of Atlas is not compatible with our Hive version.

Milimetric renamed this task from Run Atlas on cloud services cluster to Evaluate Atlas as a Data Catalog. Feb 9 2022, 9:28 PM
Milimetric updated the task description.