
Run Atlas on test cluster
Open, HighPublic

Description

Connect Atlas to hive

Event Timeline

razzi removed razzi as the assignee of this task.Nov 29 2021, 7:42 PM
razzi edited projects, added User-razzi; removed Epic.
razzi moved this task from Default to Ready for action on the User-razzi board.

Progress: using a public docker-compose configuration (https://github.com/sonnyhcl/apache-atlas-docker), I have gotten Atlas 2.2 running on my local machine:

(screenshot: Atlas 2.2 running locally)

I have also learned the various dependencies of Atlas from the architecture diagram and the installation guide:

(screenshot: Atlas architecture diagram)

  • HBase (or BerkeleyDB)
  • Solr
  • Zookeeper
  • JanusGraph (can be configured to use Elasticsearch)
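
For context, Atlas selects these backends in atlas-application.properties. A minimal sketch, assuming the embedded BerkeleyDB/Solr setup (property names are from the Atlas configuration documentation; the values here are illustrative, not our actual config):

```properties
# Storage backend for JanusGraph: BerkeleyDB JE ("berkeleyje") or HBase
atlas.graph.storage.backend=berkeleyje
atlas.graph.storage.directory=data/berkeley
# Index backend for JanusGraph: Solr (Elasticsearch is also supported)
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=localhost:2181
```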

Here's what the Atlas UI looks like after loading the sample dataset by running /opt/atlas/bin/quick_start.py (username and password are both admin; loading takes about 10 minutes):

(screenshot: Atlas UI with the sample dataset loaded)

I attempted to run the install steps on an-test-coord1001, but the download requests timed out because Maven wasn't using the proxy:

razzi@an-test-coord1001:~/apache-atlas-sources-2.2.0$ mvn clean -DskipTests install
[INFO] Scanning for projects...
Downloading from central: https://repo1.maven.org/maven2/org/apache/apache/17/apache-17.pom
Downloading from hortonworks.repo: https://repo.hortonworks.com/content/repositories/releases/org/apache/apache/17/apache-17.pom
Downloading from apache.snapshots.repo: https://repository.apache.org/content/groups/snapshots/org/apache/apache/17/apache-17.pom
Downloading from apache-staging: https://repository.apache.org/content/groups/staging/org/apache/apache/17/apache-17.pom
Downloading from default: https://repository.apache.org/content/groups/public/org/apache/apache/17/apache-17.pom
Downloading from java.net-Public: https://maven.java.net/content/groups/public/org/apache/apache/17/apache-17.pom
Downloading from repository.jboss.org-public: https://repository.jboss.org/nexus/content/groups/public/org/apache/apache/17/apache-17.pom
Downloading from typesafe: https://repo.typesafe.com/typesafe/releases/org/apache/apache/17/apache-17.pom
[ERROR] [ERROR] Some problems were encountered while processing the POMs:
[FATAL] Non-resolvable parent POM for org.apache.atlas:apache-atlas:2.2.0: Could not transfer artifact org.apache:apache:pom:17 from/to central (https://repo1.maven.org/maven2): Connect to repo1.maven.org:443 [repo1.maven.org/146.75.36.209] failed: Connection timed out (Connection timed out) and 'parent.relativePath' points at wrong local POM @ line 23, column 13
 @
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]
[ERROR]   The project org.apache.atlas:apache-atlas:2.2.0 (/home/razzi/apache-atlas-sources-2.2.0/pom.xml) has 1 error
[ERROR]     Non-resolvable parent POM for org.apache.atlas:apache-atlas:2.2.0: Could not transfer artifact org.apache:apache:pom:17 from/to central (https://repo1.maven.org/maven2): Connect to repo1.maven.org:443 [repo1.maven.org/146.75.36.209] failed: Connection timed out (Connection timed out) and 'parent.relativePath' points at wrong local POM @ line 23, column 13 -> [Help 2]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
[ERROR] [Help 2] http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException

(This was despite the following proxy settings being exported:)

export http_proxy=http://webproxy:8080
export https_proxy=http://webproxy:8080

@Ottomata chimed in and gave me the mvn flag to enable proxies: -Djava.net.useSystemProxies=true
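
An alternative, for reference: Maven can also be pointed at the proxy persistently via ~/.m2/settings.xml. This is the standard Maven proxy schema, with host and port copied from the exports above:

```xml
<!-- ~/.m2/settings.xml: persistent proxy configuration for Maven -->
<settings>
  <proxies>
    <proxy>
      <id>webproxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>webproxy</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>
```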

I then got a different error:

razzi@an-test-coord1001:~/apache-atlas-sources-2.2.0$ mvn clean -DskipTests install -Djava.net.useSystemProxies=true
[...]
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.9.1:check (checkstyle-check) on project apache-atlas: Execution checkstyle-check of goal org.apache.maven.plugins:maven-checkstyle-plugin:2.9.1:check failed: Plugin org.apache.maven.plugins:maven-checkstyle-plugin:2.9.1 or one of its dependencies could not be resolved: Failure to find org.apache.atlas:atlas-buildtools:jar:1.0 in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :apache-atlas

@Milimetric had the idea to check the Maven repository at https://repo.maven.apache.org/maven2/org/apache/atlas/atlas-buildtools/, and sure enough, there is no atlas-buildtools 1.0, only 0.8.1. Changing the pom.xml to use 0.8.1 (this is a workaround; I'll email user@atlas.apache.org to ask whether this is a bug) got past that error, but then I hit a different one:
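
The version pin itself is a one-line edit. A sketch of the substitution, demonstrated here against a minimal stand-in rather than the real Atlas pom.xml (the sed address range keeps the change scoped to the atlas-buildtools dependency):

```shell
# Demo of the workaround: pin atlas-buildtools from the non-existent 1.0
# to 0.8.1, the newest version actually published to Maven Central.
# A minimal stand-in for the real pom.xml:
cat > pom.xml <<'EOF'
<dependency>
  <artifactId>atlas-buildtools</artifactId>
  <version>1.0</version>
</dependency>
EOF
# Restrict the substitution to the atlas-buildtools block so any other
# dependency that happens to be at version 1.0 is left alone.
sed -i '/atlas-buildtools/,/<\/version>/ s|<version>1.0</version>|<version>0.8.1</version>|' pom.xml
grep '<version>' pom.xml
```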

[...lots of output and 20 minutes later]
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  21:55 min
[INFO] Finished at: 2021-12-09T19:39:34Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.4:npm (npm install) on project atlas-dashboardv2: Failed to run task: 'npm install' failed. org.apache.commons.exec.ExecuteException: Process exited with an error: 254 (Exit value: 254) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :atlas-dashboardv2

The relevant output which caused the build to fail is:

[INFO] --- frontend-maven-plugin:1.4:npm (npm install) @ atlas-dashboardv2 ---
[INFO] Running 'npm install' in /sync/apache-atlas-sources-2.2.0/dashboardv2/target
[ERROR] npm ERR! code ENOENT
[ERROR] npm ERR! syscall open
[ERROR] npm ERR! path /sync/apache-atlas-sources-2.2.0/dashboardv2/target/node_modules/argparse/node_modules/sprintf-js/package.json.886288497
[ERROR] npm ERR! errno -2
[ERROR] npm ERR! enoent ENOENT: no such file or directory, open '/sync/apache-atlas-sources-2.2.0/dashboardv2/target/node_modules/argparse/node_modules/sprintf-js/package.json.886288497'
[ERROR] npm ERR! enoent This is related to npm not being able to find a file.
[ERROR] npm ERR! enoent
[ERROR]
[ERROR] npm ERR! A complete log of this run can be found in:
[ERROR] npm ERR!     /home/vagrant/.npm/_logs/2021-12-09T19_39_34_071Z-debug.log

Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.4:npm (npm install) on project atlas-dashboardv2: Failed to run task: 'npm install' failed. org.apache.commons.exec.ExecuteException: Process exited with an error: 254 (Exit value: 254) -> [Help 1]

Huh, yeah, npm isn't available in production. You could probably work around this by creating and activating a conda environment, which has npm. You could probably even just activate anaconda-wmf: source /usr/lib/anaconda-wmf/bin/activate, and then retry your mvn install process.
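
A sketch of that suggested workaround (the anaconda-wmf path is assumed; it only exists on WMF analytics hosts). Activating the conda environment puts its npm on $PATH before re-running the Maven build:

```shell
# Activate anaconda-wmf if it is present on this host (WMF-specific path)
if [ -f /usr/lib/anaconda-wmf/bin/activate ]; then
    . /usr/lib/anaconda-wmf/bin/activate
fi
# Report whether npm is now available, then retry the build
command -v npm || echo "npm still missing"
# mvn clean -DskipTests install -Djava.net.useSystemProxies=true
```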

Alternatively, could you mvn package instead of mvn install? If so, you might be able to build locally and then copy the resulting jars over to an-test-coord1001. Not sure what this would result in though; could be tons of jars to copy.

I notice that Atlas will need to contact a zookeeper cluster when it runs.
Whilst it might be possible to include a version of zookeeper and run it on an-test-coord1001, one other option is to use the zookeeper instance that is running on an-test-druid1001.

This is what I ended up doing when I was testing Alluxio on the test cluster and needed a zookeeper instance. See T266641#7291377 for details.

I created a follow-up ticket to T289056: Create analytics-test-eqiad zookeeper cluster but it hasn't been prioritized. Just wanted to mention it in case zookeeper was becoming a blocker in any way to getting Atlas tested.

@BTullis there is a test-zookeeper1002.eqiad.wmnet node, but it's not accessible from the analytics vlan. I think we should be able to punch a hole in the firewall for it using something like https://wikitech.wikimedia.org/wiki/Network_cheat_sheet#Edit_ACLs_for_Network_ports

@elukey care to weigh in on whether we're on the right track for zookeeper in the analytics test cluster?

We should punch a hole to test-zookeeper. However!

one other option is to use the zookeeper instance that is running on an-test-druid1001.

@razzi this would help you move forward now, it should be fine to use the druid test zk.

I have been doing some more work on this too, given its priority. I'm currently blocked by Kerberos when trying to import metadata from Hive.
I've joined the mailing list and sent a message to user@atlas.apache.org

https://lists.apache.org/thread/d3z4jlp6663dj45xmnk0rpg6fkjjowr6

I've got an instance of Atlas 3.0.0-SNAPSHOT currently running on port 21000 on an-test-coord1001.
I tried 2.2.0 first, but then replaced it with the tip of https://github.com/apache/atlas in order to address a logging bug.
That hasn't affected the Kerberos bug, though, so rolling back to 2.2.0 is still easy.

I decided to build it with the embedded BerkeleyDB and Apache Solr profile, which uses the Maven command:

mvn clean -DskipTests package -Pdist,berkeley-solr

This version starts a local zookeeper server as well, whereas some of the other profiles skip that.

Then I started trying to run bin/import-hive.sh in order to connect to Hive and import the metadata.

The first issue I found was a series of strange configuration errors.
The cause was that the script imports all of the .jar files from the hadoop classpath, which meant that an older, incompatible version of commons-configuration was picked up in preference to the version bundled with Atlas.

I modified the bin/import-hive.sh script as follows, to work around that issue.

# Multiple jars in HADOOP_CP_EXCLUDE_LIST can be added using "\|" separator
# Ex: HADOOP_CP_EXCLUDE_LIST="javax.ws.rs-api\|jersey-multipart"
HADOOP_CP_EXCLUDE_LIST="commons-configuration"
HADOOP_CP=
for i in $(hadoop classpath | tr : "\n")
do
  for j in $(find $i -name "*.jar" | grep -v "$HADOOP_CP_EXCLUDE_LIST")
  do
    HADOOP_CP="${HADOOP_CP}:$j"
  done
done
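
The exclude-list pattern above can be exercised on its own, away from the Hadoop classpath. A self-contained sketch, with invented jar names:

```shell
# Self-contained demo of the exclude pattern: build a colon-separated
# classpath from a directory of jars, skipping any jar whose name
# matches the exclude list. Jar names here are invented for the demo.
EXCLUDE_LIST="commons-configuration"
dir=$(mktemp -d)
touch "$dir/commons-configuration-1.6.jar" "$dir/commons-lang-2.6.jar"
CP=
for j in $(find "$dir" -name "*.jar" | grep -v "$EXCLUDE_LIST")
do
  CP="${CP}:$j"
done
echo "$CP"   # the commons-configuration jar is filtered out
```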

I've made several more steps forward on this, with sincere thanks to @elukey. Unfortunately we have now hit a serious blocker in terms of the Hive integration with Atlas.

The blocking element is this bug: https://issues.apache.org/jira/browse/ATLAS-3905

Essentially, Atlas versions 2.0 and above are incompatible with our Hive 2.3.6; they require Hive 3.1 or above. From this comment on the bug:

Atlas 2.1.0 uses Hive 3.1.0, if your local environment is with earlier hive version ( e.g 1.1.0 ) they are incompatible due the fact that your local hive does not know about the getDatabaseName method. Simply adding different hive-metastore-X.X.jar would not help as the jar have other dependencies too and they need to be satisfied. The easiest thing would be to migrate to hive-version compatible with your environment, either upgrade hive locally or downgrading to atlas 1.X until you migrate to the next hive version.

This is evident from the most recent stack trace that we have seen.

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.metastore.api.Database.getCatalogName()Ljava/lang/String;
	at org.apache.atlas.hive.bridge.HiveMetaStoreBridge.getDatabaseName(HiveMetaStoreBridge.java:688)
	at org.apache.atlas.hive.bridge.HiveMetaStoreBridge.toDbEntity(HiveMetaStoreBridge.java:668)
	at org.apache.atlas.hive.bridge.HiveMetaStoreBridge.toDbEntity(HiveMetaStoreBridge.java:660)
	at org.apache.atlas.hive.bridge.HiveMetaStoreBridge.registerDatabase(HiveMetaStoreBridge.java:527)
	at org.apache.atlas.hive.bridge.HiveMetaStoreBridge.importDatabases(HiveMetaStoreBridge.java:391)
	at org.apache.atlas.hive.bridge.HiveMetaStoreBridge.importDataDirectlyToAtlas(HiveMetaStoreBridge.java:351)
	at org.apache.atlas.hive.bridge.HiveMetaStoreBridge.main(HiveMetaStoreBridge.java:185)
Failed to import Hive Meta Data! Check logs at: /home/btullis/atlas/apache-atlas-3.0.0-SNAPSHOT/logs//import-hive.log for details.

Getting to this point required quite a bit of work to get Kerberos authentication working for the bin/import-hive.sh script, which I will summarise here:

Copies of both hive-site.xml and hadoop-site.xml were required in the $HIVE_CONF directory, along with a copy of atlas-application.properties

A jaas-application.properties file was required in the $ATLAS_CONF directory with the following content:

atlas.jaas.hive.loginModuleName = com.sun.security.auth.module.Krb5LoginModule
atlas.jaas.hive.loginModuleControlFlag = required
atlas.jaas.hive.option.useKeyTab = false
atlas.jaas.hive.option.useTicketCache = true
atlas.jaas.hive.option.storeKey = true
atlas.jaas.hive.option.principal = btullis@WIKIMEDIA

A jaas_hive.conf file was also required in the $ATLAS_CONF directory with the following content:

com.sun.security.jgss.krb5.initiate {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=false
   useTicketCache=true
   doNotPrompt=true
   principal="btullis@WIKIMEDIA"
   debug=true;
};

When the org.apache.atlas.hive.bridge.HiveMetaStoreBridge process was launched by the script, the following flags were passed to the JRE.

/usr/bin/java \
    -Datlas.log.dir=/home/btullis/atlas/apache-atlas-3.0.0-SNAPSHOT/logs/ \
    -Datlas.log.file=import-hive.log \
    -Dlog4j.configuration=atlas-hive-import-log4j.xml \
    -Djavax.security.auth.useSubjectCredsOnly=false \
    -Djava.security.auth.login.config=/home/btullis/atlas/apache-atlas-3.0.0-SNAPSHOT/conf/jaas_hive.conf \
    -Djava.security.krb5.conf=/etc/krb5.conf \
    -Dsun.security.krb5.debug=true \
    -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext \
    -cp <snip><snip> <lots of jars> \
    org.apache.atlas.hive.bridge.HiveMetaStoreBridge

Unfortunately, unless we want to continue the investigation with version 1.2.0 (released in June 2019), or do without the Hive integration, I can't see any other reasonable way forward with Atlas.

Or...can we upgrade Hive as recommended?

OK, agreed that is another way forward. I will look into it. I had assumed that it would be a lot more work than we had bargained for, but maybe not.

It might indeed be more than we bargained for. :/

In case it helps, the UI for Atlas on the test cluster can be accessed by using an SSH tunnel like so:
ssh -NL 21000:an-test-coord1001.eqiad.wmnet:21000 an-test-coord1001.eqiad.wmnet and then browsing to http://localhost:21000

Username and password are both *admin*

I haven't run bin/quick_start.py, so the sample data has not been generated; I was hoping to import metadata from Hive instead.