Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie
Open, Needs TriagePublic

Description

While working on T177458 we discovered that Hive and Oozie have some bugs in the way they handle JVM parameters that prevents the jmx agents to work properly. We'd need to figure out the root causes, fix them and finally enable the agents on analytics1003.

Oozie

For oozie I keep seeing the log (in catalina.out):

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:382)
	at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:397)
Caused by: java.lang.IllegalArgumentException: Collector already registered that provides name: jmx_scrape_duration_seconds
	at io.prometheus.jmx.shaded.io.prometheus.client.CollectorRegistry.register(CollectorRegistry.java:54)
	at io.prometheus.jmx.shaded.io.prometheus.client.Collector.register(Collector.java:128)
	at io.prometheus.jmx.shaded.io.prometheus.client.Collector.register(Collector.java:121)
	at io.prometheus.jmx.shaded.io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:36)
	... 6 more
FATAL ERROR in native method: processing of -javaagent failed

That is probably related to this duplication of JVM parameters:

Jan 10 14:19:36 hadoop-coordinator-1 oozie[7611]: Using   CATALINA_OPTS:        -Doozie.https.port=11443 -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:12000:/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml  -Doozie.https.port=11443 -Doozie.https.keystore.pass=password -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:12000:/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml -Djava.util.logging.config.file=/etc/oozie/logging.properties -Dderby.stream.error.file=/var/log/oozie/derby.log
I haven't found any open/closed bug but something doesn't sound right in either our config or oozie's internal one.

It seems an issue of the init.d script, but we are not sure where is the bug.

Hive

The hive server/metastore issue is more subtle: everything starts, the jmx agent returns metrics correctly but the daemons do not bind to their ports (so they are not working):

hive     13423  7.9  1.4 2744856 119332 ?      Sl   16:11   0:02 /usr/lib/jvm/default-java/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-server2.log -Dhive.log.threshold=INFO -Xmx2048m -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:12000:/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml -Dcom.sun.management.jmxremote.port=9978 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -
Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.VersionInfo

elukey@hadoop-coordinator-1:~$ curl 10.68.19.233:12000/metrics -s | grep -v "#" | sort
jmx_config_reload_failure_total 0.0
jmx_config_reload_success_total 0.0
jmx_scrape_duration_seconds 0.00108968
jmx_scrape_error 0.0
jvm_classes_loaded 1858.0
jvm_classes_loaded_total 1858.0
jvm_classes_unloaded_total 0.0
jvm_gc_collection_seconds_count{gc="PS MarkSweep",} 0.0
jvm_gc_collection_seconds_count{gc="PS Scavenge",} 2.0
jvm_gc_collection_seconds_sum{gc="PS MarkSweep",} 0.0
jvm_gc_collection_seconds_sum{gc="PS Scavenge",} 0.027
jvm_info{version="1.7.0_151-b01",vendor="Oracle Corporation",} 1.0
jvm_memory_bytes_committed{area="heap",} 1.23731968E8
jvm_memory_bytes_committed{area="nonheap",} 2.4576E7
jvm_memory_bytes_max{area="heap",} 1.908932608E9
jvm_memory_bytes_max{area="nonheap",} 2.24395264E8
jvm_memory_bytes_used{area="heap",} 1.6198616E7
jvm_memory_bytes_used{area="nonheap",} 1.193108E7
jvm_memory_pool_bytes_committed{pool="Code Cache",} 2555904.0
jvm_memory_pool_bytes_committed{pool="PS Eden Space",} 3.3554432E7
jvm_memory_pool_bytes_committed{pool="PS Old Gen",} 8.7031808E7
jvm_memory_pool_bytes_committed{pool="PS Perm Gen",} 2.2020096E7
jvm_memory_pool_bytes_committed{pool="PS Survivor Space",} 3145728.0
jvm_memory_pool_bytes_max{pool="Code Cache",} 5.0331648E7
jvm_memory_pool_bytes_max{pool="PS Eden Space",} 7.077888E8
jvm_memory_pool_bytes_max{pool="PS Old Gen",} 1.43130624E9
jvm_memory_pool_bytes_max{pool="PS Perm Gen",} 1.74063616E8
jvm_memory_pool_bytes_max{pool="PS Survivor Space",} 3145728.0
jvm_memory_pool_bytes_used{pool="Code Cache",} 689024.0
jvm_memory_pool_bytes_used{pool="PS Eden Space",} 1.333768E7
jvm_memory_pool_bytes_used{pool="PS Old Gen",} 8192.0
jvm_memory_pool_bytes_used{pool="PS Perm Gen",} 1.1242056E7
jvm_memory_pool_bytes_used{pool="PS Survivor Space",} 2852744.0
jvm_threads_current 15.0
jvm_threads_daemon 10.0
jvm_threads_deadlocked 0.0
jvm_threads_deadlocked_monitor 0.0
jvm_threads_peak 15.0
jvm_threads_started_total 18.0
process_cpu_seconds_total 2.98
process_max_fds 65536.0
process_open_fds 433.0
process_resident_memory_bytes 1.2312576E8
process_start_time_seconds 1.515600666772E9
process_virtual_memory_bytes 2.812837888E9

No indication in the logs/journalctl/syslog of what is happening. systemctl status hive* returns all green.

elukey created this task.Fri, Jan 12, 11:46 AM
elukey claimed this task.Fri, Jan 12, 11:54 AM
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.
elukey moved this task from Backlog to In Progress on the User-Elukey board.Fri, Jan 12, 2:29 PM
elukey added a comment.EditedFri, Jan 12, 3:19 PM

In https://oozie.apache.org/docs/4.1.0/AG_Install.html -> Advanced/Custom Environment Settings I can't see any CATALINA_OPTS listed for oozie-env.sh, so we might need to set it elsewhere.

EDIT: not true, it is listed above as usable in oozie-env.sh, back to blame the init.d script.

elukey added a comment.EditedFri, Jan 12, 3:47 PM

This is the journalctl snippet of oozie sourcing various files:

Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Setting OOZIE_HOME:          /usr/lib/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Sourcing:                    /usr/lib/oozie/bin/oozie-env.sh
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_DATA=/var/lib/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_TMPDIR=/var/lib/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_PID=/var/run/oozie/oozie.pid
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_BASE=/var/lib/oozie/tomcat-deployment
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_HTTPS_PORT=11443
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_HTTPS_KEYSTORE_PASS=password
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.port=${OOZIE_HTTPS_PORT}"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.keystore.pass=${OOZIE_HTTPS_KEYSTORE_PASS}"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Xmx2048m"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_CONFIG=/etc/oozie/conf
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_LOG=/var/log/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Using   OOZIE_CONFIG:        /etc/oozie/conf

Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Sourcing:                    /etc/oozie/conf/oozie-env.sh
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_DATA=/var/lib/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_TMPDIR=/var/lib/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_PID=/var/run/oozie/oozie.pid
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_BASE=/var/lib/oozie/tomcat-deployment
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_HTTPS_PORT=11443
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_HTTPS_KEYSTORE_PASS=password
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.port=${OOZIE_HTTPS_PORT}"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.keystore.pass=${OOZIE_HTTPS_KEYSTORE_PASS}"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Xmx2048m"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_CONFIG=/etc/oozie/conf
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_LOG=/var/log/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

The highlight should be on these two lines:

Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Sourcing:                    /usr/lib/oozie/bin/oozie-env.sh
[..]
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Sourcing:                    /etc/oozie/conf/oozie-env.sh

These two files are the same:

elukey@hadoop-coordinator-1:~$ ls -l /usr/lib/oozie/bin/oozie-env.sh
lrwxrwxrwx 1 root root 28 Jan 20  2017 /usr/lib/oozie/bin/oozie-env.sh -> /etc/oozie/conf/oozie-env.sh

This is the call chain:

/etc/init.d/oozie start() --> /usr/lib/oozie/bin/oozied.sh --> /usr/lib/oozie/bin/oozie-sys.sh,

oozie-sys.sh executes print "Setting OOZIE_HOME: ${OOZIE_HOME}" and then:

# if the installation has a env file, source it
# this is for native packages installations
#
if [ -e "${OOZIE_HOME}/bin/oozie-env.sh" ]; then
  print "Sourcing:                    ${OOZIE_HOME}/bin/oozie-env.sh"
  source ${OOZIE_HOME}/bin/oozie-env.sh
  grep "^ *export " ${OOZIE_HOME}/bin/oozie-env.sh | sed 's/ *export/  setting/'
fi

This one calls /usr/lib/oozie/bin/oozie-env.sh that sets CATALINA_OPTS.

Finally found the root cause. Each time that oozied.sh does start/stop from the init.d's script it starts with a clean environment. Then the duplication happens in oozie-sys.sh due to the symlink pointed out above and these:

if [ -e "${OOZIE_HOME}/bin/oozie-env.sh" ]; then
  print "Sourcing:                    ${OOZIE_HOME}/bin/oozie-env.sh"
  source ${OOZIE_HOME}/bin/oozie-env.sh
  grep "^ *export " ${OOZIE_HOME}/bin/oozie-env.sh | sed 's/ *export/  setting/'
fi

[..]

# if the configuration dir has a env file, source it
#
if [ -e "${OOZIE_CONFIG}/oozie-env.sh" ]; then
  print "Sourcing:                    ${OOZIE_CONFIG}/oozie-env.sh"
  source ${OOZIE_CONFIG}/oozie-env.sh
  grep "^ *export " ${OOZIE_CONFIG}/oozie-env.sh | sed 's/ *export/  setting/'
fi

If we find why and how the symlink is done we should be able to solve the issue.

The problem seems to be in the oozie debian package itself:

elukey@hadoop-coordinator-1:~/oozie-4.1.0+cdh5.10.0+389/debian$ grep -rni oozie-env.sh
install_oozie.sh:199:cp ${EXTRA_DIR}/oozie-env.sh ${CONF_DIR}
install_oozie.sh:218:ln -s -f /etc/oozie/conf/oozie-env.sh ${SERVER_LIB_DIR}/bin
oozie.init:79:. /usr/lib/oozie/bin/oozie-env.sh

Called via rules:

override_dh_auto_install:
        bash -x debian/install_oozie.sh --extra-dir=debian/ --build-dir=. --server-dir=./debian/oozie --client-dir=./debian/oozie-client --docs-dir=./debian/oozie-client/usr/share/doc/oozie --initd-dir=./debian/oozie/etc/init.d --conf-dir=./debian/oozie/etc/oozie/conf.dist

About Hive, I tried to re-apply the changes to the metastore and this is the difference in ps:

elukey@hadoop-coordinator-1:~$ ps aux | grep hive
hive     18157  3.6  5.3 2778096 435044 ?      Sl   14:51   0:11 /usr/lib/jvm/default-java/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Xmx2048m -Dcom.sun.management.jmxremote.port=9979 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/lib/hive/lib/hive-service-1.1.0-cdh5.10.0.jar org.apache.hadoop.hive.metastore.HiveMetaStore

elukey@hadoop-coordinator-1:~$ ps aux | grep hive
hive     17406  0.0  0.0  13348  1752 ?        S    14:50   0:00 bash /usr/lib/hive/bin/hive --service metastore
hive     17407  3.5  1.4 2740744 116300 ?      Sl   14:50   0:01 /usr/lib/jvm/default-java/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Xmx2048m -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:9183:/etc/hive/prometheus_hive_jmx_exporter.yaml -Dcom.sun.management.jmxremote.port=9979 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.VersionInfo
hive     17408  0.0  0.0   7412   940 ?        S    14:50   0:00 awk {if (NR == 1) {print $2;}}

The latter seems to be missing the final call for org.apache.hadoop.hive.metastore.HiveMetaStore, hence it kinda explains why ports are not binding (the service is not started).

So with a better ps what happens is clear:

hive       318  0.0  0.0  13396  3180 ?        S    15:35   0:00 bash /usr/lib/hive/bin/hive --service metastore
hive       336  0.0  0.0  13396  1800 ?        S    15:35   0:00  \_ bash /usr/lib/hive/bin/hive --service metastore
hive       339  0.8  1.3 2740744 113688 ?      Sl   15:35   0:02      \_ /usr/lib/jvm/default-java/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Xmx2048m -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:9183:/etc/hive/prometheus_hive_jmx_exporter.yaml -Dcom.sun.management.jmxremote.port=9979 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.VersionInfo
hive       340  0.0  0.0   7412   992 ?        S    15:35   0:00      \_ awk {if (NR == 1) {print $2;}}

I've added set -x to /usr/lib/hive/bin/hive (that is called by the init script) and traced where the above awk gets blocked:

# Make sure we're using a compatible version of Hadoop
if [ "x$HADOOP_VERSION" == "x" ]; then
    HADOOP_VERSION=$($HADOOP version | awk '{if (NR == 1) {print $2;}}');
fi

So the HADOOP_OPTS that contains the javagent cause the hadoop command to hang and it doesn't return anymore, ending up not starting the hive jars.

I added a timeout 3 bash command and it worked fine, but then a similar issue re-happened when I tried to restart the metastore service. Hive is surely not really able to run the jmx_agent at the moment, and oozie is on a similar boat. I am a bit worried about the other hadoop daemons though, everything went fine up to now but not sure if the confusing hadoop init.d scripts might have some bug waiting to bite in the future.

Maybe an option could be to have the jmx exporter to run as standalone daemon?

In metastore.sh it is used the HIVE_METASTORE_HADOOP_OPTS that works fine (just tested), but there seems to be no equivalent for Hive Server (https://issues.apache.org/jira/browse/HIVE-12582).