Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	elukey
	Jan 12 2018, 11:46 AM

Description

While working on T177458 we discovered that Hive and Oozie have some bugs in the way they handle JVM parameters that prevents the jmx agents to work properly. We'd need to figure out the root causes, fix them and finally enable the agents on analytics1003.

Oozie

For oozie I keep seeing the log (in catalina.out):

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:382)
	at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:397)
Caused by: java.lang.IllegalArgumentException: Collector already registered that provides name: jmx_scrape_duration_seconds
	at io.prometheus.jmx.shaded.io.prometheus.client.CollectorRegistry.register(CollectorRegistry.java:54)
	at io.prometheus.jmx.shaded.io.prometheus.client.Collector.register(Collector.java:128)
	at io.prometheus.jmx.shaded.io.prometheus.client.Collector.register(Collector.java:121)
	at io.prometheus.jmx.shaded.io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:36)
	... 6 more
FATAL ERROR in native method: processing of -javaagent failed

That is probably related to this duplication of JVM parameters:

Jan 10 14:19:36 hadoop-coordinator-1 oozie[7611]: Using   CATALINA_OPTS:        -Doozie.https.port=11443 -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:12000:/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml  -Doozie.https.port=11443 -Doozie.https.keystore.pass=password -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:12000:/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml -Djava.util.logging.config.file=/etc/oozie/logging.properties -Dderby.stream.error.file=/var/log/oozie/derby.log
I haven't found any open/closed bug but something doesn't sound right in either our config or oozie's internal one.

It seems an issue of the init.d script, but we are not sure where is the bug.

Hive

The hive server/metastore issue is more subtle: everything starts, the jmx agent returns metrics correctly but the daemons do not bind to their ports (so they are not working):

hive     13423  7.9  1.4 2744856 119332 ?      Sl   16:11   0:02 /usr/lib/jvm/default-java/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-server2.log -Dhive.log.threshold=INFO -Xmx2048m -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:12000:/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml -Dcom.sun.management.jmxremote.port=9978 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -
Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.VersionInfo

elukey@hadoop-coordinator-1:~$ curl 10.68.19.233:12000/metrics -s | grep -v "#" | sort
jmx_config_reload_failure_total 0.0
jmx_config_reload_success_total 0.0
jmx_scrape_duration_seconds 0.00108968
jmx_scrape_error 0.0
jvm_classes_loaded 1858.0
jvm_classes_loaded_total 1858.0
jvm_classes_unloaded_total 0.0
jvm_gc_collection_seconds_count{gc="PS MarkSweep",} 0.0
jvm_gc_collection_seconds_count{gc="PS Scavenge",} 2.0
jvm_gc_collection_seconds_sum{gc="PS MarkSweep",} 0.0
jvm_gc_collection_seconds_sum{gc="PS Scavenge",} 0.027
jvm_info{version="1.7.0_151-b01",vendor="Oracle Corporation",} 1.0
jvm_memory_bytes_committed{area="heap",} 1.23731968E8
jvm_memory_bytes_committed{area="nonheap",} 2.4576E7
jvm_memory_bytes_max{area="heap",} 1.908932608E9
jvm_memory_bytes_max{area="nonheap",} 2.24395264E8
jvm_memory_bytes_used{area="heap",} 1.6198616E7
jvm_memory_bytes_used{area="nonheap",} 1.193108E7
jvm_memory_pool_bytes_committed{pool="Code Cache",} 2555904.0
jvm_memory_pool_bytes_committed{pool="PS Eden Space",} 3.3554432E7
jvm_memory_pool_bytes_committed{pool="PS Old Gen",} 8.7031808E7
jvm_memory_pool_bytes_committed{pool="PS Perm Gen",} 2.2020096E7
jvm_memory_pool_bytes_committed{pool="PS Survivor Space",} 3145728.0
jvm_memory_pool_bytes_max{pool="Code Cache",} 5.0331648E7
jvm_memory_pool_bytes_max{pool="PS Eden Space",} 7.077888E8
jvm_memory_pool_bytes_max{pool="PS Old Gen",} 1.43130624E9
jvm_memory_pool_bytes_max{pool="PS Perm Gen",} 1.74063616E8
jvm_memory_pool_bytes_max{pool="PS Survivor Space",} 3145728.0
jvm_memory_pool_bytes_used{pool="Code Cache",} 689024.0
jvm_memory_pool_bytes_used{pool="PS Eden Space",} 1.333768E7
jvm_memory_pool_bytes_used{pool="PS Old Gen",} 8192.0
jvm_memory_pool_bytes_used{pool="PS Perm Gen",} 1.1242056E7
jvm_memory_pool_bytes_used{pool="PS Survivor Space",} 2852744.0
jvm_threads_current 15.0
jvm_threads_daemon 10.0
jvm_threads_deadlocked 0.0
jvm_threads_deadlocked_monitor 0.0
jvm_threads_peak 15.0
jvm_threads_started_total 18.0
process_cpu_seconds_total 2.98
process_max_fds 65536.0
process_open_fds 433.0
process_resident_memory_bytes 1.2312576E8
process_start_time_seconds 1.515600666772E9
process_virtual_memory_bytes 2.812837888E9

No indication in the logs/journalctl/syslog of what is happening. systemctl status hive* returns all green.

Details

Subject	Repo	Branch	Lines +/-
hive: introduce HIVE_SERVER2_HADOOP_OPTS	operations/puppet/cdh	master	+67 -1
role::analytics_cluster_coordinator: enable prometheus metrics for hive	operations/puppet	production	+2 -2
hive: replace HADOOP_OPTS with HIVE_SERVER2_HADOOP_OPTS for hive-server2	operations/puppet/cdh	master	+2 -2
role::prometheus::analytics: add Hive Prometheus poll config	operations/puppet	production	+13 -2
role::analytics_cluster::coordinator: rollback oozie prometheus config	operations/puppet	production	+1 -10
role::analytics_cluster::coordinator: enable mon. for oozie\|hive	operations/puppet	production	+2 -2
hive-env.sh: use HIVE_METASTORE_HADOOP_OPTS to configure the metastore	operations/puppet/cdh	master	+2 -2
profile::oozie::server: remove symlink installed by Oozie's package	operations/puppet	production	+9 -0
Update the cdh module to its latest change	operations/puppet	production	+1 -1

Customize query in gerrit

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		None	T175344 Move away from jmxtrans in favor of prometheus jmx_exporter
		Resolved		elukey	T184794 Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie

Event Timeline

elukey created this task.Jan 12 2018, 11:46 AM

elukey claimed this task.Jan 12 2018, 11:54 AM

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

elukey moved this task from Backlog to In Progress on the User-Elukey board.Jan 12 2018, 2:29 PM

In https://oozie.apache.org/docs/4.1.0/AG_Install.html -> Advanced/Custom Environment Settings I can't see any CATALINA_OPTS listed for oozie-env.sh, so we might need to set it elsewhere.

EDIT: not true, it is listed above as usable in oozie-env.sh, back to blame the init.d script.

This is the journalctl snippet of oozie sourcing various files:

Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Setting OOZIE_HOME:          /usr/lib/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Sourcing:                    /usr/lib/oozie/bin/oozie-env.sh
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_DATA=/var/lib/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_TMPDIR=/var/lib/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_PID=/var/run/oozie/oozie.pid
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_BASE=/var/lib/oozie/tomcat-deployment
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_HTTPS_PORT=11443
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_HTTPS_KEYSTORE_PASS=password
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.port=${OOZIE_HTTPS_PORT}"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.keystore.pass=${OOZIE_HTTPS_KEYSTORE_PASS}"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Xmx2048m"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_CONFIG=/etc/oozie/conf
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_LOG=/var/log/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Using   OOZIE_CONFIG:        /etc/oozie/conf

Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Sourcing:                    /etc/oozie/conf/oozie-env.sh
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_DATA=/var/lib/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_TMPDIR=/var/lib/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_PID=/var/run/oozie/oozie.pid
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_BASE=/var/lib/oozie/tomcat-deployment
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_HTTPS_PORT=11443
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_HTTPS_KEYSTORE_PASS=password
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.port=${OOZIE_HTTPS_PORT}"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.keystore.pass=${OOZIE_HTTPS_KEYSTORE_PASS}"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting CATALINA_OPTS="$CATALINA_OPTS -Xmx2048m"
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_CONFIG=/etc/oozie/conf
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting OOZIE_LOG=/var/log/oozie
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: setting JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

The highlight should be on these two lines:

Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Sourcing:                    /usr/lib/oozie/bin/oozie-env.sh
[..]
Jan 12 15:36:21 hadoop-coordinator-1 oozie[9079]: Sourcing:                    /etc/oozie/conf/oozie-env.sh

These two files are the same:

elukey@hadoop-coordinator-1:~$ ls -l /usr/lib/oozie/bin/oozie-env.sh
lrwxrwxrwx 1 root root 28 Jan 20  2017 /usr/lib/oozie/bin/oozie-env.sh -> /etc/oozie/conf/oozie-env.sh

This is the call chain:

/etc/init.d/oozie start() --> /usr/lib/oozie/bin/oozied.sh --> /usr/lib/oozie/bin/oozie-sys.sh,

oozie-sys.sh executes print "Setting OOZIE_HOME: ${OOZIE_HOME}" and then:

# if the installation has a env file, source it
# this is for native packages installations
#
if [ -e "${OOZIE_HOME}/bin/oozie-env.sh" ]; then
  print "Sourcing:                    ${OOZIE_HOME}/bin/oozie-env.sh"
  source ${OOZIE_HOME}/bin/oozie-env.sh
  grep "^ *export " ${OOZIE_HOME}/bin/oozie-env.sh | sed 's/ *export/  setting/'
fi

This one calls /usr/lib/oozie/bin/oozie-env.sh that sets CATALINA_OPTS.

Finally found the root cause. Each time that oozied.sh does start/stop from the init.d's script it starts with a clean environment. Then the duplication happens in oozie-sys.sh due to the symlink pointed out above and these:

if [ -e "${OOZIE_HOME}/bin/oozie-env.sh" ]; then
  print "Sourcing:                    ${OOZIE_HOME}/bin/oozie-env.sh"
  source ${OOZIE_HOME}/bin/oozie-env.sh
  grep "^ *export " ${OOZIE_HOME}/bin/oozie-env.sh | sed 's/ *export/  setting/'
fi

[..]

# if the configuration dir has a env file, source it
#
if [ -e "${OOZIE_CONFIG}/oozie-env.sh" ]; then
  print "Sourcing:                    ${OOZIE_CONFIG}/oozie-env.sh"
  source ${OOZIE_CONFIG}/oozie-env.sh
  grep "^ *export " ${OOZIE_CONFIG}/oozie-env.sh | sed 's/ *export/  setting/'
fi

If we find why and how the symlink is done we should be able to solve the issue.

The problem seems to be in the oozie debian package itself:

elukey@hadoop-coordinator-1:~/oozie-4.1.0+cdh5.10.0+389/debian$ grep -rni oozie-env.sh
install_oozie.sh:199:cp ${EXTRA_DIR}/oozie-env.sh ${CONF_DIR}
install_oozie.sh:218:ln -s -f /etc/oozie/conf/oozie-env.sh ${SERVER_LIB_DIR}/bin
oozie.init:79:. /usr/lib/oozie/bin/oozie-env.sh

Called via rules:

override_dh_auto_install:
        bash -x debian/install_oozie.sh --extra-dir=debian/ --build-dir=. --server-dir=./debian/oozie --client-dir=./debian/oozie-client --docs-dir=./debian/oozie-client/usr/share/doc/oozie --initd-dir=./debian/oozie/etc/init.d --conf-dir=./debian/oozie/etc/oozie/conf.dist

Tried to open https://community.cloudera.com/t5/CDH-Manual-Installation/Oozie-duplicates-CATALINA-OPTS-variables-in-oozie-env-sh/m-p/63654#M1607, not sure if it is the best place but let's see if anybody answers.

About Hive, I tried to re-apply the changes to the metastore and this is the difference in ps:

elukey@hadoop-coordinator-1:~$ ps aux | grep hive
hive     18157  3.6  5.3 2778096 435044 ?      Sl   14:51   0:11 /usr/lib/jvm/default-java/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Xmx2048m -Dcom.sun.management.jmxremote.port=9979 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/lib/hive/lib/hive-service-1.1.0-cdh5.10.0.jar org.apache.hadoop.hive.metastore.HiveMetaStore

elukey@hadoop-coordinator-1:~$ ps aux | grep hive
hive     17406  0.0  0.0  13348  1752 ?        S    14:50   0:00 bash /usr/lib/hive/bin/hive --service metastore
hive     17407  3.5  1.4 2740744 116300 ?      Sl   14:50   0:01 /usr/lib/jvm/default-java/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Xmx2048m -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:9183:/etc/hive/prometheus_hive_jmx_exporter.yaml -Dcom.sun.management.jmxremote.port=9979 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.VersionInfo
hive     17408  0.0  0.0   7412   940 ?        S    14:50   0:00 awk {if (NR == 1) {print $2;}}

The latter seems to be missing the final call for org.apache.hadoop.hive.metastore.HiveMetaStore, hence it kinda explains why ports are not binding (the service is not started).

So with a better ps what happens is clear:

hive       318  0.0  0.0  13396  3180 ?        S    15:35   0:00 bash /usr/lib/hive/bin/hive --service metastore
hive       336  0.0  0.0  13396  1800 ?        S    15:35   0:00  \_ bash /usr/lib/hive/bin/hive --service metastore
hive       339  0.8  1.3 2740744 113688 ?      Sl   15:35   0:02      \_ /usr/lib/jvm/default-java/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Xmx2048m -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:9183:/etc/hive/prometheus_hive_jmx_exporter.yaml -Dcom.sun.management.jmxremote.port=9979 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.VersionInfo
hive       340  0.0  0.0   7412   992 ?        S    15:35   0:00      \_ awk {if (NR == 1) {print $2;}}

I've added set -x to /usr/lib/hive/bin/hive (that is called by the init script) and traced where the above awk gets blocked:

# Make sure we're using a compatible version of Hadoop
if [ "x$HADOOP_VERSION" == "x" ]; then
    HADOOP_VERSION=$($HADOOP version | awk '{if (NR == 1) {print $2;}}');
fi

So the HADOOP_OPTS that contains the javagent cause the hadoop command to hang and it doesn't return anymore, ending up not starting the hive jars.

I added a timeout 3 bash command and it worked fine, but then a similar issue re-happened when I tried to restart the metastore service. Hive is surely not really able to run the jmx_agent at the moment, and oozie is on a similar boat. I am a bit worried about the other hadoop daemons though, everything went fine up to now but not sure if the confusing hadoop init.d scripts might have some bug waiting to bite in the future.

Maybe an option could be to have the jmx exporter to run as standalone daemon?

In metastore.sh it is used the HIVE_METASTORE_HADOOP_OPTS that works fine (just tested), but there seems to be no equivalent for Hive Server (https://issues.apache.org/jira/browse/HIVE-12582).

Mentioned in SAL (#wikimedia-operations) [2018-02-02T13:59:48Z] <elukey> reboot meitnerium via gnt-instance reboot on ganeti1005 to pick up new disk config - T184794

elukey moved this task from In Progress to Stalled on the User-Elukey board.Feb 13 2018, 11:36 AM

elukey mentioned this in T166248: Upgrade Analytics Cluster to Java 8.Feb 14 2018, 5:12 PM

Change 411192 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::oozie::server: remove symlink installed by Oozie's package

https://gerrit.wikimedia.org/r/411192

Change 411192 merged by Elukey:
[operations/puppet@production] profile::oozie::server: remove symlink installed by Oozie's package

https://gerrit.wikimedia.org/r/411192

Change 411195 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hive-env.sh: use HIVE_METASTORE_HADOOP_OPTS to configure the metastore

https://gerrit.wikimedia.org/r/411195

Change 411195 merged by Elukey:
[operations/puppet/cdh@master] hive-env.sh: use HIVE_METASTORE_HADOOP_OPTS to configure the metastore

https://gerrit.wikimedia.org/r/411195

Change 411197 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Update the cdh module to its latest change

https://gerrit.wikimedia.org/r/411197

Change 411197 merged by Elukey:
[operations/puppet@production] Update the cdh module to its latest change

https://gerrit.wikimedia.org/r/411197

elukey moved this task from Stalled to In Progress on the User-Elukey board.Feb 20 2018, 5:44 PM

Change 413189 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::coordinator: enable mon. for oozie|hive

https://gerrit.wikimedia.org/r/413189

Change 413189 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::coordinator: enable mon. for oozie|hive

https://gerrit.wikimedia.org/r/413189

Change 413316 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::coordinator: rollback oozie prometheus config

https://gerrit.wikimedia.org/r/413316

Change 413316 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::coordinator: rollback oozie prometheus config

https://gerrit.wikimedia.org/r/413316

Change 413320 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::analytics: add Hive Prometheus poll config

https://gerrit.wikimedia.org/r/413320

Change 413320 merged by Elukey:
[operations/puppet@production] role::prometheus::analytics: add Hive Prometheus poll config

https://gerrit.wikimedia.org/r/413320

created https://grafana.wikimedia.org/dashboard/db/analytics-hive for the Hive Metastore.

Oozie failed to boot due to the following, so I had to rollback:

Feb 22, 2018 8:38:17 AM org.apache.catalina.core.StandardContext resourcesStart
SEVERE: Error starting static Resources
java.lang.IllegalArgumentException: Invalid or unreadable WAR file : zip file is empty
        at org.apache.naming.resources.WARDirContext.setDocBase(WARDirContext.java:135)
        at org.apache.catalina.core.StandardContext.resourcesStart(StandardContext.java:4350)
        at org.apache.catalina.core.StandardContext.start(StandardContext.java:4523)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:803)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:780)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:583)
        at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:944)
        at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:779)
        at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:505)
        at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1388)
        at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:307)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
        at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1393)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1657)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1666)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1646)
        at java.lang.Thread.run(Thread.java:748)

Apparently this is due to the absence of the /usr/lib/oozie/bin/oozie-env.sh link to /etc/hive/conf/hive-env.sh

elukey moved this task from In Progress to Paused on the Analytics-Kanban board.Feb 22 2018, 5:01 PM

elukey moved this task from In Progress to Analytics Backlog on the User-Elukey board.Feb 27 2018, 12:56 PM

elukey moved this task from Analytics Backlog to Stalled on the User-Elukey board.Mar 15 2018, 5:04 PM

Sadly after a long battle there seems not to be a good way to add prometheus for the hive/oozie jvms, we'll revisit later on when upgrading to newer cdh versions.

SG9tZVBoYWJyaWNhdG9yCk5vIG1lc3NhZ2VzLiBObyBub3RpZmljYXRpb25zLgoKICAgIFNlYXJjaAoKQ3JlYXRlIFRhc2sKTWFuaXBoZXN0ClQxOTcyODEKRml4IGZhaWxpbmcgd2VicmVxdWVzdCBob3VycyAodXBsb2FkIGFuZCB0ZXh0IDtyBDQy1CWS1TQSC3IEdQTApZb3VyIGJyb3dzZXIgdGltZXpvbmUgc2V0dGluZyBkaWZmZXJzIGZyb20gdGhlIHRpbWV6b25lIHNldHRpbmcgaW4geW91ciBwcm9maWxlLCBjbGljayB0byByZWNvbmNpbGUu

Aklapper raised the priority of this task from Lowest to Needs Triage.Jun 15 2018, 12:43 PM

Aklapper removed projects: HAWelcome, HHVM, Language-2018-Jan-Mar, Language-2018-Apr-June, Ladies-That-FOSS-MediaWiki, LabsDB-Auditor, Hashtags, Data-release, AbuseFilter, Scap (Scap3-Adoption-Phase2), Wikimedia-VE-Campaigns (S2-2018), acl*security, Analytics-Kanban.

Aklapper edited subscribers, added: Aklapper; removed: • 238482n375.

Aklapper assigned this task to elukey.Jun 15 2018, 12:45 PM

Aklapper added a project: Analytics-Kanban.

Change 469562 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hive: introduce HIVE_SERVER2_HADOOP_OPTS

https://gerrit.wikimedia.org/r/469562

Change 469562 merged by Elukey:
[operations/puppet/cdh@master] hive: introduce HIVE_SERVER2_HADOOP_OPTS

https://gerrit.wikimedia.org/r/469562

Change 469584 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hive: replace HADOOP_OPTS with HIVE_SERVER2_HADOOP_OPTS for hive-server2

https://gerrit.wikimedia.org/r/469584

Change 469584 merged by Elukey:
[operations/puppet/cdh@master] hive: replace HADOOP_OPTS with HIVE_SERVER2_HADOOP_OPTS for hive-server2

https://gerrit.wikimedia.org/r/469584

Change 469585 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster_coordinator: enable prometheus metrics for hive

https://gerrit.wikimedia.org/r/469585

Change 469585 merged by Elukey:
[operations/puppet@production] role::analytics_cluster_coordinator: enable prometheus metrics for hive

https://gerrit.wikimedia.org/r/469585

Hello Team,
Can anyone please help me how you guys has passed the java agent for hive metastore via hive-env.sh file ?

@Som_Marcin https://gerrit.wikimedia.org/r/c/operations/puppet/+/469585/2/hieradata/role/common/analytics_cluster/coordinator.yaml#29 should have the CLI opts that are passed to the JVM for the prometheus exporter.

@Ottomata Just wondering can we pass via hive-env.sh i am using apache-hive-2.3.7 tried below one but not working

export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -javaagent:/opt/hive/jmx_prometheus_javaagent-0.13.0.jar=8008:/opt/hive/jmx-exporter-config.yml"

Ah I get your question now. Just looked, it seems we export this to the metastore via the HIVE_METASTORE_HADOOP_OPTS variable.

export HIVE_METASTORE_HADOOP_OPTS="$HIVE_METASTORE_HADOOP_OPTS -Xms4g -Xmx4g -XX:+UseG1GC -XX:+UseStringDeduplication -XX:MaxGCPauseMillis=1000 -Djava.net.preferIPv4Stack=false -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=[::]:9183:/etc/prometheus/hive_metastore_jmx_exporter.yaml"

@Ottomata have you set any other java option getting this error: ?

Caused by: java.net.SocketException: Protocol family unavailable

Can't say I know what that is about. Just in case, don't just copy paste what I got there, you probably only want the javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=[::]:9183:/etc/prometheus/hive_metastore_jmx_exporter.yaml" (with the proper path to your jar and exporter config file.

yes i configured the right path seems to have some java issue.

@Ottomata Is it possible you can ping me the file link/content ?

Thanks @Ottomata for your help figure out the issue IPv6 was not supported made use of IPv4.

Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/OozieClosed, ResolvedPublicActions