While working on T177458 we discovered that Hive and Oozie have some bugs in the way they handle JVM parameters that prevent the JMX agents from working properly. We need to figure out the root causes, fix them and finally enable the agents on analytics1003.
Oozie
For Oozie I keep seeing the following exception in catalina.out:
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:382)
    at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:397)
Caused by: java.lang.IllegalArgumentException: Collector already registered that provides name: jmx_scrape_duration_seconds
    at io.prometheus.jmx.shaded.io.prometheus.client.CollectorRegistry.register(CollectorRegistry.java:54)
    at io.prometheus.jmx.shaded.io.prometheus.client.Collector.register(Collector.java:128)
    at io.prometheus.jmx.shaded.io.prometheus.client.Collector.register(Collector.java:121)
    at io.prometheus.jmx.shaded.io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:36)
    ... 6 more
FATAL ERROR in native method: processing of -javaagent failed
That is probably related to this duplication of JVM parameters:
Jan 10 14:19:36 hadoop-coordinator-1 oozie[7611]: Using CATALINA_OPTS: -Doozie.https.port=11443 -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:12000:/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml -Doozie.https.port=11443 -Doozie.https.keystore.pass=password -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:12000:/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml -Djava.util.logging.config.file=/etc/oozie/logging.properties -Dderby.stream.error.file=/var/log/oozie/derby.log

I haven't found any open/closed bug, but something doesn't sound right in either our config or Oozie's internal one.
It seems to be an issue in the init.d script, but we are not sure where the bug is.
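The agent's premain() registers its collectors in a static registry, so when the -javaagent flag ends up twice on the command line the second registration fails with the "Collector already registered" error above. A minimal sketch of a possible workaround, assuming the duplication comes from CATALINA_OPTS being assembled in two places (the exact file where this guard would belong still has to be identified, so treat the paths and variable names below as illustrative only):

#!/bin/bash
# Hypothetical sketch, NOT the actual oozie init.d/env file content:
# append the jmx exporter -javaagent flag only if it is not already present,
# so that sourcing the env file twice does not duplicate it and the agent's
# premain() only runs once.

JMX_AGENT="-javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:12000:/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml"

case " ${CATALINA_OPTS:-} " in
  *jmx_prometheus_javaagent*)
    # Already present (e.g. the env file was sourced both by the init script
    # and by the oozie startup wrapper); do not append it a second time.
    ;;
  *)
    CATALINA_OPTS="${CATALINA_OPTS:-} ${JMX_AGENT}"
    ;;
esac
export CATALINA_OPTS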
Hive
The Hive server/metastore issue is more subtle: everything starts and the JMX agent returns metrics correctly, but the daemons do not bind to their ports (so they are not actually working):
hive 13423 7.9 1.4 2744856 119332 ? Sl 16:11 0:02 /usr/lib/jvm/default-java/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-server2.log -Dhive.log.threshold=INFO -Xmx2048m -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=10.68.19.233:12000:/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml -Dcom.sun.management.jmxremote.port=9978 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.VersionInfo

elukey@hadoop-coordinator-1:~$ curl 10.68.19.233:12000/metrics -s | grep -v "#" | sort
jmx_config_reload_failure_total 0.0
jmx_config_reload_success_total 0.0
jmx_scrape_duration_seconds 0.00108968
jmx_scrape_error 0.0
jvm_classes_loaded 1858.0
jvm_classes_loaded_total 1858.0
jvm_classes_unloaded_total 0.0
jvm_gc_collection_seconds_count{gc="PS MarkSweep",} 0.0
jvm_gc_collection_seconds_count{gc="PS Scavenge",} 2.0
jvm_gc_collection_seconds_sum{gc="PS MarkSweep",} 0.0
jvm_gc_collection_seconds_sum{gc="PS Scavenge",} 0.027
jvm_info{version="1.7.0_151-b01",vendor="Oracle Corporation",} 1.0
jvm_memory_bytes_committed{area="heap",} 1.23731968E8
jvm_memory_bytes_committed{area="nonheap",} 2.4576E7
jvm_memory_bytes_max{area="heap",} 1.908932608E9
jvm_memory_bytes_max{area="nonheap",} 2.24395264E8
jvm_memory_bytes_used{area="heap",} 1.6198616E7
jvm_memory_bytes_used{area="nonheap",} 1.193108E7
jvm_memory_pool_bytes_committed{pool="Code Cache",} 2555904.0
jvm_memory_pool_bytes_committed{pool="PS Eden Space",} 3.3554432E7
jvm_memory_pool_bytes_committed{pool="PS Old Gen",} 8.7031808E7
jvm_memory_pool_bytes_committed{pool="PS Perm Gen",} 2.2020096E7
jvm_memory_pool_bytes_committed{pool="PS Survivor Space",} 3145728.0
jvm_memory_pool_bytes_max{pool="Code Cache",} 5.0331648E7
jvm_memory_pool_bytes_max{pool="PS Eden Space",} 7.077888E8
jvm_memory_pool_bytes_max{pool="PS Old Gen",} 1.43130624E9
jvm_memory_pool_bytes_max{pool="PS Perm Gen",} 1.74063616E8
jvm_memory_pool_bytes_max{pool="PS Survivor Space",} 3145728.0
jvm_memory_pool_bytes_used{pool="Code Cache",} 689024.0
jvm_memory_pool_bytes_used{pool="PS Eden Space",} 1.333768E7
jvm_memory_pool_bytes_used{pool="PS Old Gen",} 8192.0
jvm_memory_pool_bytes_used{pool="PS Perm Gen",} 1.1242056E7
jvm_memory_pool_bytes_used{pool="PS Survivor Space",} 2852744.0
jvm_threads_current 15.0
jvm_threads_daemon 10.0
jvm_threads_deadlocked 0.0
jvm_threads_deadlocked_monitor 0.0
jvm_threads_peak 15.0
jvm_threads_started_total 18.0
process_cpu_seconds_total 2.98
process_max_fds 65536.0
process_open_fds 433.0
process_resident_memory_bytes 1.2312576E8
process_start_time_seconds 1.515600666772E9
process_virtual_memory_bytes 2.812837888E9
There is no indication in the logs/journalctl/syslog of what is happening, and systemctl status hive* returns all green.
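Since systemd thinks the units are healthy, a useful first check could be to see what is actually bound: which JVM owns the exporter port 12000, and whether the Hive daemons listen on their service ports at all. A rough diagnostic sketch, assuming the default HiveServer2/metastore ports (10000/9083) and hive-server2/hive-metastore unit names, both of which may differ in this cluster:

#!/bin/bash
# Which process actually owns the jmx exporter port?
sudo ss -tlnp | grep ':12000'

# Are the Hive daemons bound to their service ports at all?
# (10000 = HiveServer2 default, 9083 = metastore default; adjust if our
# hive-site.xml overrides them.)
sudo ss -tlnp | grep -E ':(10000|9083)\b' || echo "HiveServer2/metastore ports not bound"

# Cross-check which JVM the exporter is reporting on: compare its start time
# with the main PIDs systemd tracks for the hive units (unit names assumed).
curl -s 10.68.19.233:12000/metrics | grep process_start_time_seconds
systemctl show -p MainPID hive-server2 hive-metastore

If the PID serving port 12000 turns out not to be the HiveServer2/metastore main PID, that would point at the agent flag being injected into other JVMs spawned by the wrapper scripts rather than (or in addition to) the daemons themselves.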