an-coord1001 hive metastore not listening on ipv6
Closed, Resolved · Public · 5 Estimated Story Points

Description

Hive metastore should be listening for Thrift connections on port 9083. When pointing Airflow at an-coord1001.eqiad.wmnet:9083, it errors out with connection refused, because DNS returned an IPv6 address but the metastore appears to be listening only on IPv4. For the moment I've hardcoded an-coord1001's IPv4 address into the Airflow config, but ideally Hive should listen on both.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Interesting, I had never realized this. The Hive daemons are running with -Djava.net.preferIPv4Stack=true, probably like all the other Hadoop daemons (see T225296#5295016). We can try setting -Djava.net.preferIPv4Stack=false in Hadoop testing and see how it goes.

I doubt we want to prefer IPv6 (do we?), but maybe we can make Hive listen on both IPs?

Change 556198 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::coordinator: use IPv6 in Hive

https://gerrit.wikimedia.org/r/556198

Change 556198 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::coordinator: use IPv6 in Hive

https://gerrit.wikimedia.org/r/556198

Very interesting:

/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx256m -Xms4g -Xmx10g -Xms4g -Xmx10g -Djava.net.preferIPv4Stack=false -Dcom.sun.management.jmxremote.port=9979 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.util.logging.config.file=/etc/hive/conf.analytics-test-hadoop/java-logging.properties -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/lib/hive/lib/hive-service-1.1.0-cdh5.16.1.jar org.apache.hadoop.hive.metastore.HiveMetaStore

As you can see, adding -Djava.net.preferIPv4Stack=false in hive-env.sh did not place it last on the command line, so it is overridden by the -Djava.net.preferIPv4Stack=true that appears later. I was expecting to find the =true occurrence in /usr/lib/hive, but I was wrong. Will need to do some more research :)
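
To see the ordering without reading the whole command line by eye, something like the following could help. It is only a sketch: `list_jvm_prop` is a hypothetical helper name, it assumes a Linux `/proc/<pid>/cmdline` file (NUL-separated arguments), and it relies on the behaviour observed above, namely that the last occurrence of a `-D` property is the one that takes effect.

```shell
#!/bin/sh
# list_jvm_prop (hypothetical helper): print every -D<property>= argument,
# in command-line order, from a NUL-separated cmdline file such as
# /proc/<pid>/cmdline on Linux.
list_jvm_prop() {
    cmdline_file="$1"
    prop="$2"
    # Turn the NUL separators into newlines, then keep matching arguments.
    tr '\0' '\n' < "$cmdline_file" | grep -F -- "-D${prop}="
}

# Example against a running metastore (the PID lookup is an assumption):
#   pid=$(pgrep -f org.apache.hadoop.hive.metastore.HiveMetaStore | head -n 1)
#   list_jvm_prop "/proc/$pid/cmdline" java.net.preferIPv4Stack | tail -n 1
```

Piping the result through `tail -n 1` shows only the final occurrence, which is the value expected to win.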

OK, I know what happens. This is the chain of events:

  1. /etc/init.d/hive-metastore eventually calls /usr/lib/hive/bin/ext/metastore.sh
  2. the file contains
export HADOOP_OPTS="$HIVE_METASTORE_HADOOP_OPTS $HADOOP_OPTS"
exec $HADOOP jar $JAR $CLASS "$@"
  3. The $HADOOP var is /usr/lib/hadoop/bin/hadoop, which (eventually) calls /usr/lib/hadoop/libexec/hadoop-config.sh, which contains
# Disable ipv6 as it can cause issues
HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

For the other Hadoop daemons I worked around the issue by simply adding -Djava.net.preferIPv4Stack=false, which was appended at the end and therefore overrode the default. In the Hive case, the -Djava.net.preferIPv4Stack=false option ends up before the =true one added by Hadoop, so it has no effect. I'll try to find a trick in puppet to comment out the line in /usr/lib/hadoop/libexec/hadoop-config.sh; there is nothing Hive-specific against IPv6.
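
The chain of events above can be modelled in a few lines of shell. This is a simplified sketch, not the real scripts: `assemble_hadoop_opts` and `last_prefer_ipv4` are hypothetical helpers, and it assumes the Hive-side override reaches metastore.sh through HIVE_METASTORE_HADOOP_OPTS.

```shell
#!/bin/sh
# Simplified model of the option assembly described above; it does not
# source the real Hive or Hadoop scripts.

# Assumption: the Hive-side override arrives via HIVE_METASTORE_HADOOP_OPTS.
HIVE_METASTORE_HADOOP_OPTS="-Djava.net.preferIPv4Stack=false"

# assemble_hadoop_opts (hypothetical): the argument says whether
# hadoop-config.sh still contains the preferIPv4Stack=true line.
assemble_hadoop_opts() {
    hadoop_config_forces_ipv4="$1"
    HADOOP_OPTS=""
    # metastore.sh: Hive options are placed in front of HADOOP_OPTS.
    HADOOP_OPTS="$HIVE_METASTORE_HADOOP_OPTS $HADOOP_OPTS"
    # hadoop-config.sh: the flag is appended, so it lands after Hive's.
    if [ "$hadoop_config_forces_ipv4" = "yes" ]; then
        HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
    fi
    printf '%s\n' "$HADOOP_OPTS"
}

# last_prefer_ipv4 (hypothetical): print the value of the final
# preferIPv4Stack flag in a space-separated option string.
last_prefer_ipv4() {
    last=""
    # Unquoted on purpose: split the option string into words.
    for opt in $1; do
        case "$opt" in
            -Djava.net.preferIPv4Stack=*) last="${opt#*=}" ;;
        esac
    done
    printf '%s\n' "$last"
}

echo "stock hadoop-config.sh: $(last_prefer_ipv4 "$(assemble_hadoop_opts yes)")"
echo "line commented out:     $(last_prefer_ipv4 "$(assemble_hadoop_opts no)")"
```

With the stock line in place the final value is `true`; once the line is gone, Hive's `false` is the only occurrence left. That is why removing the line from hadoop-config.sh looks like the simplest fix.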

Change 556337 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cdh::hadoop: remove ipv6 constraints

https://gerrit.wikimedia.org/r/556337

Change 556337 merged by Elukey:
[operations/puppet@production] cdh::hadoop: remove ipv6 constraints

https://gerrit.wikimedia.org/r/556337

Change 556633 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cdh::hadoop: replace augeas with a file resource

https://gerrit.wikimedia.org/r/556633

Change 556633 merged by Elukey:
[operations/puppet@production] cdh::hadoop: replace augeas with a file resource

https://gerrit.wikimedia.org/r/556633

Change 556641 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: remove ipv6 constraint workaround

https://gerrit.wikimedia.org/r/556641

Change 556641 merged by Elukey:
[operations/puppet@production] hadoop: remove ipv6 constraint workaround

https://gerrit.wikimedia.org/r/556641

Looks better now!

elukey@stat1004:~$ telnet an-coord1001.eqiad.wmnet 9083
Trying 2620:0:861:105:10:64:21:104...
Connected to an-coord1001.eqiad.wmnet.
Escape character is '^]'.
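
A rough way to repeat this check for every address the name resolves to, rather than only the one telnet happens to pick first. This is a sketch under assumptions: `addr_family` and `check_port` are hypothetical helpers, `getent ahosts` is the glibc resolver front end, and the installed `nc` is assumed to support `-z` and `-w`.

```shell
#!/bin/sh
# addr_family (hypothetical): classify a literal IP address by its notation.
addr_family() {
    case "$1" in
        *:*) echo ipv6 ;;
        *.*) echo ipv4 ;;
        *)   echo unknown ;;
    esac
}

# check_port (hypothetical): resolve the host, then try a TCP connection
# to each distinct address and report the result per address family.
check_port() {
    host="$1"
    port="$2"
    getent ahosts "$host" | awk '{print $1}' | sort -u |
    while read -r addr; do
        if nc -z -w 3 "$addr" "$port" 2>/dev/null; then
            echo "$(addr_family "$addr") $addr: open"
        else
            echo "$(addr_family "$addr") $addr: refused or unreachable"
        fi
    done
}

# Example (needs network access to the host):
#   check_port an-coord1001.eqiad.wmnet 9083
```

If the metastore listens on both stacks, both the ipv4 and the ipv6 lines should report the port as open.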

@EBernhardson can you re-check and confirm the fix?

Mentioned in SAL (#wikimedia-analytics) [2019-12-12T12:59:25Z] <elukey> roll restart hadoop workers to pick up the new settings (removed prefer ipv4 false after T240255)

elukey triaged this task as Medium priority.
elukey set the point value for this task to 5. (Dec 13 2019, 7:21 AM)
elukey moved this task from In Code Review to Done on the Analytics-Kanban board.

Change 583631 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cdh::hadoop: allow hadoop daemons to override ipv6 settings

https://gerrit.wikimedia.org/r/583631

Change 583631 merged by Elukey:
[operations/puppet@production] cdh::hadoop: allow hadoop daemons to override ipv6 settings

https://gerrit.wikimedia.org/r/583631