During the current OpsWeek, I have noticed repeated network issues on an-launcher1002.eqiad.wmnet across multiple use cases.
Here are the detected symptoms:
Job: refine_event_sanitized_analytics_immediate
Occurrences during my OpsWeek, per email timestamps (ET):
Aug 21, 3:14 AM
Aug 24, 3:07 AM
Aug 26, 4:14 AM
There is, however, email evidence of the same stack trace since at least Thu, Jul 17, 9:29 AM ET.
Stack trace:
25/08/26 08:14:43 INFO RetryInvocationHandler: java.net.ConnectException: Call From an-launcher1002/10.64.21.109 to an-master1004.eqiad.wmnet:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getClusterMetrics over an-master1004-eqiad-wmnet after 449 failover attempts. Trying to failover after sleeping for 2022ms.
25/08/26 08:14:45 INFO ConfiguredRMFailoverProxyProvider: Failing over to an-master1003-eqiad-wmnet
Exception in thread "main" java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "an-master1003.eqiad.wmnet":8032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost
at sun.reflect.GeneratedConstructorAccessor1.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:768)
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:449)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1552)
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy7.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:271)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy8.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:605)
at org.apache.spark.deploy.yarn.Client.$anonfun$submitApplication$1(Client.scala:179)
at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:65)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:179)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1227)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1634)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:450)
... 31 more
We have also seen this in recent sqoop failures.
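
The trace shows two distinct failure modes: "Connection refused" against an-master1004 and UnknownHostException against an-master1003. A probe along the following lines, run from an-launcher1002, might help tell intermittent DNS resolution failures apart from TCP-level failures. This is a minimal sketch and was not run as part of this report; the function names (probe, classify, watch) are hypothetical, and the hostnames and port 8032 are simply the ones that appear in the stack trace.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Try one DNS resolution and one TCP connect; return (dns_ok, tcp_ok, detail)."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Name resolution failed: roughly what surfaces in Java as UnknownHostException.
        return False, False, f"DNS failure: {exc}"
    family, socktype, proto, _, sockaddr = infos[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:
        # Name resolved but the connect failed, e.g. "Connection refused" or a timeout.
        return True, False, f"connect failure: {exc}"
    return True, True, f"connected to {sockaddr[0]}"


def classify(dns_ok, tcp_ok):
    """Map a probe result to a short label for counting."""
    if not dns_ok:
        return "unknown_host"
    if not tcp_ok:
        return "connect_failed"
    return "ok"


def watch(hosts, port, attempts=10, interval=5.0):
    """Probe each host repeatedly, print one line per attempt, return outcome counts."""
    counts = {}
    for _ in range(attempts):
        for host in hosts:
            dns_ok, tcp_ok, detail = probe(host, port)
            outcome = classify(dns_ok, tcp_ok)
            counts[(host, outcome)] = counts.get((host, outcome), 0) + 1
            print(time.strftime("%Y-%m-%dT%H:%M:%S"), host, outcome, detail)
        time.sleep(interval)
    return counts
```

For example, watch(["an-master1003.eqiad.wmnet", "an-master1004.eqiad.wmnet"], 8032) would log one line per host per attempt. A "connect_failed" result on one of the two masters may not indicate a fault by itself, since the failover messages in the log suggest a ResourceManager HA pair in which one node may legitimately refuse client connections; "unknown_host" results would point at name resolution on the launcher.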
