We are seeing a lot of jobs failing with the following error:
_1485458133961_46156_r_000000_0, java.io.IOException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: aqs1009-a.eqiad.wmnet/10.64.48.122:9042 (com.datastax.driver.core.TransportException: [aqs1009-a.eqiad.wmnet/10.64.48.122:9042] Cannot connect))
    at org.wikimedia.analytics.refinery.cassandra.CqlRecordWriter$RangeClient.attempt_connect(CqlRecordWriter.java:365)
    at org.wikimedia.analytics.refinery.cassandra.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:332)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: aqs1009-a.eqiad.wmnet/10.64.48.122:9042 (com.datastax.driver.core.TransportException: [aqs1009-a.eqiad.wmnet/10.64.48.122:9042] Cannot connect))
    at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:229)
    at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:84)
    at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1269)
    at com.datastax.driver.core.Cluster.init(Cluster.java:158)
    at com.datastax.driver.core.Cluster.connect(Cluster.java:248)
    at org.wikimedia.analytics.refinery.cassandra.CqlRecordWriter$RangeClient.attempt_connect(CqlRecordWriter.java:345)
    ... 1 more
This might be due to the new aqs1009-a Cassandra instance, which finished bootstrapping last night, combined with the missing network ACLs for it on the routers (T157435).
It is not clear to me why aqs1009-a has suddenly been picked up.
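
One way to check the ACL hypothesis would be a plain TCP connection test against port 9042 (the port shown in the error) from one of the hosts running the failing jobs. The following is only a minimal sketch, assuming Python 3 is available there; `can_connect` is a hypothetical helper written for illustration and is not part of refinery. It does not distinguish between a timeout (packets dropped, which is what a missing ACL would likely look like) and a refused connection (nothing listening on the port).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only. Any socket-level failure
    (timeout, refusal, DNS resolution error) is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and port are taken from the error message above.
    host, port = "aqs1009-a.eqiad.wmnet", 9042
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

If this reports the new instance as unreachable while the older aqs instances are reachable from the same host, that would be consistent with the missing ACLs rather than with a problem in the job itself.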