
Cassandra loading jobs are causing stale Pageview data
Closed, Resolved, Public

Description

We are seeing a lot of jobs failing with the following error:

_1485458133961_46156_r_000000_0, java.io.IOException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: aqs1009-a.eqiad.wmnet/10.64.48.122:9042 (com.datastax.driver.core.TransportException: [aqs1009-a.eqiad.wmnet/10.64.48.122:9042] Cannot connect))
	at org.wikimedia.analytics.refinery.cassandra.CqlRecordWriter$RangeClient.attempt_connect(CqlRecordWriter.java:365)
	at org.wikimedia.analytics.refinery.cassandra.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:332)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: aqs1009-a.eqiad.wmnet/10.64.48.122:9042 (com.datastax.driver.core.TransportException: [aqs1009-a.eqiad.wmnet/10.64.48.122:9042] Cannot connect))
	at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:229)
	at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:84)
	at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1269)
	at com.datastax.driver.core.Cluster.init(Cluster.java:158)
	at com.datastax.driver.core.Cluster.connect(Cluster.java:248)
	at org.wikimedia.analytics.refinery.cassandra.CqlRecordWriter$RangeClient.attempt_connect(CqlRecordWriter.java:345)
	... 1 more

This might be due to the new aqs1009-a Cassandra instance, which finished bootstrapping overnight, combined with the missing network ACLs for it on the routers (T157435).

It is not clear to me why all of a sudden aqs1009-a has been picked up.
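
A quick way to test the ACL hypothesis would be to check, from one of the hosts running the loading jobs, whether a plain TCP connection to the Cassandra native port can be opened at all. The snippet below is only a minimal sketch: the helper name `can_connect` and the 3-second timeout are assumptions made for illustration, while the host and port are the ones shown in the stack trace. A refused or timed-out connection would be consistent with missing ACLs, but it does not rule out other causes.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Host and port taken from the NoHostAvailableException in the description.
    host, port = "aqs1009-a.eqiad.wmnet", 9042
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

Running the same check against an instance the jobs can already reach gives a baseline to compare against.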

Event Timeline

elukey triaged this task as High priority. Feb 8 2017, 8:45 AM

To be on the safe side, we are going to wait for the network operations experts before changing the ACLs on the routers, since this is not a critical outage.

Network rules added, all jobs restarted and proceeding normally.