Cassandra loading jobs are causing stale Pageview data
Closed, Resolved · Public

Assigned To

Authored By

	elukey
	Feb 8 2017, 8:44 AM

Description

We are seeing a lot of jobs failing with the following error:

_1485458133961_46156_r_000000_0, java.io.IOException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: aqs1009-a.eqiad.wmnet/10.64.48.122:9042 (com.datastax.driver.core.TransportException: [aqs1009-a.eqiad.wmnet/10.64.48.122:9042] Cannot connect))
	at org.wikimedia.analytics.refinery.cassandra.CqlRecordWriter$RangeClient.attempt_connect(CqlRecordWriter.java:365)
	at org.wikimedia.analytics.refinery.cassandra.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:332)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: aqs1009-a.eqiad.wmnet/10.64.48.122:9042 (com.datastax.driver.core.TransportException: [aqs1009-a.eqiad.wmnet/10.64.48.122:9042] Cannot connect))
	at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:229)
	at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:84)
	at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1269)
	at com.datastax.driver.core.Cluster.init(Cluster.java:158)
	at com.datastax.driver.core.Cluster.connect(Cluster.java:248)
	at org.wikimedia.analytics.refinery.cassandra.CqlRecordWriter$RangeClient.attempt_connect(CqlRecordWriter.java:345)
	... 1 more

This might be due to the new aqs1009-a Cassandra instance, which finished bootstrapping last night, combined with the absence of the related network ACLs on the routers (T157435).

It is not clear to me why all of a sudden aqs1009-a has been picked up.
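
One way to check the missing-ACL hypothesis is to test plain TCP reachability of the Cassandra native port from a node that runs the loading jobs. The following is a minimal Python sketch, not part of refinery: the helper name `can_connect` and the 3-second timeout are illustrative assumptions, and only the host and port come from the error above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (for example packets dropped
        # by a router ACL) and name-resolution failures.
        return False


if __name__ == "__main__":
    # Host and port taken from the stack trace; run this from a node that
    # executes the loading job, since ACLs depend on the source network.
    host, port = "aqs1009-a.eqiad.wmnet", 9042
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

If the same check succeeds against the older AQS instances but fails against aqs1009-a, that would point at the network path rather than at Cassandra itself.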

Related Objects

Mentioned In: T157806: Review the Analytics Firewall rules on cr1/cr2
T157435: Review ACLs for the Analytics VLAN
Mentioned Here: T157435: Review ACLs for the Analytics VLAN

Event Timeline

elukey created this task. (Feb 8 2017, 8:44 AM)

Restricted Application added a subscriber: Aklapper. (Feb 8 2017, 8:44 AM)

elukey triaged this task as High priority. (Feb 8 2017, 8:45 AM)

elukey updated the task description. (Feb 8 2017, 9:24 AM)

To be on the safe side, we are going to wait for the network operations experts before changing the ACLs on the routers, since this is not a critical outage.

Network rules added, all jobs restarted and proceeding normally.

elukey mentioned this in T157435: Review ACLs for the Analytics VLAN. (Feb 8 2017, 5:51 PM)

elukey mentioned this in T157806: Review the Analytics Firewall rules on cr1/cr2. (Feb 10 2017, 2:44 PM)

Aklapper removed a project: Analytics. (Jul 4 2020, 7:59 AM)
