Cassandra outage: restbase1009-a.eqiad.wmnet
Closed, Resolved (Public)

Description

restbase1009-a.eqiad.wmnet went down today (2016-09-28) at 14:07 as the result of an out-of-memory (OOM) error. The node appeared to be performing a slice query, and ran out of heap space while deserializing an incoming message from restbase1007-a.eqiad.wmnet.

[ ... ]

DEBUG [SharedPool-Worker-83] 2016-09-28 14:06:28,018 SliceQueryPager.java:92 - Querying next page of slice query; new filter: SliceQueryFilter [reversed=true, slices=[[0010bfb7d3d2858411e6bad660cd128a07fd01, ]], count=2, toGroup = 1]
DEBUG [SharedPool-Worker-17] 2016-09-28 14:06:28,019 AbstractQueryPager.java:95 - Fetched 1 live rows
DEBUG [SharedPool-Worker-17] 2016-09-28 14:06:28,019 AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
ERROR [MessagingService-Incoming-/10.64.0.230] 2016-09-28 14:07:13,828 CassandraDaemon.java:185 - Exception in thread Thread[MessagingService-Incoming-/10.64.0.230,5,main]
java.lang.OutOfMemoryError: Java heap space
    at org.apache.cassandra.net.CompactEndpointSerializationHelper.deserialize(CompactEndpointSerializationHelper.java:36) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.net.MessageIn.read(MessageIn.java:62) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:200) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:177) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:91) ~[apache-cassandra-2.2.6.jar:2.2.6]
DEBUG [SharedPool-Worker-6] 2016-09-28 14:07:13,832 FileCacheService.java:102 - Evicting cold readers for /srv/cassandra-a/data/local_group_wikimedia_T_parsoid_html/data-89cb8780f90411e492369fbfa298c4b0/la-10307-big-Data.db
DEBUG [SharedPool-Worker-6] 2016-09-28 14:07:13,832 FileCacheService.java:102 - Evicting cold readers for /srv/cassandra-a/data/local_group_wikipedia_T_mobileapps_remaining/data-3648aad08e0911e5878e89a54413a7f6/la-31152-big-Data.db
DEBUG [SharedPool-Worker-70] 2016-09-28 14:07:13,832 StorageProxy.java:1893 - Range slice timeout; received 0 of 1 responses for range 1 of 1

[ ... ]

NOTE: The node came back up at 14:16; it was automatically restarted by Puppet.
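
The peer involved in the failure can be read off the name of the thread that raised the error (`MessagingService-Incoming-/10.64.0.230` in the excerpt above). As a rough sketch only, and not something that was run as part of this investigation, a scan of a Cassandra log for this pattern could look like the following. The function name `find_oom_peers` and the regular expressions are illustrative assumptions, and the input is assumed to be raw log lines in the format Cassandra wrote them here.

```python
import re

# Log entry header, e.g.:
# ERROR [MessagingService-Incoming-/10.64.0.230] 2016-09-28 14:07:13,828 CassandraDaemon.java:185 - ...
HEADER = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s"
)

# Thread name used for inbound inter-node connections in the excerpt above.
PEER = re.compile(r"^MessagingService-Incoming-/(?P<peer>\S+)$")


def find_oom_peers(lines):
    """Yield (timestamp, peer) for each OutOfMemoryError logged on an
    incoming-messaging thread. Hypothetical helper, for illustration only."""
    last_header = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            # Remember the most recent entry; the exception text follows it.
            last_header = header
            continue
        if "java.lang.OutOfMemoryError" in line and last_header is not None:
            peer = PEER.match(last_header.group("thread"))
            if peer:
                yield last_header.group("ts"), peer.group("peer")
```

Called as `list(find_oom_peers(open(path)))` with `path` pointing at the node's log file, this would list each such OOM together with the address of the sending node.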

Event Timeline

Eevans triaged this task as Medium priority. (Sep 28 2016, 2:43 PM)
Eevans updated the task description.
Eevans renamed this task from Cassandra outage: restbase1009-a.eqiadw.wmnet to Cassandra outage: restbase1009-a.eqiad.wmnet. (Sep 28 2016, 8:04 PM)

From the information available, the only conclusion I can come to is that this was a query that tripped over a very wide row, something we are working to address in T94121: Understand and solve wide row issues for frequently edited and re-rendered pages.

Closing this issue.