restbase1009-a.eqiad.wmnet went down today (2016-09-28) at 14:07 as the result of an OOM (`java.lang.OutOfMemoryError: Java heap space`). The node appeared to be performing a slice query, and ran out of heap while deserializing an incoming message from restbase1007-a.eqiad.wmnet.

    [ ... ]

    DEBUG [SharedPool-Worker-83] 2016-09-28 14:06:28,018 SliceQueryPager.java:92 - Querying next page of slice query; new filter: SliceQueryFilter [reversed=true, slices=[[0010bfb7d3d2858411e6bad660cd128a07fd01, ]], count=2, toGroup = 1]
    DEBUG [SharedPool-Worker-17] 2016-09-28 14:06:28,019 AbstractQueryPager.java:95 - Fetched 1 live rows
    DEBUG [SharedPool-Worker-17] 2016-09-28 14:06:28,019 AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
    ERROR [MessagingService-Incoming-/10.64.0.230] 2016-09-28 14:07:13,828 CassandraDaemon.java:185 - Exception in thread Thread[MessagingService-Incoming-/10.64.0.230,5,main]
    java.lang.OutOfMemoryError: Java heap space
        at org.apache.cassandra.net.CompactEndpointSerializationHelper.deserialize(CompactEndpointSerializationHelper.java:36) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:62) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:200) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:177) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:91) ~[apache-cassandra-2.2.6.jar:2.2.6]
    DEBUG [SharedPool-Worker-6] 2016-09-28 14:07:13,832 FileCacheService.java:102 - Evicting cold readers for /srv/cassandra-a/data/local_group_wikimedia_T_parsoid_html/data-89cb8780f90411e492369fbfa298c4b0/la-10307-big-Data.db
    DEBUG [SharedPool-Worker-6] 2016-09-28 14:07:13,832 FileCacheService.java:102 - Evicting cold readers for /srv/cassandra-a/data/local_group_wikipedia_T_mobileapps_remaining/data-3648aad08e0911e5878e89a54413a7f6/la-31152-big-Data.db
    DEBUG [SharedPool-Worker-70] 2016-09-28 14:07:13,832 StorageProxy.java:1893 - Range slice timeout; received 0 of 1 responses for range 1 of 1

    [ ... ]

NOTE: The node came back up at 14:16; it was automatically restarted by Puppet.
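
For anyone wanting to look for similar events on other nodes, here is a minimal Python sketch that scans log lines in the layout quoted in this report, picks out `ERROR` entries, and reports how long the log was silent before each one, plus the peer address when the thread name has the `MessagingService-Incoming-/<ip>` form. The function names (`parse_line`, `find_errors`) and the regular expressions are hypothetical helpers written for this illustration, not part of Cassandra or of any tooling used here, and the line layout is assumed from the excerpt rather than from a documented format.

```python
import re
from datetime import datetime

# Assumed layout, taken from the log lines quoted in this report:
# LEVEL [thread] YYYY-MM-DD HH:MM:SS,mmm File.java:NN - message
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<source>\S+)\s+-\s+(?P<message>.*)$"
)
# Peer address embedded in inbound messaging thread names.
PEER_RE = re.compile(r"MessagingService-Incoming-/(\d+\.\d+\.\d+\.\d+)")


def parse_line(line):
    """Parse one log line; return a dict, or None for untimestamped lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return entry


def find_errors(lines):
    """Return ERROR entries with the silent gap before each and the peer IP."""
    results = []
    previous = None
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # stack-trace lines and elisions carry no timestamp
        if entry["level"] == "ERROR":
            gap = None
            if previous is not None:
                gap = (entry["ts"] - previous["ts"]).total_seconds()
            peer = PEER_RE.search(entry["thread"])
            results.append({
                "ts": entry["ts"],
                "source": entry["source"],
                "peer": peer.group(1) if peer else None,
                "gap_seconds": gap,
            })
        previous = entry
    return results


# Three lines copied from the excerpt in this report.
SAMPLE = [
    "DEBUG [SharedPool-Worker-17] 2016-09-28 14:06:28,019 AbstractQueryPager.java:133 - Remaining rows to page: 2147483646",
    "ERROR [MessagingService-Incoming-/10.64.0.230] 2016-09-28 14:07:13,828 CassandraDaemon.java:185 - Exception in thread Thread[MessagingService-Incoming-/10.64.0.230,5,main]",
    "java.lang.OutOfMemoryError: Java heap space",
]

if __name__ == "__main__":
    for error in find_errors(SAMPLE):
        print(error["ts"], error["source"], error["peer"], error["gap_seconds"])
```

Run against the lines quoted in this report, it flags the single `ERROR` entry from `CassandraDaemon.java:185` and shows that the log was silent for roughly 45.8 seconds before it. A long silent gap right before an OOM is only a hint (for example of heavy GC activity), not a diagnosis.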