
Log input from cassandra caused logstash process to crash repeatedly
Closed, Invalid (Public)

Description

[05:21] <icinga-wm> PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash

Logs on logstash1003 showed that the process had died from heap exhaustion:

Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx##m to increase it (## = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

Network traffic to the logstash cluster was up, as was outbound traffic from the restbase cluster:

[Attached: graph-1.png, graph-2.png, graph-3.png]

Event Timeline

bd808 raised the priority of this task to Needs Triage.
bd808 updated the task description.
bd808 added a project: Wikimedia-Logstash.
bd808 subscribed.

The logstash process would crash each time it was restarted during the elevated traffic period. Once the traffic died down it was able to stay up.

That seems to coincide with a load test / scrub on the Cassandra staging cluster; the Cassandra logs on cerium show a ton of these:

INFO  [RMI TCP Connection(909)-10.64.16.147] 2016-02-11 05:58:49,561 ColumnFamilyStore.java:1929 - Spinning trying to capture all readers [
  SSTableReader(path='/srv/cassandra-a/data/local_group_wikipedia_T_parsoid_html/data-f3c38310f28b11e4852cad59d68785c5/local_group_wikipedia_T_parsoid_html-data-ka-50907-Data.db'),
  SSTableReader(path='/srv/cassandra-a/data/local_group_wikipedia_T_parsoid_html/data-f3c38310f28b11e4852cad59d68785c5/local_group_wikipedia_T_parsoid_html-data-ka-50842-Data.db'),
  SSTableReader(path='/srv/cassandra-a/data/local_group_wikipedia_T_parsoid_html/data-f3c38310f28b11e4852cad59d68785c5/local_group_wikipedia_T_parsoid_html-data-ka-50815-Data.db'),
  ...

These messages filled the log rotation:

-rw-r--r--  1 cassandra cassandra  1116039 Feb 11 05:58 system-a.log.21.zip
...
-rw-r--r--  1 cassandra cassandra  1116286 Feb 11 05:58 system-a.log.1.zip

cc @Eevans @GWicke @mobrovac

From the Logstash side, the part that is disturbing to me is that the service could be DoS'ed by one client.

This looks a lot like https://issues.apache.org/jira/browse/CASSANDRA-10829, which is fixed in 2.1.13.

Let's consider upgrading to 2.1.13, or possibly even 2.2.5.

> From the Logstash side, the part that is disturbing to me is that the service could be DoS'ed by one client.

Indeed. I wondered if Logstash could support rate limiting; it seems it does, with a recipe at https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/338#issuecomment-176521668 that combines the throttle and sleep filters (a sketch of that approach is below).
Of course, it might be the case that Logstash or the JVM was crashing simply upon encountering large payloads; that is much harder to rate limit on the Logstash side, so perhaps the producer side could also be rate limited.
Another observation: since this is all UDP, as a last-resort protection we could rate limit at the iptables level on a per-source IP/port basis.
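
A rough sketch of how that throttle-plus-sleep recipe could look as a Logstash filter block; the %{host} key, the counts, and the period below are illustrative placeholders rather than tested values:

filter {
  # Tag any single source that sends more than ~1000 events in a 60s window
  throttle {
    key         => "%{host}"
    after_count => 1000
    period      => "60"
    add_tag     => ["throttled"]
  }
  # When throttled events appear, slow the pipeline down instead of letting
  # it chew through the heap; a drop filter here would shed load instead
  if "throttled" in [tags] {
    sleep {
      time  => "1"
      every => 100
    }
  }
}

For UDP inputs, slowing the pipeline should push the overflow back into the kernel socket buffer, where excess datagrams get dropped instead of piling up on the Logstash heap.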

I did a quick search to see if logback (framework used by Cassandra) has any in-built rate-limiting functionality, and turned up nothing. We could implement a rate-limiter rather trivially though by banging together a Filter implementation and a Guava rate limiter. It wouldn't be much code, but of course it would be something else to maintain, deploy, etc.
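
A minimal sketch of what that could look like, assuming logback's Filter API and Guava's RateLimiter; the class name and the 100-events-per-second default are made up for illustration:

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.filter.Filter;
import ch.qos.logback.core.spi.FilterReply;
import com.google.common.util.concurrent.RateLimiter;

// Hypothetical appender filter: lets events through until the per-second
// budget is spent, then drops the rest.
public class RateLimitingFilter extends Filter<ILoggingEvent> {

    // Settable from logback.xml via <eventsPerSecond>...</eventsPerSecond>
    private double eventsPerSecond = 100.0;
    private RateLimiter limiter;

    public void setEventsPerSecond(double eventsPerSecond) {
        this.eventsPerSecond = eventsPerSecond;
    }

    @Override
    public void start() {
        limiter = RateLimiter.create(eventsPerSecond);
        super.start();
    }

    @Override
    public FilterReply decide(ILoggingEvent event) {
        if (!isStarted()) {
            return FilterReply.NEUTRAL;
        }
        // DENY drops the event at this appender; NEUTRAL lets the rest of
        // the filter chain decide.
        return limiter.tryAcquire() ? FilterReply.NEUTRAL : FilterReply.DENY;
    }
}

It would still need to be attached to the relevant appender in logback.xml, which is part of the deploy/maintain overhead mentioned above.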

No useful logging was produced by the crashes, so I don't know if the OOM issue was due to decoding the inbound message stream, creating the outbound bulk update data for Elasticsearch, or a combination of both. Elasticsearch was not impacted, however, so it's unlikely that rate limiting the output plugin would have improved matters.

AFAIK we haven't seen this bug recur; tentatively resolving.