
Log input from cassandra caused logstash process to crash repeatedly
Closed, Invalid (Public)

Description

[05:21] <icinga-wm> PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash

Logs on logstash1003 showed that the process had died from heap exhaustion:

Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx##m to increase it (## = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

Network traffic to the logstash cluster was up, as was outbound traffic from the restbase cluster:

[Attached: graph-1.png, graph-2.png, graph-3.png]

Event Timeline

bd808 raised the priority of this task to Needs Triage.
bd808 updated the task description.
bd808 added a project: Wikimedia-Logstash.
bd808 subscribed.

The logstash process would crash each time it was restarted during the elevated traffic period. Once the traffic died down it was able to stay up.

That seems to coincide with a load test / scrub on the Cassandra staging cluster; the Cassandra logs on cerium show a ton of these:

INFO  [RMI TCP Connection(909)-10.64.16.147] 2016-02-11 05:58:49,561 ColumnFamilyStore.java:1929 - Spinning trying to capture all readers [
  SSTableReader(path='/srv/cassandra-a/data/local_group_wikipedia_T_parsoid_html/data-f3c38310f28b11e4852cad59d68785c5/local_group_wikipedia_T_parsoid_html-data-ka-50907-Data.db'),
  SSTableReader(path='/srv/cassandra-a/data/local_group_wikipedia_T_parsoid_html/data-f3c38310f28b11e4852cad59d68785c5/local_group_wikipedia_T_parsoid_html-data-ka-50842-Data.db'),
  SSTableReader(path='/srv/cassandra-a/data/local_group_wikipedia_T_parsoid_html/data-f3c38310f28b11e4852cad59d68785c5/local_group_wikipedia_T_parsoid_html-data-ka-50815-Data.db'),
  ...

These messages filled the log rotation:

-rw-r--r--  1 cassandra cassandra  1116039 Feb 11 05:58 system-a.log.21.zip
...
-rw-r--r--  1 cassandra cassandra  1116286 Feb 11 05:58 system-a.log.1.zip

cc @Eevans @GWicke @mobrovac

From the Logstash side, the part that is disturbing to me is that the service could be DoS'ed by one client.

This looks a lot like https://issues.apache.org/jira/browse/CASSANDRA-10829, which is fixed in 2.1.13.

Let's consider upgrading to 2.1.13, or possibly even 2.2.5.

> From the Logstash side, the part that is disturbing to me is that the service could be DoS'ed by one client.

Indeed. I wondered if Logstash could support rate limiting; it seems it does, with a recipe at https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/338#issuecomment-176521668 that combines the throttle and sleep filters (a sketch of that approach is below).
Of course, it might be the case that Logstash or the JVM was crashing simply upon encountering large payloads; that is much harder to rate limit on the Logstash side, so perhaps the producer side could also be rate limited.
Another observation: since this is all UDP, as a last-resort protection we could rate limit at the iptables level on a per-source IP/port basis.
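
A rough sketch of how that throttle-plus-sleep recipe could look as a Logstash filter block; the %{host} key, the counts, and the period below are illustrative placeholders rather than tested values:

filter {
  # Tag any single source that sends more than ~1000 events in a 60s window
  throttle {
    key         => "%{host}"
    after_count => 1000
    period      => "60"
    add_tag     => ["throttled"]
  }
  # When throttled events appear, slow the pipeline down instead of letting
  # it chew through the heap; a drop filter here would shed load instead
  if "throttled" in [tags] {
    sleep {
      time  => "1"
      every => 100
    }
  }
}

For UDP inputs, slowing the pipeline should push the overflow back into the kernel socket buffer, where excess datagrams get dropped instead of piling up on the Logstash heap.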

I did a quick search to see if logback (framework used by Cassandra) has any in-built rate-limiting functionality, and turned up nothing. We could implement a rate-limiter rather trivially though by banging together a Filter implementation and a Guava rate limiter. It wouldn't be much code, but of course it would be something else to maintain, deploy, etc.
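
A minimal sketch of what that could look like, assuming logback's Filter API and Guava's RateLimiter; the class name and the 100-events-per-second default are made up for illustration:

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.filter.Filter;
import ch.qos.logback.core.spi.FilterReply;
import com.google.common.util.concurrent.RateLimiter;

// Hypothetical appender filter: lets events through until the per-second
// budget is spent, then drops the rest.
public class RateLimitingFilter extends Filter<ILoggingEvent> {

    // Settable from logback.xml via <eventsPerSecond>...</eventsPerSecond>
    private double eventsPerSecond = 100.0;
    private RateLimiter limiter;

    public void setEventsPerSecond(double eventsPerSecond) {
        this.eventsPerSecond = eventsPerSecond;
    }

    @Override
    public void start() {
        limiter = RateLimiter.create(eventsPerSecond);
        super.start();
    }

    @Override
    public FilterReply decide(ILoggingEvent event) {
        if (!isStarted()) {
            return FilterReply.NEUTRAL;
        }
        // DENY drops the event at this appender; NEUTRAL lets the rest of
        // the filter chain decide.
        return limiter.tryAcquire() ? FilterReply.NEUTRAL : FilterReply.DENY;
    }
}

It would still need to be attached to the relevant appender in logback.xml, which is part of the deploy/maintain overhead mentioned above.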

No useful logging was produced by the crashes, so I don't know if the OOM issue was due to decoding the inbound message stream, creating the outbound bulk update data for Elasticsearch, or a combination of both. Elasticsearch was not impacted, however, so it's unlikely that rate limiting the output plugin would have improved matters.

AFAIK we haven't seen this bug recur; tentatively resolving.