
Have LogStash report per-channel log message rate to Graphite
Closed, Resolved · Public

Description

@bd808 observes that there exists a class of cluster issues that LogStash is especially good at disclosing -- namely, any issue that results in a spike in the volume of log messages sent on a particular channel. If the rate of log records per channel were forwarded to Graphite, we could easily set up anomaly-detection-based alerting. LogStash apparently has a Graphite sink that makes this straightforward to do: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-graphite.html
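For illustration, the Graphite output plugin linked above takes a `metrics` hash mapping metric names to values, so a per-channel count could in principle be expressed with a field interpolation. A rough sketch, assuming a hypothetical Graphite endpoint and a `channel` field on each event (neither is specified in this task):

```
output {
  graphite {
    host    => "graphite.example.org"   # hypothetical endpoint
    port    => 2003
    # Emit a value of 1 per event under a per-channel metric name.
    metrics => { "logstash.rate.%{channel}" => "1" }
  }
}
```

One caveat with this direct approach: carbon stores the last value received per interval rather than summing raw datapoints, so per-event increments need an aggregating layer (statsd/statsite) in front of Graphite, which is the direction the discussion below takes.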

Event Timeline

ori created this task. May 29 2015, 12:10 AM
ori raised the priority of this task from to Needs Triage.
ori updated the task description.
ori added a project: Wikimedia-Logstash.
ori added subscribers: ori, bd808.
Restricted Application added a subscriber: Aklapper. May 29 2015, 12:10 AM
bd808 triaged this task as Normal priority. Jun 5 2015, 9:41 PM
ori assigned this task to bd808. Jul 16 2015, 9:31 PM
ori set Security to None.
hashar added a subscriber: hashar. Jul 16 2015, 9:42 PM

Change 230233 had a related patch set uploaded (by BryanDavis):
logstash: Count MediaWiki log events with statsd

https://gerrit.wikimedia.org/r/230233
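The patch counts MediaWiki log events via Logstash's statsd output plugin. A minimal sketch of that shape, assuming `channel` and `level` fields on MediaWiki events and a hypothetical statsd endpoint (note the plugin also composes a `sender` segment into the final metric name unless overridden, which is elided here):

```
output {
  if [type] == "mediawiki" {
    statsd {
      host      => "statsd.example.org"   # hypothetical aggregator
      port      => 8125
      namespace => "logstash"
      # One counter increment per log event, keyed by channel and level.
      increment => [ "rate.mediawiki.%{channel}.%{level}" ]
    }
  }
}
```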

bd808 moved this task from Needs Review/Feedback to In Dev/Progress on the User-bd808 board.

On the gerrit patch @fgiunchedi wrote:

I'm assuming the statsd ruby client underneath will send a statsd sample for each line? Depending on the volume it might pose problems; see also T89857#1519939

@bd808 responded:

Based on what I'm seeing in logstash for the last 48 hours, I think we can expect this to record 750-900 events per second. Looking at the plugin itself (https://github.com/logstash-plugins/logstash-output-statsd/blob/master/lib/logstash/outputs/statsd.rb), I would guess that these events are sent individually to the statsd endpoint (no batching).
The metrics being generated here are unique to Logstash, so we could aggregate them with a dedicated statsd endpoint. They will be coming from three Logstash frontends, so we do need a single point of aggregation rather than allowing each Logstash host to have its own local statsd. That could be changed by inserting an identifier for the collecting Logstash host into the metric names ("logstash.$HOST.rate.mediawiki.$CHANNEL.$LEVEL") at a cost of 3x data storage and slightly more complicated queries to generate graphs from the aggregate data.
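The per-host naming variant described above could be expressed through the statsd plugin's `sender` setting, which is interpolated into the metric name between the namespace and the metric. A hypothetical sketch (the static host identifier and field names are assumptions, not from the patch):

```
statsd {
  namespace => "logstash"
  sender    => "logstash1001"   # per-frontend identifier, giving logstash.$HOST.rate.mediawiki.$CHANNEL.$LEVEL
  increment => [ "rate.mediawiki.%{channel}.%{level}" ]
}
```

As the comment notes, the tradeoff is roughly 3x the metric storage plus wildcard queries to reassemble the aggregate.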

I've been thinking about this a bit more and have an idea that might not suck:

On each host that is running Logstash we could also provision a statsite service configured to flush every 1s (or 10s) using a flush script that knows how to relay data to another statsite service. These would be pointed at the central statsdlb/statsite service. Since the metrics we are looking to record here are simple counts, having multiple layers of aggregation won't hurt anything.
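The reason multiple layers of aggregation are safe here is that counter aggregation is associative: summing per-host sums gives the same totals as summing raw events. A small illustrative sketch (metric names and counts are invented for the example):

```python
from collections import Counter

def aggregate(counters):
    """Sum per-metric counts from several sources (one statsite layer)."""
    total = Counter()
    for c in counters:
        total.update(c)  # Counter.update adds counts rather than replacing them
    return total

# Per-host statsite instances each count events locally over a flush window.
host1 = Counter({"rate.mediawiki.memcached.ERROR": 40,
                 "rate.mediawiki.exception.ERROR": 5})
host2 = Counter({"rate.mediawiki.memcached.ERROR": 35})
host3 = Counter({"rate.mediawiki.exception.ERROR": 7})

# First layer: each host flushes to the central statsite, which sums again.
central = aggregate([host1, host2, host3])
assert central["rate.mediawiki.memcached.ERROR"] == 75

# Grouping hosts differently (i.e. adding an intermediate layer) yields the
# same totals, so extra aggregation layers don't distort the counts.
assert aggregate([aggregate([host1, host2]), host3]) == central
```

This would not hold for statistics like percentiles, but for the simple counts in question, layered flushing is lossless.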

@fgiunchedi said on the patch that 1k/s wasn't too horrible an event rate, so no fancy gymnastics will be required \o/

Change 230233 merged by Filippo Giunchedi:
logstash: Count MediaWiki log events with statsd

https://gerrit.wikimedia.org/r/230233

Change 231704 had a related patch set uploaded (by BryanDavis):
Add icinga alert for anomalous logstash.rate.mediawiki.memcached.ERROR.count

https://gerrit.wikimedia.org/r/231704

bd808 closed this task as Resolved. Aug 14 2015, 10:38 PM
bd808 moved this task from In Dev/Progress to Archive on the Wikimedia-Logstash board.
bd808 moved this task from In Dev/Progress to Done on the User-bd808 board.
bd808 moved this task from Done to Archive on the User-bd808 board. Nov 4 2015, 4:41 AM