
Have Logstash report per-channel log message rate to Graphite
Closed, Resolved · Public

Description

@bd808 observes that there exists a class of cluster issues that Logstash is especially good at disclosing -- namely, any issue that results in a spike in the volume of log messages sent on a particular channel. If the rate of log records per channel were forwarded to Graphite, we could easily set up anomaly-detection-based alerting. Logstash has a Graphite output plugin that makes this straightforward to do: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-graphite.html
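For illustration, a minimal sketch of what such a Graphite output might look like in a Logstash pipeline. The endpoint hostname and the event field names (`channel`, `level`) are assumptions, not taken from the actual deployment; note also that plain Graphite stores the last value seen per interval, so turning per-event samples into rates still needs an aggregator such as statsd in front (which is where the discussion below ends up):

```
# Hypothetical sketch using the logstash-output-graphite plugin.
# "graphite.example.org" and the %{channel}/%{level} fields are assumed.
output {
  graphite {
    host    => "graphite.example.org"
    port    => 2003
    metrics => { "logstash.rate.%{channel}.%{level}" => "1" }
  }
}
```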

Related Objects

Event Timeline

ori raised the priority of this task from to Needs Triage.
ori updated the task description. (Show Details)
ori added a project: Wikimedia-Logstash.
ori added subscribers: ori, bd808.
ori set Security to None.

Change 230233 had a related patch set uploaded (by BryanDavis):
logstash: Count MediaWiki log events with statsd

https://gerrit.wikimedia.org/r/230233

On the gerrit patch @fgiunchedi wrote:

I'm assuming the statsd ruby client underneath will send a statsd sample for each line? Depending on the volume it might pose problems; see also T89857#1519939.

@bd808 responded:

Based on what I'm seeing in logstash for the last 48 hours I think we can expect this to record 750-900 events per second. Looking at the plugin itself (https://github.com/logstash-plugins/logstash-output-statsd/blob/master/lib/logstash/outputs/statsd.rb), I would guess that these events are sent individually to the statsd endpoint (no batching).

The metrics being generated here are unique to Logstash so we could aggregate them with a dedicated statsd endpoint. They will be coming from three Logstash frontends so we do need a single point of aggregation rather than allowing each Logstash host to have its own local statsd. That could be changed by inserting an identifier for the collecting Logstash host into the metric names ("logstash.$HOST.rate.mediawiki.$CHANNEL.$LEVEL") at a cost of 3x data storage and slightly more complicated queries to generate graphs from the aggregate data.
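Based on the metric name quoted above, a hedged sketch of what the logstash-output-statsd configuration from the patch might look like. The statsd host is an assumption; `increment` and `namespace` are real settings of the plugin, and overriding `sender` (which defaults to the event's `%{host}`) is how per-host metric names like `logstash.$HOST.rate....` would be avoided in favour of a single aggregate namespace:

```
# Hypothetical sketch of the statsd counter output; host and field
# names (%{channel}, %{level}) are assumed, not from the actual patch.
output {
  statsd {
    host      => "statsd.example.net"
    namespace => "logstash"
    sender    => "rate"    # fixed sender, so metrics aggregate across frontends
    increment => [ "mediawiki.%{channel}.%{level}" ]
  }
}
```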

I've been thinking about this a bit more and have an idea that might not suck:

On each host that is running Logstash we could also provision a statsite service configured to flush every 1s (or 10s) using a flush script that knows how to relay data to another statsite service. These would be pointed at the central statsdlb/statsite service. Since the metrics we are looking to record here are simple counts, having multiple layers of aggregation won't hurt anything.
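A rough sketch of the per-host statsite configuration this describes. `port`, `flush_interval`, and `stream_cmd` are real statsite settings; the relay script path is hypothetical, standing in for whatever would forward flushed counts to the central statsdlb/statsite service:

```
# Hypothetical local statsite relay on each Logstash host.
# The relay script is assumed; it would re-emit flushed counters
# as statsd packets toward the central statsdlb/statsite service.
[statsite]
port           = 8125
flush_interval = 10
stream_cmd     = /usr/local/bin/statsite-relay
```

Because counters sum cleanly, flushing local aggregates into a second statsite layer yields the same totals as sending every sample to the central endpoint, at a fraction of the packet rate.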

@fgiunchedi said on the patch that 1k/s wasn't too horrible an event rate, so no fancy gymnastics will be required \o/

Change 230233 merged by Filippo Giunchedi:
logstash: Count MediaWiki log events with statsd

https://gerrit.wikimedia.org/r/230233

Change 231704 had a related patch set uploaded (by BryanDavis):
Add icinga alert for anomalous logstash.rate.mediawiki.memcached.ERROR.count

https://gerrit.wikimedia.org/r/231704

bd808 moved this task from In Dev/Progress to Archive on the Wikimedia-Logstash board.
bd808 moved this task from In Dev/Progress to Done on the User-bd808 board.
Krinkle renamed this task from "Have LogStash report per-channel log message rate to Graphite" to "Have Logstash report per-channel log message rate to Graphite". Mar 16 2020, 7:02 PM
Krinkle updated the task description. (Show Details)
Krinkle removed a subscriber: wikibugs-l-list.