
Monitor for anomalies/spikes in read failures of memcached
Closed, Resolved · Public

Description

(Came out of https://wikitech.wikimedia.org/wiki/Incident_documentation/20140517-bits )

Discussion:

Timo: In retrospect we saw that we had data in logstash clearly indicating a massive increase in read failures from this memcached instance (basically from < 1% to nearly 100%). This could and should be monitored by icinga and reported to ops automatically. It would've helped us catch it much earlier.

Chat from irc on 2014-07-09:
[10:16] <Krinkle> For something that is logged in logstash (e.g. memcached errors). What is the strategy you'd typically take to monitor it in icinga? Is there a step in between or would you actually have icinga use logstash?
[10:16] <Krinkle> I think the latter should be possible for more complex queries or aggregated data. Though I reckon in case of memcached there's probably a more direct approach possible.
[10:18] <bd808> Good question. Logstash by itself can do point-in-time monitoring, but it really has no useful way to alert on trends itself.
[10:19] <Krinkle> I think most critical things should probably be polled by icinga directly. But on multiple occasions I have used logstash to quite easily pinpoint where an error came from. And it'd be useful to have those trends also result in pings to ops (perhaps not as critical via text, but at least an irc ping would be useful).
[10:20] <bd808> One way™ to do it would be to graph trends in graphite driven by counts made by logstash and alert with icinga when the trend does something.
[10:20] <Krinkle> Right now logstash is mostly polling and digging manually, after the fact. That's immensely useful and it's good at that. But I think it has more potential.
[10:20] <Krinkle> Ah, I see. So it'd go to graphite after logstash. Interesting.
[10:20] <bd808> We aren't doing it now, but logstash can feed graphite in a statsd fashion
[10:20] <Krinkle> Right.
[10:21] <Krinkle> For some reason I thought they might also be able to feed graphite from the source that feeds logstash.
[10:21] <Krinkle> guess that's still possible, unless the source is distributed (or if the query is more advanced). In which case using logstash in between makes sense
[10:22] bd808 nods
[10:22] <Krinkle> cool
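
For reference, the "feed graphite in a statsd fashion" idea bd808 describes boils down to incrementing a statsd counter for every matching log event; statsd then aggregates the counts per flush interval and forwards them to graphite. A minimal Python sketch of that counting step follows; the host, port, and metric name are illustrative placeholders (the real pipeline uses logstash's statsd output rather than a hand-rolled script, and the graphite path referenced later in this task ends up as logstash.rate.mediawiki.memcached.ERROR.count):

```
import socket

# Assumptions: placeholder statsd host and metric name, not production values.
STATSD_HOST = "statsd.example.org"
STATSD_PORT = 8125                  # statsd's conventional UDP port
METRIC = "mediawiki.memcached.ERROR"

def record_error(count=1):
    """Send a statsd counter increment ("name:N|c"); statsd aggregates these
    and writes the per-interval totals to graphite."""
    payload = "{}:{}|c".format(METRIC, count)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (STATSD_HOST, STATSD_PORT))
    finally:
        sock.close()

# e.g. call record_error() once per memcached ERROR event seen in the logs
```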


Version: wmf-deployment
Severity: normal

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:35 AM
bzimport set Reference to bz67817.
bzimport added a subscriber: Unknown Object (MLST).
Krinkle set Security to None.
Krinkle added a project: Sustainability.
Krinkle edited subscribers, added: ori; removed: Unknown Object (MLST).

Change 231704 had a related patch set uploaded (by BryanDavis):
Add icinga alert for anomalous logstash.rate.mediawiki.memcached.ERROR.count

https://gerrit.wikimedia.org/r/231704

Change 231704 merged by Ori.livneh:
Add icinga alert for anomalous logstash.rate.mediawiki.memcached.ERROR.count

https://gerrit.wikimedia.org/r/231704

https://gerrit.wikimedia.org/r/231704 adds an icinga check for the rate of memcached errors reported to graphite via logstash by MediaWiki.
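
The merged check compares the error-rate metric against its Holt-Winters confidence bands in graphite. The actual alert is defined in the puppet patch above; as a rough illustration of the idea only, here is a Python sketch that pulls the metric and its bands from graphite's render API and flags recent datapoints sitting above the upper band. The graphite URL and the "more than half of recent points breach the band" rule are assumptions, not the patch's exact parameters:

```
import json
import urllib.request
from urllib.parse import quote

GRAPHITE = "https://graphite.example.org"   # assumption: placeholder URL
METRIC = "logstash.rate.mediawiki.memcached.ERROR.count"

def fetch(target, minutes=10):
    """Fetch one graphite target as JSON for the last N minutes."""
    url = "{}/render?target={}&from=-{}min&format=json".format(
        GRAPHITE, quote(target), minutes)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def anomalous(fraction=0.5):
    """True if more than `fraction` of recent datapoints sit above the
    Holt-Winters upper confidence band (an assumed rule of thumb)."""
    series = fetch(METRIC)[0]["datapoints"]
    bands = fetch("holtWintersConfidenceBands({})".format(METRIC))
    upper = next(s for s in bands
                 if s["target"].startswith("holtWintersConfidenceUpper"))
    upper_points = {ts: v for v, ts in upper["datapoints"]}
    breaches = sum(
        1 for v, ts in series
        if v is not None and upper_points.get(ts) is not None
        and v > upper_points[ts]
    )
    samples = sum(1 for v, _ in series if v is not None)
    return samples > 0 and breaches / samples > fraction

if __name__ == "__main__":
    print("ALERT" if anomalous() else "OK")
```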

Updated postmortem page and announced on ops-l and engineering-l:

I submitted a puppet patch [0] which Ori merged that creates an icinga
alert when the rate of memcached errors seen by MediaWiki exceeds the
Holt-Winters forecast for the metric. This was asked for as a
postmortem outcome [1] for a bits outage in May [2].

If the alert goes off, it won't tell you exactly what is wrong
unfortunately, but we have a Kibana memcached dashboard [3] that will
usually show the host(s) having problems pretty clearly via the "Top
Hosts" table. Typically the problem is a nutcracker instance on the
host that has gone bananas. Restarting nutcracker almost always clears
the error.

Ori has asked that the next host we see the broken nutcracker problem
on be de-pooled and left in the bad state so that he or someone else
can poke around with gdb and see if the root cause for the nutcracker
failures can be tracked down and fixed once and for all.

[0]: https://gerrit.wikimedia.org/r/#/c/231704/
[1]: https://phabricator.wikimedia.org/T69817
[2]: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140517-bits
[3]: https://logstash.wikimedia.org/#/dashboard/elasticsearch/memcached
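
The "Top Hosts" table on the Kibana dashboard [3] is essentially a terms aggregation over recent memcached error events, bucketed by host. A hedged sketch of an equivalent query against the logstash Elasticsearch backend; the endpoint, index pattern, and the `channel`/`level`/`host` field names are assumptions about the log schema, not the dashboard's saved query:

```
import json
import urllib.request

# Assumptions: placeholder ES endpoint/index and field names; adjust to the
# real logstash mapping.
ES_URL = "https://logstash-es.example.org:9200/logstash-*/_search"

query = {
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {"term": {"channel": "memcached"}},
                {"term": {"level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    # Equivalent of the dashboard's "Top Hosts" table: bucket errors by host.
    "aggs": {"top_hosts": {"terms": {"field": "host", "size": 10}}},
}

req = urllib.request.Request(
    ES_URL,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

for bucket in result["aggregations"]["top_hosts"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```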

bd808 moved this task from To Do to Done on the User-bd808 board.

The nutcracker process on mw1142 went nuts around 2015-08-20T03:02, with the error rate shooting up from ~150/min to ~20K/min. It stayed elevated until nutcracker was restarted around 2015-08-20T12:48. This alert didn't trip, however.

Looking at the data in graphite, it seems that the Holt-Winters bands rose along with the error rate. Changing the alert to a plain threshold may be better.
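
That behaviour is inherent to exponential smoothing: once the elevated rate persists, the forecast (and any band derived from it) chases the new level, and the "anomaly" disappears. A toy demonstration using plain Holt (double exponential) smoothing; this is a simplification of graphite's seasonal Holt-Winters functions, but it shows the same effect:

```
def holt_forecast(series, alpha=0.3, beta=0.1):
    """One-step-ahead forecasts from Holt's double exponential smoothing.
    (Simplified: graphite's holtWinters* functions add a seasonal term.)"""
    level, trend = float(series[0]), 0.0
    forecasts = [level]
    for y in series[1:]:
        forecasts.append(level + trend)
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return forecasts

# Error rate per minute: ~150/min baseline, then a sustained jump to
# ~20000/min, roughly mirroring the mw1142 incident described above.
rate = [150] * 30 + [20000] * 30
fc = holt_forecast(rate)

# Within a handful of samples after the jump the forecast has climbed most of
# the way to 20000, so a band centred on it stops flagging the elevated rate.
for i in range(28, 40):
    print("t=%2d  rate=%6d  forecast=%8.0f" % (i, rate[i], fc[i]))
```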

Change 233071 had a related patch set uploaded (by BryanDavis):
Change memcached icinga alert from anomaly to threshold

https://gerrit.wikimedia.org/r/233071

Change 233071 merged by Filippo Giunchedi:
Change memcached icinga alert from anomaly to threshold

https://gerrit.wikimedia.org/r/233071

The check was changed to look at the last 5 minutes of graphite samples (logic sketched below) and:

  • warn if 40% or more of the samples are >= 1000
  • critical if 40% or more of the samples are >= 5000
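
The replacement is a plain graphite-threshold check rather than an anomaly check. A minimal Python sketch of the logic just described; the graphite URL is a placeholder and the real check's exact parameters are defined in the puppet patch above, not reproduced here:

```
import json
import urllib.request
from urllib.parse import quote

GRAPHITE = "https://graphite.example.org"   # assumption: placeholder URL
METRIC = "logstash.rate.mediawiki.memcached.ERROR.count"
WARN, CRIT, FRACTION = 1000, 5000, 0.4      # thresholds from the check above

def recent_samples(minutes=5):
    """Non-null datapoints for the metric over the last N minutes."""
    url = "{}/render?target={}&from=-{}min&format=json".format(
        GRAPHITE, quote(METRIC), minutes)
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [v for v, _ts in data[0]["datapoints"] if v is not None]

def check():
    samples = recent_samples()
    if not samples:
        return "UNKNOWN"
    over = lambda limit: sum(1 for v in samples if v >= limit) / len(samples)
    if over(CRIT) >= FRACTION:
        return "CRITICAL"
    if over(WARN) >= FRACTION:
        return "WARNING"
    return "OK"

if __name__ == "__main__":
    print(check())
```
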
bd808 moved this task from Done to Archive on the User-bd808 board.