Alert on logstash index failures on too many fields
Closed, Resolved · Public

Description

This is a follow-up to T234564: Logstash discards messages from MediaWiki if they contain uncommon keys in the $context array. Specifically, it is about getting alerted when we hit Elasticsearch's per-index field limit, which in turn usually indicates a "fields explosion" problem.
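
For illustration only, here is a minimal Python sketch of the kind of counting the alert relies on: scanning log lines for Elasticsearch's "Limit of total fields" rejection and tallying hits per index. The surrounding log line layout, the sample index name, and the function name `count_field_limit_failures` are assumptions made for this example. The patches on this task use an mtail program rather than Python, and that program is not reproduced here.

```python
import re
from collections import Counter

# Matches the Elasticsearch rejection text for the per-index field limit.
# The text around it in a real logstash log line is assumed, so we only search for this fragment.
FIELD_LIMIT_RE = re.compile(
    r"Limit of total fields \[(?P<limit>\d+)\] in index \[(?P<index>[^\]]+)\] has been exceeded"
)


def count_field_limit_failures(lines):
    """Count field-limit indexing failures per index name."""
    counts = Counter()
    for line in lines:
        match = FIELD_LIMIT_RE.search(line)
        if match:
            counts[match.group("index")] += 1
    return counts


# Hypothetical sample lines; the index name and the limit of 1000 are illustrative.
sample = [
    'WARN "reason"=>"Limit of total fields [1000] in index [logstash-2019.10.24] has been exceeded"',
    "INFO pipeline started",
    'WARN "reason"=>"Limit of total fields [1000] in index [logstash-2019.10.24] has been exceeded"',
]

if __name__ == "__main__":
    for index, failures in count_field_limit_failures(sample).items():
        print(index, failures)
```

Keeping the count per index makes it easier to tell which index is affected by the fields explosion.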

Details

Related Gerrit Patches:
[operations/puppet@production] prometheus: lower threshold for logstash indexing failures
[operations/puppet@production] logstash: move ingestion alerts to be site-local
[operations/puppet@production] logstash: alert on indexing failures
[operations/puppet@production] mtail: export logstash ES index failure details
[operations/puppet@production] prometheus: collect logstash mtail metrics
[operations/puppet@production] hieradata: fix mtail::logs location for logstash role
[operations/puppet@production] profile: add mtail to logstash
[operations/puppet@production] mtail: add logstash program

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 24 2019, 8:04 AM
fgiunchedi moved this task from Backlog to Up next on the observability board. Oct 28 2019, 2:16 PM
herron added a subscriber: herron. Oct 28 2019, 7:31 PM
fgiunchedi moved this task from Backlog to Up next on the User-fgiunchedi board. Oct 31 2019, 12:27 PM
fgiunchedi moved this task from Up next to Doing on the User-fgiunchedi board. Nov 4 2019, 1:42 PM

Change 548280 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mtail: add logstash program

https://gerrit.wikimedia.org/r/548280

Change 548281 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: add mtail to logstash

https://gerrit.wikimedia.org/r/548281

Change 548280 merged by Filippo Giunchedi:
[operations/puppet@production] mtail: add logstash program

https://gerrit.wikimedia.org/r/548280

Change 548281 merged by Filippo Giunchedi:
[operations/puppet@production] profile: add mtail to logstash

https://gerrit.wikimedia.org/r/548281

Change 548975 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: fix mtail::logs location for logstash role

https://gerrit.wikimedia.org/r/548975

Change 548975 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: fix mtail::logs location for logstash role

https://gerrit.wikimedia.org/r/548975

Change 550446 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: collect logstash mtail metrics

https://gerrit.wikimedia.org/r/550446

Change 550446 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: collect logstash mtail metrics

https://gerrit.wikimedia.org/r/550446

Change 550471 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: alert on indexing failures

https://gerrit.wikimedia.org/r/550471

Change 550640 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mtail: export logstash ES index failure details

https://gerrit.wikimedia.org/r/550640

Change 550640 merged by Filippo Giunchedi:
[operations/puppet@production] mtail: export logstash ES index failure details

https://gerrit.wikimedia.org/r/550640

Change 550471 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: alert on indexing failures

https://gerrit.wikimedia.org/r/550471

fgiunchedi closed this task as Resolved. Wed, Nov 13, 11:22 AM
fgiunchedi claimed this task.

This is completed: surges of indexing errors will now result in an alert. Unfortunately the thresholds are a little higher than I expected because of the background noise of errors/conflicts (tracked in T238196: Logging fields conflicts (tracking)).
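
To make the threshold trade-off concrete, here is a rough Python sketch of a rate-over-threshold check on a failure counter. The sample values, the threshold of 2.0 failures per second, and the function names are all hypothetical. The actual alert is defined in the puppet patches on this task and its real thresholds are not shown here.

```python
def failure_rate(samples):
    """Average per-second increase across (timestamp_seconds, counter_value) samples."""
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return 0.0
    # A counter that went backwards suggests a restart; report a rate of zero.
    return max(v1 - v0, 0) / elapsed


def should_alert(samples, threshold_per_second):
    """True when the failure rate over the window exceeds the threshold."""
    return failure_rate(samples) > threshold_per_second


# Hypothetical counter samples over a two-minute window.
background = [(0, 100), (60, 130), (120, 160)]  # steady noise of errors/conflicts
surge = [(0, 100), (60, 400), (120, 1300)]      # sudden burst of indexing failures

if __name__ == "__main__":
    print(should_alert(background, 2.0), should_alert(surge, 2.0))
```

The threshold has to sit above the steady background rate, otherwise the alert would fire continuously. That is why reducing the background conflicts would allow a lower threshold.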

Change 550678 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: move ingestion alerts to be site-local

https://gerrit.wikimedia.org/r/550678

Change 550678 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: move ingestion alerts to be site-local

https://gerrit.wikimedia.org/r/550678

Change 552492 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: lower threshold for logstash indexing failures

https://gerrit.wikimedia.org/r/552492

Change 552492 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: lower threshold for logstash indexing failures

https://gerrit.wikimedia.org/r/552492